Closed axilleas closed 6 years ago
cc @s-pace @ramosmd
👋 @axilleas,
This is not a bug but expected behaviour, since one of the stop_urls is:
.*^(?!.*html)
Pages must have a .html extension in order to be scraped.
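To make the effect of that stop pattern concrete, here is a minimal sketch in Python. It assumes the crawler applies each stop_urls entry with regex search semantics (as `re.search` does); that is an assumption about the scraper, not something stated in this thread. The helper name `is_stopped` is hypothetical.

```python
import re

# The stop pattern quoted above. Without re.MULTILINE, "^" only matches at
# position 0, so the pattern matches exactly when "html" appears nowhere
# in the URL.
STOP_PATTERN = re.compile(r".*^(?!.*html)")


def is_stopped(url: str) -> bool:
    """Return True if the URL would be excluded by the stop pattern.

    Hypothetical helper: assumes search semantics for stop_urls.
    """
    return STOP_PATTERN.search(url) is not None


if __name__ == "__main__":
    for url in (
        "https://docs.gitlab.com/ee/user/group/saml_sso/",
        "https://docs.gitlab.com/ee/development/i18n/index.html",
    ):
        print(url, "stopped" if is_stopped(url) else "crawled")
```

Under that assumption, the clean URL is stopped and the index.html URL is crawled, which matches the behaviour reported in the issue.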
I must have set this one because of duplicates. I will try to remove it, but I am pretty sure we will have issues like <URL page root> displaying the same content as <URL page root>index.html.
It did introduce duplicates:
https://docs.gitlab.com/ee/development/i18n/ & https://docs.gitlab.com/ee/development/i18n/index.html
Could you avoid this?
@s-pace ah ok! I'll see if I can fix the sitemap to append the index.html then, thanks.
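For illustration only, appending index.html to clean sitemap URLs could look like the sketch below. It is written in Python for brevity and is not the actual GitLab docs sitemap code; the function name `with_index_html` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def with_index_html(url: str) -> str:
    """Append index.html to URLs whose path ends in a slash (or is empty).

    Hypothetical helper; URLs that already name a file are left unchanged.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return urlunsplit(parts._replace(path=path))


if __name__ == "__main__":
    pair = [
        "https://docs.gitlab.com/ee/development/i18n/",
        "https://docs.gitlab.com/ee/development/i18n/index.html",
    ]
    # Both spellings collapse to a single canonical URL.
    print({with_index_html(u) for u in pair})
```

Normalizing both spellings to the same URL also gives a simple key for detecting the duplicate pairs mentioned above.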
Nice. Having a redirection status instead would help too.
Do you want to request a feature or report a bug?
Bug.
If it is a DocSearch index issue, what is the related index_name?
index_name = gitlab.json
What is the current behaviour?
The sitemap contains clean URLs (no index.html extension), and the crawler doesn't pick them up. You can see for example that https://docs.gitlab.com/ee/user/group/saml_sso/ is in the sitemap, but it's not indexed.
What is the expected behaviour?
Clean URLs should be indexed as well.
What have you tried to solve it?
Nothing yet.
Any quick clues?
sitemap_urls_regexs is not defined, so the start_urls should be used as a pattern.
Maybe the stop regex has something to do with it? https://github.com/algolia/docsearch-configs/blob/f6933d4610515b21eccc00b957a5d75bf4e8dac2/configs/gitlab.json#L43