algolia / docsearch-configs

DocSearch - Configurations
https://docsearch.algolia.com/
MIT License

Clean URLs are not indexed #429

Closed axilleas closed 6 years ago

axilleas commented 6 years ago

Do you want to request a feature or report a bug?

Bug.

If it is a DocSearch index issue, what is the related index_name?

index_name= gitlab.json

What is the current behaviour?

The sitemap contains clean URLs (no index.html suffix), and the crawler doesn't pick them up. For example, https://docs.gitlab.com/ee/user/group/saml_sso/ is in the sitemap, but it is not indexed.

What is the expected behaviour?

Clean URLs should be indexed as well.

What have you tried to solve it?

Nothing yet.

Any quick clues?

sitemap_urls_regexs is not defined, so the start_urls should be used as a pattern.

Maybe the stop regex has something to do with it? https://github.com/algolia/docsearch-configs/blob/f6933d4610515b21eccc00b957a5d75bf4e8dac2/configs/gitlab.json#L43

axilleas commented 6 years ago

cc @s-pace @ramosmd

s-pace commented 6 years ago

👋 @axilleas,

This is not a bug; it is expected, since one of the stop_urls is .*^(?!.*html).

Pages must have a .html extension in order to be scraped.

I must have set this one because of duplicates. I will try to remove it, but I am pretty sure we will have issues like: <URL page root> displaying the same content as <URL page root>index.html
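
To see why the stop pattern quoted above skips clean URLs, here is a minimal Python sketch. It assumes the crawler tests each URL with search semantics (`re.search`); that detail is an assumption and not confirmed in this thread. The helper name `is_stopped` is hypothetical.

```python
import re

# Stop pattern copied verbatim from the comment above.
# "^" can only match at position 0, so the leading ".*" must match nothing;
# the negative lookahead then rejects any URL that contains "html".
STOP_PATTERN = re.compile(r".*^(?!.*html)")


def is_stopped(url: str) -> bool:
    """Return True if the URL matches the stop pattern.

    Assumption: the crawler applies stop_urls with re.search semantics.
    """
    return STOP_PATTERN.search(url) is not None


if __name__ == "__main__":
    for url in (
        "https://docs.gitlab.com/ee/user/group/saml_sso/",
        "https://docs.gitlab.com/ee/development/i18n/index.html",
    ):
        print(url, "->", "skipped" if is_stopped(url) else "crawled")
```

Under that assumption, any URL that does not contain the substring html matches the stop pattern and is skipped, which is consistent with the clean URL from the report not being indexed.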

s-pace commented 6 years ago

It did introduce duplicates:

https://docs.gitlab.com/ee/development/i18n/ & https://docs.gitlab.com/ee/development/i18n/index.html
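
One way to spot pairs like this in a list of crawled URLs is to group them under a shared key. The sketch below is illustrative only and is not part of DocSearch: `canonical_key` and `find_duplicates` are hypothetical helpers, and they assume that a directory URL and its `index.html` variant always serve the same page.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def canonical_key(url: str) -> str:
    """Map ".../dir/index.html" and ".../dir/" to the same key.

    Assumption: both forms serve identical content.
    """
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/index.html"):
        # Drop the trailing "index.html" but keep the slash.
        path = path[: -len("index.html")]
    # The fragment is discarded because it does not change the page.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def find_duplicates(urls):
    """Return {canonical key: [urls]} for keys reached by more than one URL."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_key(url)].append(url)
    return {key: group for key, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    crawled = [
        "https://docs.gitlab.com/ee/development/i18n/",
        "https://docs.gitlab.com/ee/development/i18n/index.html",
        "https://docs.gitlab.com/ee/user/group/saml_sso/",
    ]
    for key, group in find_duplicates(crawled).items():
        print(key, "<-", group)
```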

s-pace commented 6 years ago

Could you avoid this?

axilleas commented 6 years ago

@s-pace ah ok! I'll see if I can fix the sitemap to append the index.html then, thanks.
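
A minimal sketch of the sitemap-side fix mentioned here: appending `index.html` to clean directory URLs before they are written to the sitemap. How the GitLab docs sitemap is actually generated is not shown in this thread, so `with_index_html` is a hypothetical helper and not the real implementation.

```python
from urllib.parse import urlsplit, urlunsplit


def with_index_html(url: str) -> str:
    """Append "index.html" to URLs whose path ends with a slash.

    URLs that already point at a file (for example ".../index.html")
    are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.path.endswith("/"):
        parts = parts._replace(path=parts.path + "index.html")
    return urlunsplit(parts)


if __name__ == "__main__":
    print(with_index_html("https://docs.gitlab.com/ee/user/group/saml_sso/"))
```

The function leaves URLs that already end in a file name untouched, so running it twice over the same sitemap entry produces the same result.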

s-pace commented 6 years ago

Nice. Having a redirection status instead would help too.