algolia / docsearch-scraper

DocSearch - Scraper
https://docsearch.algolia.com/

Crawler isn't following links #549

Closed lorensr closed 3 years ago

lorensr commented 3 years ago

I'm using the docker container and this config:

https://github.com/GraphQLGuide/book/blob/411bb46629b622a785312f199053e5c55234608d/docsearch.json

When I run the docker command, I get 303 "nb hits", but they all point to different anchors on the start_url page. None of them are for the other pages linked from the start_url page (https://graphql.guide/contents).

lorensr commented 3 years ago

Posting $50 bounty: https://www.bountysource.com/issues/97706372-crawler-isn-t-following-links

shortcuts commented 3 years ago

Hi @lorensr,

The start_urls act more as "a pattern of URLs the crawler should accept" than as "the URL to start from", which is why the other pages are not crawled.

As the contents route doesn't have children, nothing else is found, but if you try with "start_urls": ["https://graphql.guide/vue"], it will work.

One way to solve this could be to create a sitemap.xml just for the crawler, so it can follow all the pages listed in it (doc). Alternatively, use a more generic "start_urls": ["https://graphql.guide/"], for example.
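
For reference, a minimal sketch of what such a config might look like, combining the generic start URL with an optional sitemap. The index_name value, the sitemap location, and the selectors are placeholders assumed for illustration; they are not taken from the linked docsearch.json, and the site may not actually serve a sitemap at that path.

```json
{
  "index_name": "example_index",
  "start_urls": ["https://graphql.guide/"],
  "sitemap_urls": ["https://graphql.guide/sitemap.xml"],
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "text": "p"
  }
}
```

With "https://graphql.guide/" as the start URL, pages such as /contents and /vue fall under the accepted pattern, so links found on them should be followed. The sitemap_urls entry is only needed if you go the sitemap route.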

lorensr commented 3 years ago

Thank you so much! Generic solution worked great ☺️

shortcuts commented 3 years ago

No worries, feel free to close the bounty or give it to a charity of your choice :D

cybersaksham commented 1 year ago

Hey @shortcuts, please take a look at my problem in #571. It is not generating hits for child routes, for example /docs. And when I enter the complete URL with the /docs route, it reports an ignored start URL.