How to de-dup pages? - Githubissues

algolia / docsearch-scraper

DocSearch - Scraper

https://docsearch.algolia.com/

How to de-dup pages? #550

Closed lorensr closed 3 years ago

lorensr commented 3 years ago

When I configure the scraper with "start_urls": ["https://graphql.guide/preface"], it picks up /preface and /preface/ as separate pages and includes them separately in search results, even though they are the same page. How can I de-duplicate them?

> docker run -it --env-file=.env.docsearch -e "CONFIG=$(cat docsearch.json | jq -r tostring)" algolia/docsearch-scraper

> DocSearch: https://graphql.guide/preface 10 records)
> DocSearch: https://graphql.guide/preface/ 10 records)

shortcuts commented 3 years ago

Hi @lorensr,

You can choose to stop the scraper from crawling either of the two URL patterns.

With stop_urls, you can add any regex you'd like. To skip URLs with a trailing slash, use "stop_urls": ["/$"]; to skip URLs without one, use "stop_urls": [".*(?<!/)$"].

Hope this answers your question!
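
To sanity-check the two patterns from the reply before putting one in the config, a small Python sketch like the following may help. It assumes the patterns are applied with search semantics, as in Python's re.search; the scraper's actual matching code is not shown in this thread. The helper name is_stopped is hypothetical and exists only for this illustration.

```python
import re
from typing import Iterable

# The two patterns quoted in the reply above.
STOP_TRAILING_SLASH = r"/$"            # matches URLs that end with "/"
STOP_NO_TRAILING_SLASH = r".*(?<!/)$"  # matches URLs that do not end with "/"


def is_stopped(url: str, stop_urls: Iterable[str]) -> bool:
    """Hypothetical helper: True if any stop pattern matches the URL.

    Assumes re.search semantics; this is not the scraper's own code.
    """
    return any(re.search(pattern, url) for pattern in stop_urls)


urls = [
    "https://graphql.guide/preface",
    "https://graphql.guide/preface/",
]

for url in urls:
    print(
        url,
        "| stopped by /$:", is_stopped(url, [STOP_TRAILING_SLASH]),
        "| stopped by .*(?<!/)$:", is_stopped(url, [STOP_NO_TRAILING_SLASH]),
    )
```

Under that assumption, each pattern matches exactly one of the two URLs, so either one should leave a single copy of the page in the index. Note that a bare site root such as https://graphql.guide/ also ends with a slash, so "/$" would match it as well.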