GateNLP / ultimate-sitemap-parser

Ultimate Website Sitemap Parser
https://mediacloud.org/

Reduce recursivity level for sitemap fetcher #5

Open pypt opened 6 years ago

pypt commented 6 years ago

10 levels deep is probably too much:

2018-11-26 13:11:19,139 INFO mediawords.util.sitemap.helpers [162086/MainThread]: Fetching URL https://www.juiceplus.com/fr/fr/franchise/sitemap.xml...
2018-11-26 13:11:19,428 INFO mediawords.util.sitemap.fetchers [162086/MainThread]: Parsing sitemap from URL https://www.juiceplus.com/fr/fr/franchise/sitemap.xml...
2018-11-26 13:11:19,508 INFO mediawords.util.sitemap.fetchers [162086/MainThread]: Fetching level 8 sitemap from https://www.juiceplus.com/il/en/franchise/sitemap.xml...
2018-11-26 13:11:19,508 INFO mediawords.util.sitemap.helpers [162086/MainThread]: Fetching URL https://www.juiceplus.com/il/en/franchise/sitemap.xml...

nubonics commented 4 years ago

No. The purpose of a sitemap is to show every single page on the website, so lowering the depth would result in an invalid sitemap extraction. I completely disagree that this is a bug.
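
For anyone weighing the two positions above, here is a minimal sketch of a depth-limited sitemap walk that also remembers which sitemap URLs it has already visited. It is not this library's implementation: `walk_sitemaps`, `Fetcher`, `fake_fetch`, and the `example.com` URLs are all hypothetical names made up for illustration. It assumes the deep recursion comes from locale sitemaps referencing one another, which the log above suggests but does not confirm.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical fetcher type (not part of any real library):
# given a sitemap URL, return (child_sitemap_urls, page_urls).
Fetcher = Callable[[str], Tuple[List[str], List[str]]]


def walk_sitemaps(fetch: Fetcher, root_url: str, max_depth: int = 10) -> List[str]:
    """Collect page URLs reachable from root_url, following at most max_depth levels."""
    seen: Set[str] = set()   # sitemap URLs already fetched
    pages: List[str] = []    # page URLs in the order they were found

    def visit(url: str, depth: int) -> None:
        # Stop when the depth limit is exceeded or the sitemap was already fetched.
        if depth > max_depth or url in seen:
            return
        seen.add(url)
        children, found = fetch(url)
        pages.extend(found)
        for child in children:
            visit(child, depth + 1)

    visit(root_url, 1)
    return pages


# Toy in-memory "site" whose two locale sitemaps reference each other.
SITE: Dict[str, Tuple[List[str], List[str]]] = {
    "https://example.com/sitemap.xml": (
        ["https://example.com/fr/sitemap.xml"], []),
    "https://example.com/fr/sitemap.xml": (
        ["https://example.com/il/sitemap.xml"], ["https://example.com/fr/a"]),
    "https://example.com/il/sitemap.xml": (
        ["https://example.com/fr/sitemap.xml"], ["https://example.com/il/b"]),
}


def fake_fetch(url: str) -> Tuple[List[str], List[str]]:
    # Unknown URLs behave like empty sitemaps.
    return SITE.get(url, ([], []))
```

With the visited set in place, each sitemap is fetched once regardless of how the locales link to each other, so every page is still found and the depth limit only acts as a safety net. Lowering `max_depth` on its own does drop pages that sit behind deeper sitemaps, which is the concern raised in the comment above.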