flairNLP / fundus

A very simple news crawler with a funny name
MIT License

Clean URLs before inserting into response_cache #504

Closed addie9800 closed 2 weeks ago

addie9800 commented 2 weeks ago

Depending on the source of a URL (RSS feed or sitemap), publishers sometimes append a query parameter or fragment to the URL that indicates its origin. For example, Kicker appends #rssom. If max_articles is set to a large enough value, the same article may be crawled twice, because the response_cache treats the two URLs as distinct entries.
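
One possible way to do the cleaning is sketched below. It is only an illustration and not how fundus implements it: `clean_url`, `should_crawl`, the `TRACKING_PARAMS` set, and the plain `set` standing in for response_cache are all hypothetical names, and the example URLs are made up. It drops the fragment (which covers the #rssom case) and any query parameter listed as an origin marker, then uses the cleaned URL as the cache key.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of origin-marking query parameters (an assumption,
# not taken from fundus); extend it per publisher as needed.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}


def clean_url(url: str) -> str:
    """Return the URL without its fragment and without origin-marking parameters."""
    parts = urlsplit(url)
    # Keep only the query parameters that are not known origin markers.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    # Passing "" as the last component drops the fragment, e.g. "#rssom".
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Stand-in for response_cache; the real cache in fundus may look different.
seen_urls: set[str] = set()


def should_crawl(url: str) -> bool:
    """Return True only the first time a cleaned URL is encountered."""
    key = clean_url(url)
    if key in seen_urls:
        return False
    seen_urls.add(key)
    return True
```

With this in place, the sitemap URL `https://example.com/article-1` and the RSS URL `https://example.com/article-1#rssom` map to the same cache key, so the second one would be skipped.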