apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

URLFrontier spout doesn't take into account crawl Id #1353

Open klockla opened 1 week ago

klockla commented 1 week ago

The URLFrontier Spout ( https://github.com/apache/incubator-stormcrawler/blob/main/external/urlfrontier/src/main/java/org/apache/stormcrawler/urlfrontier/Spout.java ) ignores the crawl Id that can be set in the configuration (URLFRONTIER_CRAWL_ID_KEY = "urlfrontier.crawlid", defined in org.apache.stormcrawler.urlfrontier.Constants).

This results in a mix of URLs coming from distinct frontiers in URLFrontier.
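
To illustrate the kind of fix this suggests, here is a minimal sketch of reading the crawl Id from the configuration and attaching it to the request for URLs. It is not the actual Spout code: GetRequest, resolveCrawlId, buildRequest and the "DEFAULT" fallback are hypothetical names and values chosen for this example, and only the "urlfrontier.crawlid" key comes from the report above.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: not the actual StormCrawler Spout implementation. */
final class CrawlIdSketch {

    // Key quoted in the issue (URLFRONTIER_CRAWL_ID_KEY in Constants).
    static final String URLFRONTIER_CRAWL_ID_KEY = "urlfrontier.crawlid";

    // Assumption for this sketch: fallback used when no crawl Id is configured.
    static final String DEFAULT_CRAWL_ID = "DEFAULT";

    /** Hypothetical stand-in for the request the spout sends to the frontier. */
    static final class GetRequest {
        final String crawlId;
        final int maxUrlsPerQueue;

        GetRequest(String crawlId, int maxUrlsPerQueue) {
            this.crawlId = crawlId;
            this.maxUrlsPerQueue = maxUrlsPerQueue;
        }
    }

    /** Returns the configured crawl Id, or the fallback if missing or blank. */
    static String resolveCrawlId(Map<String, Object> conf) {
        Object value = conf.get(URLFRONTIER_CRAWL_ID_KEY);
        if (value == null) {
            return DEFAULT_CRAWL_ID;
        }
        String id = value.toString().trim();
        return id.isEmpty() ? DEFAULT_CRAWL_ID : id;
    }

    /** Builds a request scoped to the configured crawl, so crawls are not mixed. */
    static GetRequest buildRequest(Map<String, Object> conf, int maxUrlsPerQueue) {
        return new GetRequest(resolveCrawlId(conf), maxUrlsPerQueue);
    }

    public static void main(String[] args) {
        Map<String, Object> conf = new HashMap<>();
        conf.put(URLFRONTIER_CRAWL_ID_KEY, "news-crawl");
        GetRequest request = buildRequest(conf, 10);
        System.out.println("Requesting URLs for crawl: " + request.crawlId);
    }
}
```

In the real spout, the resolved value would presumably be passed to whatever request builder the URLFrontier client exposes, rather than to a stand-in class like the one above.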

rzo1 commented 1 week ago

Would you like to submit a PR with a fix?

klockla commented 1 week ago

Yes, I will, but I won't be able to get to it for another 2 weeks.