Open klockla opened 1 week ago
The URLFrontier Spout ( https://github.com/apache/incubator-stormcrawler/blob/main/external/urlfrontier/src/main/java/org/apache/stormcrawler/urlfrontier/Spout.java ) does not take into account the crawl ID that can be specified in the configuration parameters (URLFRONTIER_CRAWL_ID_KEY = "urlfrontier.crawlid", defined in org.apache.stormcrawler.urlfrontier.Constants).
This results in a mix of URLs coming from distinct frontiers in URLFrontier.
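To illustrate the expected behaviour, here is a minimal, self-contained sketch of reading the crawl ID from the configuration and attaching it to every fetch request. Only the key "urlfrontier.crawlid" comes from the Constants class quoted above. The `FetchRequest` class, the `buildRequest` helper and the "DEFAULT" fallback value are hypothetical stand-ins for illustration; they are not the actual Spout code or the URLFrontier API.

```java
import java.util.HashMap;
import java.util.Map;

class CrawlIdSketch {
    // Same key as Constants.URLFRONTIER_CRAWL_ID_KEY quoted in this issue.
    static final String URLFRONTIER_CRAWL_ID_KEY = "urlfrontier.crawlid";

    // Assumption: value used when no crawl ID is configured.
    static final String DEFAULT_CRAWL_ID = "DEFAULT";

    // Returns the configured crawl ID, or the fallback if it is missing or blank.
    static String resolveCrawlId(Map<String, Object> conf) {
        Object value = conf.get(URLFRONTIER_CRAWL_ID_KEY);
        if (value == null) {
            return DEFAULT_CRAWL_ID;
        }
        String id = value.toString().trim();
        return id.isEmpty() ? DEFAULT_CRAWL_ID : id;
    }

    // Hypothetical stand-in for the request the Spout sends to the frontier.
    static final class FetchRequest {
        final String crawlId;
        final int maxUrlsPerQueue;

        FetchRequest(String crawlId, int maxUrlsPerQueue) {
            this.crawlId = crawlId;
            this.maxUrlsPerQueue = maxUrlsPerQueue;
        }
    }

    // Builds a request that always carries the crawl ID, so that URLs
    // belonging to other crawls are not returned to this Spout.
    static FetchRequest buildRequest(Map<String, Object> conf, int maxUrlsPerQueue) {
        return new FetchRequest(resolveCrawlId(conf), maxUrlsPerQueue);
    }

    public static void main(String[] args) {
        Map<String, Object> conf = new HashMap<>();
        conf.put(URLFRONTIER_CRAWL_ID_KEY, "crawl-A");
        FetchRequest request = buildRequest(conf, 10);
        System.out.println("Requesting URLs for crawl ID: " + request.crawlId);
    }
}
```

In the real Spout, the resolved value would presumably be set on the parameters sent with each request for URLs, so that only URLs of the configured crawl are returned.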
Would you like to submit a PR with a fix?
Yes, I will, but I won't be able to do so for at least 2 weeks.