Closed — FelixEngl closed this issue 2 years ago
thanks @FelixEngl I will look at #980, but if there is a problem with URLFrontier the issue should have been opened there. Could you clarify which version of URLFrontier you are using, how you are starting it, etc.?
Ah sorry, I really should have opened it there. I'll open one there immediately and link it to this issue. (Tbh, I was so happy that I had finally found the problem after three days of searching that this totally slipped my mind. ^^')
About my configuration: I'm starting version 2.1 from a compose file. Uploading the URLs and connecting to the frontier work fine. In the unit tests of the PR I'm just starting a Docker container with default parameters. I also created a Frontier Testcontainer, which should help us write even more tests once this one works. (Maybe I should put that URL-Frontier Testcontainer in a separate repository, contribute it to the Testcontainers repository, or open a PR against the URL-Frontier repository. What do you think would be the best approach?)
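For reference, a minimal compose file along these lines could look like the sketch below. The image name and port are assumptions based on URLFrontier's published defaults, not details taken from this thread; adjust them to your setup:

```yaml
# Hypothetical docker-compose sketch for running URLFrontier 2.1 locally.
version: "3"
services:
  frontier:
    image: crawlercommons/url-frontier:2.1   # image name assumed; adjust to your registry/tag
    ports:
      - "7071:7071"                          # 7071 is URLFrontier's default gRPC port
```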
thanks @FelixEngl There already is a test suite in URLFrontier. If we have a test that can reproduce the issue, it should be added there. Having a test container in the URLFrontier module could also be useful for testing that the related components in StormCrawler behave as expected.
I think the underlying issue has already been fixed in URLFrontier
https://github.com/crawler-commons/url-frontier/commit/ced0150d3a516ba8c8ad94b362fb8960ab2b35d6
I will release a new version of URLFrontier shortly.
Then I'll add some more tests after https://github.com/DigitalPebble/storm-crawler/pull/980 has been ACKed. I have already written down some ideas for useful tests.
Will close this issue, as the underlying problem has already been fixed.
Hello @jnioche,
we switched our URL and status handling from a custom bolt to URL-Frontier, but I noticed that the status bolt is not acking any tuples. After going through the code, adding some log events, cleaning up the async code, and improving the state management, my unit test shows the following log:
It looks like URL-Frontier does not provide an ID when responding to a put; this is an error that cannot be fixed on the StormCrawler side. Without the ID, the status bolt is unable to ack a single tuple, making crawling basically impossible.
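To illustrate why the missing ID is fatal, here is a small self-contained sketch of the ack-correlation pattern involved (this is illustrative code, not StormCrawler's actual implementation; the class and method names are invented): the client keeps pending tuples keyed by the ID it sent with each URL, and can only ack a tuple when the frontier echoes that ID back in its response.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of ack correlation between a put request and the frontier's response.
public class AckCorrelation {
    // Tuples awaiting an ack from the frontier, keyed by the URL ID sent with the put.
    private final Map<String, String> pending = new ConcurrentHashMap<>();

    // Remember the tuple until the frontier answers for this ID.
    public void send(String id, String tuple) {
        pending.put(id, tuple);
    }

    // Called for each response streamed back by the frontier.
    // Returns the matched tuple, or null if the ID cannot be correlated.
    public String onAck(String idFromFrontier) {
        // An empty ID cannot be matched to any pending tuple, so nothing ever gets acked.
        if (idFromFrontier == null || idFromFrontier.isEmpty()) {
            return null;
        }
        return pending.remove(idFromFrontier);
    }

    public static void main(String[] args) {
        AckCorrelation c = new AckCorrelation();
        c.send("u1", "tuple-for-http://example.com/");
        System.out.println(c.onAck(""));   // missing ID: null, tuple stays un-acked
        System.out.println(c.onAck("u1")); // proper ID: the tuple can be acked
    }
}
```

With an empty ID in every response, the pending map only grows and no tuple is ever acked, which matches the behaviour described above.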
I added the unit tests etc. in this PR: https://github.com/DigitalPebble/storm-crawler/pull/980
Best Regards
Felix