apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

[URLFrontier] URLFrontier extension not returning ID preventing Status-ACK making crawling impossible #981

Closed · FelixEngl closed this 2 years ago

FelixEngl commented 2 years ago

Hello @jnioche,

we switched our URL and status handling from a custom bolt to URLFrontier. However, I noticed that the StatusUpdaterBolt never acks a single tuple. After digging into the code, adding some log events, cleaning up the async code and improving the state management, my unit test shows the following log:

12:35:05.187 [Time-limited test] INFO  c.d.s.u.StatusUpdaterBolt - Initialisation of connection to URLFrontier service on localhost:53770
12:35:05.187 [Time-limited test] INFO  c.d.s.u.StatusUpdaterBolt - Allowing up to 100000 message in flight
12:35:05.194 [Time-limited test] ERROR c.d.s.u.PartitionUtil - Unknown partition mode : null - forcing to byHost
12:35:05.194 [Time-limited test] INFO  c.d.s.u.URLPartitioner - Using partition mode : QUEUE_MODE_HOST
12:35:05.263 [Time-limited test] TRACE c.d.s.u.StatusUpdaterBolt - Added to waitAck https://www.url.net/something with ID https://www.url.net/something total 1 - sent to localhost:53770
12:35:05.751 [grpc-default-executor-1] WARN  c.d.s.u.StatusUpdaterBolt - Could not find unacked tuple for blank id ``. (Ack: )
12:35:05.752 [grpc-default-executor-1] TRACE c.d.s.u.StatusUpdaterBolt - Trace for unpacked tuple for blank id: 
12:35:10.787 [Time-limited test] INFO  c.d.s.u.ChannelManager - Shutting down channel ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=1, target=localhost:53770}}

It looks like URLFrontier does not provide an ID when responding to a put. This is an error that cannot be fixed on the StormCrawler side: without the ID, the StatusUpdaterBolt is unable to ack a single tuple, which makes crawling basically impossible.
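To illustrate what goes wrong (this is a simplified sketch, not the actual StormCrawler code; the class and method names are placeholders): the bolt stores every tuple it sends under the ID it passed to the frontier and can only ack a tuple once the frontier echoes that same ID back. With a blank ID in the response, the lookup always misses, which is exactly the warning shown in the log above.

```java
// Simplified sketch of the ack correlation, assuming a map keyed by the ID
// sent to the frontier. Names such as StatusAckSketch and onAck are
// hypothetical placeholders, not the real StormCrawler classes.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class StatusAckSketch {

    // tuples waiting for an acknowledgement, keyed by the ID sent to the frontier
    private final Map<String, List<Object>> waitAck = new ConcurrentHashMap<>();

    void send(String id, List<Object> tuples) {
        waitAck.put(id, tuples);
        // the actual gRPC put to the frontier would happen here
    }

    // invoked from the gRPC response observer when the frontier answers a put
    void onAck(String returnedId) {
        List<Object> tuples = waitAck.remove(returnedId);
        if (tuples == null) {
            // this is the situation above: the frontier returns an empty ID,
            // the lookup misses and the tuples are never acked
            System.err.println("Could not find unacked tuple for id '" + returnedId + "'");
            return;
        }
        // collector.ack(tuple) for each stored tuple would happen here
    }
}
```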

I added the unit tests etc. in this PR: https://github.com/DigitalPebble/storm-crawler/pull/980

Best Regards

Felix

jnioche commented 2 years ago

thanks @FelixEngl, I will look at #980, but if there is a problem with URLFrontier the issue should have been opened there. Could you clarify which version of URLFrontier you are using, how you are starting it, etc.?

FelixEngl commented 2 years ago

Ah sorry, I really should've opened it there. I'll open one there immediately and link it to this issue. (tbh I was so happy that I finally found the problem after 3 days of searching that this totally slipped my mind. ^^')

About my configuration: I'm starting version 2.1 from a Compose file. Uploading the URLs and connecting to the Frontier are no problem. In the unit tests of the PR I'm just starting a Docker container with default parameters. I also created a Frontier Testcontainer, which should help us write even more tests once this one works; a rough sketch is below. (Maybe I should put that URLFrontier Testcontainer in a separate repository, contribute it to the Testcontainers project, or make a PR at the URLFrontier repository. What do you think would be the best approach?)
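A minimal sketch of what such a wrapper could look like (not the actual code from the PR; the crawlercommons/url-frontier image name and the default gRPC port 7071 are assumptions, adjust them to the real setup):

```java
// Minimal sketch of a URLFrontier test container based on Testcontainers.
// The image name and port are assumptions and may need adjusting.
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

public class URLFrontierContainer extends GenericContainer<URLFrontierContainer> {

    private static final int FRONTIER_PORT = 7071; // assumed default gRPC port

    public URLFrontierContainer(String imageTag) {
        super(DockerImageName.parse("crawlercommons/url-frontier:" + imageTag));
        withExposedPorts(FRONTIER_PORT);
    }

    // host:port to put into the StormCrawler configuration
    public String getFrontierAddress() {
        return getHost() + ":" + getMappedPort(FRONTIER_PORT);
    }
}
```

A test would then start the container before the test run and pass getFrontierAddress() to the bolt configuration instead of a hard-coded localhost address.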

jnioche commented 2 years ago

thanks @FelixEngl. There already is a test suite in URLFrontier. If we have a test that can reproduce the issue, it should be added there. Having a test container in the URLFrontier module could also be useful for testing that the related components in StormCrawler behave as expected.

jnioche commented 2 years ago

I think the underlying issue has already been fixed in URLFrontier

https://github.com/crawler-commons/url-frontier/commit/ced0150d3a516ba8c8ad94b362fb8960ab2b35d6

I will release a new version of URLFrontier shortly.

FelixEngl commented 2 years ago

> thanks @FelixEngl There already is a test suite in URLFrontier. If we have a test that can reproduce the issue, it should be added there. Having a test container in the URLFrontier module could also be useful for testing that the related components in StormCrawler behave as expected.

Then I'll add some more tests once https://github.com/DigitalPebble/storm-crawler/pull/980 has been accepted. I already wrote down some ideas for useful tests; a rough sketch of one of them is below.
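One such idea as a rough sketch only (Mockito + JUnit; the tuple field names, the urlfrontier.address configuration key and the bolt package are assumptions derived from the log output and may not match the actual StormCrawler code exactly):

```java
// Rough sketch of a test that checks a tuple is eventually acked once the
// frontier answers. Field names, the config key and the package names are
// assumptions and may differ from the real StormCrawler code.
import static org.mockito.Mockito.*;

import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.persistence.Status;
import com.digitalpebble.stormcrawler.urlfrontier.StatusUpdaterBolt;
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.tuple.Tuple;
import org.junit.Test;

public class StatusUpdaterAckTest {

    // container from the sketch above, started before the tests (setup omitted)
    private static URLFrontierContainer frontierContainer;

    @Test
    public void tupleIsAckedWhenFrontierAnswers() {
        Map<String, Object> conf = new HashMap<>();
        conf.put("urlfrontier.address", frontierContainer.getFrontierAddress()); // assumed config key

        StatusUpdaterBolt bolt = new StatusUpdaterBolt();
        OutputCollector collector = mock(OutputCollector.class);
        bolt.prepare(conf, mock(TopologyContext.class), collector);

        // assumed tuple fields read by the bolt
        Tuple tuple = mock(Tuple.class);
        when(tuple.getStringByField("url")).thenReturn("https://www.url.net/something");
        when(tuple.getValueByField("status")).thenReturn(Status.DISCOVERED);
        when(tuple.getValueByField("metadata")).thenReturn(new Metadata());

        bolt.execute(tuple);

        // the ack arrives asynchronously on the gRPC executor thread
        verify(collector, timeout(5000)).ack(tuple);
    }
}
```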

jnioche commented 2 years ago

Will close this issue as the underlying problem has been fixed already