apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0
887 stars 262 forks source link

Issue with the order of emit and emitOutlink for redirections in FetcherBolt #954

Closed juli-alvarez closed 2 years ago

juli-alvarez commented 2 years ago

What kind of issue is this?

Thanks!

Hi @jnioche. I've found an issue in the fetcher bolt . When a redirection takes place, the outlink (that belongs to the page in the redirTo) is being processed after the update to redirection status of the current url is emitted to the status updater bolt. For those cases where the fetcher and the status updater bolts are in the same worker, the metadata instance is the same which is causing unexpected behavior, like metadata keys being lost, when we are creating the metadata for the child based on the parent's in the metadata transfer class. The issue is easily fixed by changing the order in the fetcher bolt to parse the outlink first and then emit the update of the parent url, the same way it's handled for redirections in the jsoup bolt. If you want I can create a PR for it. Thanks.

jnioche commented 2 years ago

hi @juli-alvarez, thanks for reporting this. This is definitely a bug which might affect quite a few users. We should copy the metadata anyway so that any alterations in a tuple would not affect the other, regardless of order. We could change the order as you suggested to be consistent with the way it is done in the parser. A PR would be greatly appreciated! Thanks again

juli-alvarez commented 2 years ago

Hi @jnioche. What do you think about copying parent's metadata inside the emitOutlink method? This way we cover fetcher, simple fetcher and jsoup parser bolts. Also, I've seen in the fetcher bolts the emitOutlink is also called for sitemaps, so we cover that as well.. The only downside I see is if anyone overrides the method and removes the copy then the issue will appear again.

jnioche commented 2 years ago

Judging by the code in StatusEmitterBolt, the metadata for the outlink is created by the MetadataTransfer. It is a brand new one and should not be the one from the original URL. So the problem you describe should not happen really. Do you have a unit test to reproduce it?

juli-alvarez commented 2 years ago

Hi @jnioche . Yes, that's true, but the issue only arises in the fetcher where the parent url is being emitted before the outlink is processed. In my case, when the fetcher and the status bolts were in the same worker, the instance used in the metadata transfer to create the outlink's metadata was the same. So, if parent's metadata was modified in the status bolt it caused weird issues when creating the outlink in the fetcher, like keys missing randomly. That's why changing the order fixed the issue. Probably copying the parent's metadata is redundant if the order is changed because as you said, outlink metadata is a new instance.

jnioche commented 2 years ago

ok let's change the order in the FetcherBolt. Can you send a PR with a comment in the code referencing this issue so that if anyone looks at it in the future they can be aware that the order matters. Thanks!