Closed matiascrespof closed 4 years ago
Hi Matias
Thanks and yes please open a PR.
That would mean that they get failed because they were treated as non acked in time. Do you know if they actually make it to the statusupdater bolt at all? This probably happens on a fraction of the tuples. I suppose this would affect the performance as the spouts would not release more URLs tp process.
thanks again!
On Thu, 28 May 2020 at 16:20, Matias Crespo notifications@github.com wrote:
We found with @jcruzmartini https://github.com/jcruzmartini that elasticsearch Indexer is bolt acking before emit tuples in afterBulk method is causing ack failures in spout after timeout set in topology.
Proposed solution is change order of emit / ack in com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt :
if (!failed) { acked++; _collector.emit(StatusStreamName, t, new Values(u, metadata, Status.FETCHED)); _collector.ack(t); } else { ... ...
After migrate from 1.13 to 1.16 we noticed bad performance in our crawler, and also a lot of failures in the spout, after add IndexerBolt class in our project with that modification it started working correctly with great performance
@jnioche https://github.com/jnioche we can create a pull request with a simple change in that class if you want
Thanks! Matias
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/DigitalPebble/storm-crawler/issues/800, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABVJT7R7THPG3622XDOXQ3RTZ6KRANCNFSM4NNF3DWA .
--
Hi @jnioche , thanks for your quick reply!
Exactly, they were treated as non acked and is only a fraction of the tuples. the tuples are failing after topology reach time out set in topology.message.timeout.secs. They actually make it to the statusupdater , but at some point it stops since it reach the topology.max.spout.pending, so the topology gets stuck.
We are going to create a PR today.
Thanks for your help!
thanks for your quick reply @jnioche I think this issue is affecting all the users that are using stormcrawler with ES, it's surprising for me why this was not reported before.
changed the title a bit as it is not about the anchoring as such
@jcruzmartini it wasn't an easy one to spot, but you and @matiascrespof have great detective skills ;-)
Thanks again for reporting it and submitting a PR. I'll go through all the acks to see if this happens anywhere else
Fixed by #801
We found with @jcruzmartini that elasticsearch Indexer is bolt acking before emit tuples in afterBulk method is causing ack failures in spout after timeout set in topology.
Proposed solution is change order of emit / ack in com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt :
After migrate from 1.13 to 1.16 we noticed bad performance in our crawler, and also a lot of failures in the spout, after add IndexerBolt class in our project with that modification it started working correctly with great performance
@jnioche we can create a pull request with a simple change in that class if you want
Thanks! Matias