apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Elasticsearch IndexerBolt not being acked correctly causing failures in spout #800

Closed matiascrespof closed 4 years ago

matiascrespof commented 4 years ago

We found with @jcruzmartini that the Elasticsearch IndexerBolt acks tuples before emitting them in the afterBulk method, which causes ack failures in the spout once the timeout set in the topology is reached.

The proposed solution is to change the order of emit / ack in com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt :

if (!failed) {
    acked++;
    _collector.emit(StatusStreamName, t, new Values(u,
            metadata, Status.FETCHED));
    _collector.ack(t);
} else {
...
...
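The effect of the ordering can be illustrated with a toy model of Storm's XOR-based acker (as described in the Storm documentation on guaranteeing message processing): a bolt's ack reports the anchor id XORed with the ids of the children it emitted anchored to that tuple before calling ack, so acking first means the child id never reaches the acker. The class and method names below (AckOrderDemo, spoutEmit, boltAck) are invented for this sketch; this is not Storm or StormCrawler code.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of Storm's XOR acker. A tuple tree is complete when the
// XOR of all contributions reported for its root reaches zero.
// Illustration only; names and structure are invented for this sketch.
public class AckOrderDemo {
    public static final Map<Long, Long> trees = new HashMap<>();

    // Spout registers a new tuple tree rooted at tupleId.
    public static void spoutEmit(long root, long tupleId) {
        trees.merge(root, tupleId, (a, b) -> a ^ b);
    }

    // A bolt's ack carries the anchor id XOR the ids of the children
    // it emitted anchored to that tuple BEFORE calling ack.
    public static void boltAck(long root, long anchorId, long... childrenEmittedSoFar) {
        long val = anchorId;
        for (long c : childrenEmittedSoFar) {
            val ^= c;
        }
        trees.merge(root, val, (a, b) -> a ^ b);
    }

    public static boolean complete(long root) {
        return trees.getOrDefault(root, 1L) == 0L;
    }

    public static void main(String[] args) {
        long root = 1L, t = 42L, child = 7L;

        // Correct order: emit before ack. The child id is part of the
        // ack val, so the tree stays pending until the child is acked.
        spoutEmit(root, t);
        boltAck(root, t, child);            // child was emitted before the ack
        System.out.println(complete(root)); // false: child still tracked
        boltAck(root, child);               // downstream bolt acks the child
        System.out.println(complete(root)); // true: tree complete

        trees.clear();

        // Buggy order: ack before emit. The child id is never reported.
        spoutEmit(root, t);
        boltAck(root, t);                   // ack sent with no children
        System.out.println(complete(root)); // true: tree "completes" early
        boltAck(root, child);               // late child ack re-opens the tree
        System.out.println(complete(root)); // false: stuck until timeout
    }
}
```

In the buggy ordering the emitted tuple is effectively untracked, which is consistent with the timeout failures described in this issue.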

After migrating from 1.13 to 1.16 we noticed poor performance in our crawler, as well as a lot of failures in the spout. After adding the IndexerBolt class with that modification to our project, it started working correctly again with great performance.

@jnioche we can create a pull request with this simple change to that class if you want

Thanks! Matias

jnioche commented 4 years ago

Hi Matias

Thanks and yes please open a PR.

That would mean that they get failed because they were treated as not acked in time. Do you know if they actually make it to the statusupdater bolt at all? This probably happens on a fraction of the tuples. I suppose this would affect performance, as the spouts would not release more URLs to process.

thanks again!


matiascrespof commented 4 years ago

Hi @jnioche , thanks for your quick reply!

Exactly, they were treated as not acked, and it is only a fraction of the tuples. The tuples fail once the topology reaches the timeout set in topology.message.timeout.secs. They actually make it to the statusupdater, but at some point the topology stops because it reaches topology.max.spout.pending, so it gets stuck.
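For context, the two settings mentioned above are standard Storm topology options, typically set in the topology configuration file. A sketch of how they might look (the values here are illustrative defaults, not recommendations from this issue):

```yaml
# Maximum time (seconds) a tuple tree may stay incomplete before the
# spout fails it; tuples stuck by the ack/emit ordering bug are failed
# once this timeout is reached.
topology.message.timeout.secs: 300

# Cap on unacked tuples in flight per spout task. Once stuck tuples
# pile up to this limit, the spout stops emitting and the topology
# stalls, matching the behavior described above.
topology.max.spout.pending: 100
```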

We are going to create a PR today.

Thanks for your help!

jcruzmartini commented 4 years ago

thanks for your quick reply @jnioche. I think this issue is affecting all users running StormCrawler with ES; it's surprising to me that it was not reported before.

jnioche commented 4 years ago

Changed the title a bit, as it is not about the anchoring as such.

@jcruzmartini it wasn't an easy one to spot, but you and @matiascrespof have great detective skills ;-)

Thanks again for reporting it and submitting a PR. I'll go through all the acks to see if this happens anywhere else

jnioche commented 4 years ago

Fixed by #801