apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Issue with ConcurrentModificationException for Metadata in StatusMetricsBolt #909

Closed: juli-alvarez closed this issue 3 years ago

juli-alvarez commented 3 years ago

We have been using StormCrawler with Elasticsearch under heavy load, with parallelism across multiple workers, for some time without any issues. After upgrading from 1.18 to 2.1 (and Storm from 1.2.3 to 2.2.0), we started getting ConcurrentModificationException from Kryo for the Metadata class.

2021-09-14 17:13:48.797 o.a.s.e.e.ReportError Thread-20-spout-executor[59, 59] [ERROR] Error
java.lang.RuntimeException: com.esotericsoftware.kryo.KryoException: java.util.ConcurrentModificationException
Serialization trace:
md (com.digitalpebble.stormcrawler.Metadata)
    at org.apache.storm.utils.Utils$1.run(Utils.java:409) ~[storm-client-2.2.0.jar:2.2.0]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Caused by: com.esotericsoftware.kryo.KryoException: java.util.ConcurrentModificationException
Serialization trace:
md (com.digitalpebble.stormcrawler.Metadata)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:101) ~[kryo-3.0.3.jar:?]
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) ~[kryo-3.0.3.jar:?]
    at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) ~[kryo-3.0.3.jar:?]
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) ~[kryo-3.0.3.jar:?]
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) ~[kryo-3.0.3.jar:?]
    at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) ~[kryo-3.0.3.jar:?]
    at org.apache.storm.serialization.KryoValuesSerializer.serializeInto(KryoValuesSerializer.java:38) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.serialization.KryoTupleSerializer.serialize(KryoTupleSerializer.java:40) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.daemon.worker.WorkerTransfer.tryTransferRemote(WorkerTransfer.java:116) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.daemon.worker.WorkerState.tryTransferRemote(WorkerState.java:524) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.executor.ExecutorTransfer.tryTransfer(ExecutorTransfer.java:68) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.executor.spout.SpoutOutputCollectorImpl.sendSpoutMsg(SpoutOutputCollectorImpl.java:140) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.executor.spout.SpoutOutputCollectorImpl.emit(SpoutOutputCollectorImpl.java:70) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.spout.SpoutOutputCollector.emit(SpoutOutputCollector.java:56) ~[storm-client-2.2.0.jar:2.2.0]
    at com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout.nextTuple(AbstractQueryingSpout.java:197) ~[stormjar.jar:?]
    at org.apache.storm.executor.spout.SpoutExecutor$2.call(SpoutExecutor.java:193) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.executor.spout.SpoutExecutor$2.call(SpoutExecutor.java:160) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.utils.Utils$1.run(Utils.java:394) ~[storm-client-2.2.0.jar:2.2.0]
    ... 1 more

We did some research and found that the root cause was that the Metadata object was being mutated after it was emitted. We were also able to confirm that the issue disappeared whenever we removed the status_metrics bolt from the topology. That made sense: the Metadata emitted by the spout was modified later in the default stream while the same tuple was also being emitted to the status metrics bolt.
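
The failure mode can be reproduced with plain JDK classes. Metadata wraps a HashMap, and HashMap iterators are fail-fast: if the map is modified while Kryo walks its entries during tuple serialization, the next iteration step throws ConcurrentModificationException. A minimal single-threaded sketch (the metadata keys are hypothetical; in the topology the mutation happens from another thread after the emit, but the fail-fast behaviour is the same):

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class MetadataMutationDemo {

    // Returns true if mutating the map mid-iteration triggers the
    // fail-fast ConcurrentModificationException, just like Kryo's
    // traversal of the Metadata map does during serialization.
    static boolean mutateDuringIteration() {
        Map<String, String> md = new HashMap<>();
        md.put("url", "http://example.com"); // hypothetical metadata keys
        md.put("depth", "1");
        StringBuilder serialized = new StringBuilder();
        try {
            // Stand-in for Kryo walking the map while writing the tuple.
            for (Map.Entry<String, String> e : md.entrySet()) {
                serialized.append(e.getKey()).append('=').append(e.getValue());
                // The topology does this from a bolt after the tuple was
                // emitted; adding a NEW key bumps the map's modCount, so the
                // iterator's next call throws.
                md.put("fetch.statusCode", "200");
            }
        } catch (ConcurrentModificationException cme) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("threw CME: " + mutateDuringIteration());
    }
}
```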

We propose connecting the status metrics bolt to the __system component via the tick stream, since that bolt only uses the tick tuple to run a periodic, cron-like job. This way we avoid passing it the real tuple, which it never uses and whose Metadata was causing the issue.
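
The fix above avoids sending the real tuple at all, which is the right call when the bolt never reads it. More generally, when a mutable map genuinely has to reach two consumers, emitting a defensive snapshot copy sidesteps the race: the serializer walks the copy while the original keeps changing. A stdlib sketch of that alternative (not the fix adopted here; keys are hypothetical):

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class MetadataCopyDemo {

    // Returns true if iterating a snapshot copy survives concurrent-style
    // mutation of the original map (no ConcurrentModificationException).
    static boolean iterateCopyWhileMutatingOriginal() {
        Map<String, String> md = new HashMap<>();
        md.put("url", "http://example.com"); // hypothetical metadata keys
        md.put("depth", "1");

        // Defensive copy taken at emit time; the copy has its own modCount.
        Map<String, String> snapshot = new HashMap<>(md);

        StringBuilder serialized = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : snapshot.entrySet()) {
                serialized.append(e.getKey()).append('=').append(e.getValue());
                // Mutating the ORIGINAL no longer disturbs the iteration.
                md.put("fetch.statusCode", "200");
            }
        } catch (ConcurrentModificationException cme) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("copy safe: " + iterateCopyWhileMutatingOriginal());
    }
}
```

The trade-off is an extra copy per emit, which matters under heavy load; routing the bolt onto the tick stream avoids both the copy and the shared object.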

es-crawler.flux

  - from: "__system"
    to: "status_metrics"
    grouping:
      type: SHUFFLE
      streamId: "__tick"

BTW, I'm working with @matiascrespof and @jcruzmartini.

Thanks!

jnioche commented 3 years ago

thanks @juli-alvarez This makes a lot of sense and would be a far better way of doing it. I assume you tried the solution above and it worked. Would you like to submit a PR to change the *.flux file?

juli-alvarez commented 3 years ago

Hi @jnioche! Glad you like the approach. Yes, we have had the crawler running for the past 24 hours and everything is working as expected: no exceptions, no workers dying, and the Grafana dashboards look good as well. Sure, I will create the PR with the changes in the flux file.

jnioche commented 3 years ago

It's not that I don't want to do it myself: I really want you to take full credit for it ;-)

juli-alvarez commented 3 years ago

@jnioche Done! Thanks Julien 🥇

jnioche commented 3 years ago

Fixed in #910