apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Performance degradation using version 2.1.0 #915

Closed jcruzmartini closed 2 years ago

jcruzmartini commented 3 years ago

We have been experiencing performance degradation since moving to the new version of StormCrawler. We are fairly sure the cause is this workaround:

    /** Workaround for https://issues.apache.org/jira/projects/STORM/issues/STORM-3582?filter=allopenissues **/
    protected synchronized void emit(String streamId, Tuple anchor, List<Object> tuple) {
        collector.emit(streamId, anchor, tuple);
    }

It's worth mentioning that we use StormCrawler intensively with a high number of threads (300), so having this method synchronized adds extra delays and hurts performance. Here is an example of how SC 2.1 performs compared with 1.18:

[image: crawler metrics, SC 2.1 vs SC 1.18]

The left side shows SC 2.1; we then stopped the crawler and restarted it using SC 1.18. We also tried removing the synchronized keyword from the emit method. The metrics started to look better, but we then hit the error below, which is related to this Apache Storm issue: https://issues.apache.org/jira/browse/STORM-3620

./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO] 2021-10-15 13:51:32.752 o.a.s.u.Utils Thread-16-fetcher-executor[10, 10] [ERROR] Async loop died!
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO] com.esotericsoftware.kryo.KryoException: java.util.ConcurrentModificationException
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO] Serialization trace:
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO] md (com.digitalpebble.stormcrawler.Metadata)
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:101) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at org.apache.storm.serialization.KryoValuesSerializer.serializeInto(KryoValuesSerializer.java:38) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at org.apache.storm.serialization.KryoTupleSerializer.serialize(KryoTupleSerializer.java:40) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at org.apache.storm.daemon.worker.WorkerTransfer.tryTransferRemote(WorkerTransfer.java:116) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at org.apache.storm.daemon.worker.WorkerState.tryTransferRemote(WorkerState.java:524) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at org.apache.storm.executor.ExecutorTransfer.tryTransfer(ExecutorTransfer.java:68) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at org.apache.storm.executor.bolt.BoltExecutor$1.tryFlushPendingEmits(BoltExecutor.java:200) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:166) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:159) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at org.apache.storm.utils.Utils$1.run(Utils.java:394) [storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
./worker.log:2021-10-15 13:51:32.756 STDERR Thread-2 [INFO] Caused by: java.util.ConcurrentModificationException
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445) ~[?:1.8.0_292]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at java.util.HashMap$EntryIterator.next(HashMap.java:1479) ~[?:1.8.0_292]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at java.util.HashMap$EntryIterator.next(HashMap.java:1477) ~[?:1.8.0_292]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.MapSerializer.write(MapSerializer.java:99) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.MapSerializer.write(MapSerializer.java:39) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     ... 15 more
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO] 2021-10-15 13:51:32.756 o.a.s.e.e.ReportError Thread-16-fetcher-executor[10, 10] [ERROR] Error
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO] java.lang.RuntimeException: com.esotericsoftware.kryo.KryoException: java.util.ConcurrentModificationException
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO] Serialization trace:
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO] md (com.digitalpebble.stormcrawler.Metadata)
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at org.apache.storm.utils.Utils$1.run(Utils.java:409) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO]     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
./worker.log:2021-10-15 13:51:32.756 STDERR Thread-2 [INFO] Caused by: com.esotericsoftware.kryo.KryoException: java.util.ConcurrentModificationException
./worker.log-2021-10-15 13:51:32.756 STDERR Thread-2 [INFO] Serialization trace:
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO] md (com.digitalpebble.stormcrawler.Metadata)
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:101) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.serialization.KryoValuesSerializer.serializeInto(KryoValuesSerializer.java:38) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.serialization.KryoTupleSerializer.serialize(KryoTupleSerializer.java:40) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.daemon.worker.WorkerTransfer.tryTransferRemote(WorkerTransfer.java:116) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.daemon.worker.WorkerState.tryTransferRemote(WorkerState.java:524) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.executor.ExecutorTransfer.tryTransfer(ExecutorTransfer.java:68) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.executor.bolt.BoltExecutor$1.tryFlushPendingEmits(BoltExecutor.java:200) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:166) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:159) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.utils.Utils$1.run(Utils.java:394) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     ... 1 more
./worker.log:2021-10-15 13:51:32.757 STDERR Thread-2 [INFO] Caused by: java.util.ConcurrentModificationException
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445) ~[?:1.8.0_292]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at java.util.HashMap$EntryIterator.next(HashMap.java:1479) ~[?:1.8.0_292]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at java.util.HashMap$EntryIterator.next(HashMap.java:1477) ~[?:1.8.0_292]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.MapSerializer.write(MapSerializer.java:99) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.MapSerializer.write(MapSerializer.java:39) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) ~[kryo-3.0.3.jar:?]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.serialization.KryoValuesSerializer.serializeInto(KryoValuesSerializer.java:38) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.serialization.KryoTupleSerializer.serialize(KryoTupleSerializer.java:40) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.daemon.worker.WorkerTransfer.tryTransferRemote(WorkerTransfer.java:116) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.daemon.worker.WorkerState.tryTransferRemote(WorkerState.java:524) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.executor.ExecutorTransfer.tryTransfer(ExecutorTransfer.java:68) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.executor.bolt.BoltExecutor$1.tryFlushPendingEmits(BoltExecutor.java:200) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:166) ~[storm-client-2.2.0.jar:2.2.0]
./worker.log-2021-10-15 13:51:32.757 STDERR Thread-2 [INFO]     at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:159) ~[storm-client-2.2.0.jar:2.2.0]
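
For readers less familiar with the failure at the bottom of this trace: `HashMap` iterators are fail-fast, so a structural change to the map while something is iterating over it (in the trace, Kryo's `MapSerializer` walking the map inside `Metadata`) raises `ConcurrentModificationException`. The sketch below is only an illustration of that mechanism, not code from StormCrawler, Storm or Kryo. It triggers the exception deterministically in a single thread, whereas in the crawler the modification would come from another thread. The class name, method name and keys are hypothetical.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo class; not part of StormCrawler, Storm or Kryo. */
final class CmeDemo {

    /**
     * Iterates over a HashMap the way a serializer would, and adds a new key
     * part-way through. Returns true if the iterator detected the change.
     */
    static boolean iterationFailsWhenMapIsMutated() {
        Map<String, String> md = new HashMap<>();
        md.put("a", "1");
        md.put("b", "2");
        md.put("c", "3");
        try {
            for (Map.Entry<String, String> entry : md.entrySet()) {
                // Adding a new key is a structural modification; the next
                // call to the iterator's next() notices it and throws.
                md.put("added-while-iterating", entry.getValue());
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME raised: " + iterationFailsWhenMapIsMutated());
    }
}
```

When the modification comes from a different thread, the check is best-effort only, which would fit a failure that shows up intermittently under load.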

@jnioche we are working on a workaround to see if we can resolve this last serialization issue; we will keep you posted.

What kind of issue is this?

Thanks!

jcruzmartini commented 3 years ago

cc @juli-alvarez @matiascrespof

jnioche commented 3 years ago

The sync has been removed in #904; this will be part of the next release. Which version of Storm is your cluster on? This should have been fixed in 2.2.0. I recently upgraded the dependency to 2.3.0.

jcruzmartini commented 3 years ago

We are using

        <storm.version>2.2.0</storm.version>
        <storm.crawler.version>2.1</storm.crawler.version>

so it should already be fixed, but we will try overriding the Storm client dependency with 2.3.0.

Another important thing to note is that we are using a `fetcher.threads.per.queue` value greater than 1 (1 being the default value). Thanks @jnioche

jnioche commented 3 years ago

So your storm cluster is on 2.2.0?

jcruzmartini commented 3 years ago

@jnioche yes, our cluster is on 2.2.0. @juli-alvarez found a workaround that holds when using fetcher.threads.per.queue: 5. Basically, he is doing this:

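    private class FetcherThread extends Thread {

        @Override
        public void run() {
            while (true) {
                // (rest of the loop elided; fit and metadata are set earlier)

                // Before emitting, copy the metadata into a new instance so
                // the emitted tuple does not share the map with this thread.
                // Despite its name, immutableMd is a copy, not an immutable object.
                final Metadata immutableMd = new Metadata();
                immutableMd.putAll(metadata);
                collector.emit(
                        com.digitalpebble.stormcrawler.Constants.StatusStreamName,
                        fit.t,
                        new Values(fit.url, immutableMd, Status.ERROR));
            }
        }
    }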

We do this before every collector.emit call that appears inside the run() method. We know it is not a nice hotfix, but at least it unblocks us for now. I think the easiest way to reproduce the issue is to use a high value for fetcher.threads.per.queue.
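
The idea behind this hotfix, copying before handing the object off, can be sketched in isolation. The example below is a minimal illustration under stated assumptions: `SimpleMetadata` is a hypothetical stand-in written for this sketch, not the StormCrawler `Metadata` class, and no Storm collector is involved. It only shows that a copy taken before the hand-off is unaffected by later changes to the original.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo; SimpleMetadata is a stand-in, not StormCrawler's Metadata. */
final class DefensiveCopyDemo {

    static final class SimpleMetadata {
        private final Map<String, String[]> md = new HashMap<>();

        void setValue(String key, String value) {
            md.put(key, new String[] {value});
        }

        String getFirstValue(String key) {
            String[] values = md.get(key);
            return (values == null || values.length == 0) ? null : values[0];
        }

        int size() {
            return md.size();
        }

        /** Returns a new instance that shares neither the map nor the value arrays. */
        SimpleMetadata copy() {
            SimpleMetadata result = new SimpleMetadata();
            for (Map.Entry<String, String[]> e : md.entrySet()) {
                String[] values = e.getValue();
                result.md.put(e.getKey(), Arrays.copyOf(values, values.length));
            }
            return result;
        }
    }

    public static void main(String[] args) {
        SimpleMetadata original = new SimpleMetadata();
        original.setValue("key", "value");

        // Take the copy before the hand-off (the emit, in the crawler).
        SimpleMetadata emitted = original.copy();

        // Later changes to the original no longer touch the emitted copy.
        original.setValue("added.later", "x");

        System.out.println("emitted size:  " + emitted.size());
        System.out.println("original size: " + original.size());
    }
}
```

The trade-off is an extra allocation and copy per emit, which is the price of not sharing a mutable map between the fetcher thread and whatever serializes the tuple later.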

jnioche commented 2 years ago

Hi @jcruzmartini, any updates on this? I tried setting fetcher.threads.per.queue to 5 in one of my crawls but haven't been able to reproduce the issue.

jcruzmartini commented 2 years ago

Hi @jnioche, we are still getting this exception. Let me try to reproduce it using SC without any custom changes, to make sure this is not something on our end.

jcruzmartini commented 2 years ago

@jnioche closing this issue: we were not able to reproduce it using the master branch with Apache Storm 2.3.0. Thanks for your help.

jnioche commented 2 years ago

@jcruzmartini glad it's fixed. Thanks for checking