camunda / camunda

Process Orchestration Framework
https://camunda.com/platform/

Garbage collection and log compaction cause periodic performance drops #3876

Closed: Zelldon closed this issue 4 years ago

Zelldon commented 4 years ago

Describe the bug

We often wondered why we see these periodic throughput drops in our benchmarks.

(screenshot: throughput)

These drops can be correlated with log compaction and garbage collection pauses; at the very least, we see memory usage increasing and then dropping at the same time.

(screenshot: memory usage)

Disk usage also increases and drops at the same time.

(screenshot: disk usage)

This does not seem to happen every time; it might also be related to higher load.

The heartbeat latency increases, which indicates that the system is on hold for a short time due to GC.

(screenshot: heartbeat latency)
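As a side note on verifying the GC correlation: the cumulative GC pause time can be sampled from the standard JVM management beans and lined up with the throughput graph. The following is a minimal, purely illustrative sketch (not part of Zeebe); the class name and poll interval are made up.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Prints the GC pause time accumulated per interval; a spike that lines up
    // with a throughput drop supports the GC-pause hypothesis.
    public class GcPauseSampler {
      public static void main(String[] args) throws InterruptedException {
        long lastTotalMs = 0;
        while (true) {
          long totalMs = 0;
          for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            totalMs += gc.getCollectionTime(); // cumulative collection time in ms
          }
          System.out.printf("GC pause time in last interval: %d ms%n", totalMs - lastTotalMs);
          lastTotalMs = totalMs;
          Thread.sleep(5_000);
        }
      }
    }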

Zelldon commented 4 years ago

It seems that the deletion fails at first, before it eventually succeeds.

I 2020-02-19T01:02:43.386899340Z 2020-02-19 01:02:43.386 [Broker-0-LogStream-1] [Broker-0-zb-actors-0] ERROR io.zeebe.util.actor - Actor failed in phase 'STARTED'. Continue with next job.

I 2020-02-19T01:02:43.386930552Z java.lang.IllegalStateException: Segment not open
    at com.google.common.base.Preconditions.checkState(Preconditions.java:508) ~[guava-28.2-jre.jar:?]
    at io.atomix.storage.journal.JournalSegment.checkOpen(JournalSegment.java:237) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
    at io.atomix.storage.journal.JournalSegment.createReader(JournalSegment.java:213) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
    at io.atomix.storage.journal.SegmentedJournalReader.replaceCurrentSegment(SegmentedJournalReader.java:204) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
    at io.atomix.storage.journal.SegmentedJournalReader.reset(SegmentedJournalReader.java:128) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
    at io.atomix.storage.journal.SegmentedJournalReader.reset(SegmentedJournalReader.java:136) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
    at io.atomix.storage.journal.DelegatingJournalReader.reset(DelegatingJournalReader.java:77) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
    at io.zeebe.logstreams.storage.atomix.AtomixLogStorageReader.findEntry(AtomixLogStorageReader.java:166) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.logstreams.storage.atomix.AtomixLogStorageReader.lookUpApproximateAddress(AtomixLogStorageReader.java:123) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.seek(LogStreamReaderImpl.java:70) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.logstreams.impl.log.LogStreamImpl.lambda$delete$4(LogStreamImpl.java:168) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.util.sched.ActorControl.lambda$call$0(ActorControl.java:150) ~[zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:62) ~[zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:115) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.util.sched.ActorThread.run(ActorThread.java:195) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
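
For illustration only (this is not the actual Zeebe or Atomix code): the stack trace suggests the segment backing the journal reader was closed by concurrent compaction while the delete job was seeking. A defensive variant could treat the resulting IllegalStateException as transient and retry with a fresh reader; JournalReaderLike and its methods below are hypothetical placeholders.

    import java.util.function.Supplier;

    // Hypothetical reader abstraction standing in for the journal reader.
    interface JournalReaderLike extends AutoCloseable {
      long lookUpApproximateAddress(long position); // may throw IllegalStateException
      @Override void close();
    }

    final class RetryingDeleteJob {
      private final Supplier<JournalReaderLike> readerFactory;

      RetryingDeleteJob(Supplier<JournalReaderLike> readerFactory) {
        this.readerFactory = readerFactory;
      }

      // Resolves the address to delete up to, retrying if the segment under the
      // reader was compacted away between creating the reader and seeking.
      long resolveDeleteAddress(long position, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
          try (JournalReaderLike reader = readerFactory.get()) {
            return reader.lookUpApproximateAddress(position);
          } catch (IllegalStateException e) {
            // "Segment not open": try again with a fresh reader on the next loop.
          }
        }
        return -1; // could not resolve; skip deletion this round
      }
    }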
Zelldon commented 4 years ago

Ok, I think it is related to the bug that the followers do not receive any snapshots: https://github.com/zeebe-io/zeebe/issues/3763

But at some point it somehow gets a snapshot, probably directly from Atomix?

I 2020-02-19T01:02:42.786452257Z 2020-02-19 01:02:42.786 [Broker-2-DeletionService-1] [Broker-2-zb-actors-3] DEBUG io.zeebe.broker.system - The lowest exported position for snapshot /usr/local/zeebe/data/raft-partition/partitions/1/snapshots/44601777-5-1582072356421-104088534587480 is 104110014945072

I 2020-02-19T01:02:43.170791249Z 2020-02-19 01:02:43.168 [] [raft-server-0-raft-partition-partition-1] INFO  org.camunda.feel.FeelEngine - Engine created. [value-mapper: CompositeValueMapper(List(org.camunda.feel.spi.JavaValueMapper@378027a4)), function-provider: org.camunda.feel.interpreter.FunctionProvider$EmptyFunctionProvider$@7b94b62b]

I 2020-02-19T01:02:43.172616614Z 2020-02-19 01:02:43.172 [] [raft-server-0-raft-partition-partition-1] DEBUG io.zeebe.broker.clustering.atomix.storage.snapshot.DbSnapshotStore - The last processed position for snapshot /usr/local/zeebe/data/raft-partition/partitions/1/pending/46183932-5-1582074160416 is 107868105519096

I 2020-02-19T01:02:43.179829601Z 2020-02-19 01:02:43.179 [] [raft-server-0-raft-partition-partition-1] INFO  io.zeebe.broker.clustering.atomix.storage.snapshot.AtomixSnapshotStorage - Max snapshot count reached (4), purging snapshots older than DbSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/32866203-2-1582060685323-76729593378920, metadata=DbSnapshotMetadata{index=32866203, term=2, timestamp=2020-02-18 09:18:05,323, position=76729593378920}}

I 2020-02-19T01:02:43.179922771Z 2020-02-19 01:02:43.179 [] [raft-server-0-raft-partition-partition-1] DEBUG io.zeebe.broker.clustering.atomix.storage.snapshot.DbSnapshotStore - Deleting snapshot DbSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/32056241-2-1582059784322-74844100258816, metadata=DbSnapshotMetadata{index=32056241, term=2, timestamp=2020-02-18 09:03:04,322, position=74844100258816}}

I 2020-02-19T01:02:43.208208255Z 2020-02-19 01:02:43.207 [] [raft-server-0-raft-partition-partition-1] DEBUG io.zeebe.broker.clustering.atomix.storage.snapshot.AtomixSnapshotStorage - Snapshot DbSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/32056241-2-1582059784322-74844100258816, metadata=DbSnapshotMetadata{index=32056241, term=2, timestamp=2020-02-18 09:03:04,322, position=74844100258816}} removed from store io.zeebe.broker.clustering.atomix.storage.snapshot.DbSnapshotStore@62f0d5ac

I 2020-02-19T01:02:43.208330570Z 2020-02-19 01:02:43.208 [] [raft-server-0-raft-partition-partition-1] DEBUG io.zeebe.broker.clustering.atomix.storage.snapshot.DbSnapshotStore - Search for orphaned snapshots below oldest valid snapshot DbSnapshotMetadata{index=32866203, term=2, timestamp=2020-02-18 09:18:05,323, position=76729593378920}

I 2020-02-19T01:02:43.208674972Z 2020-02-19 01:02:43.208 [] [raft-server-0-raft-partition-partition-1] DEBUG io.zeebe.broker.clustering.atomix.storage.snapshot.DbSnapshotStore - Committed new snapshot DbSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/46183932-5-1582074160416-107868105519096, metadata=DbSnapshotMetadata{index=46183932, term=5, timestamp=2020-02-19 01:02:40,416, position=107868105519096}}

I 2020-02-19T01:02:43.378364337Z 2020-02-19 01:02:43.378 [Broker-0-DeletionService-1] [Broker-0-zb-actors-0] INFO  org.camunda.feel.FeelEngine - Engine created. [value-mapper: CompositeValueMapper(List(org.camunda.feel.spi.JavaValueMapper@70fc2f87)), function-provider: org.camunda.feel.interpreter.FunctionProvider$EmptyFunctionProvider$@7b94b62b]

I 2020-02-19T01:02:43.380791449Z 2020-02-19 01:02:43.380 [Broker-0-DeletionService-1] [Broker-0-zb-actors-0] DEBUG io.zeebe.broker.system - The last processed position for snapshot /usr/local/zeebe/data/raft-partition/partitions/1/snapshots/32866203-2-1582060685323-76729593378920 is 76759658684368

I 2020-02-19T01:02:43.382122472Z 2020-02-19 01:02:43.381 [Broker-0-DeletionService-1] [Broker-0-zb-actors-0] DEBUG io.zeebe.broker.system - The lowest exported position for snapshot /usr/local/zeebe/data/raft-partition/partitions/1/snapshots/32866203-2-1582060685323-76729593378920 is 76755368374320

I 2020-02-19T01:02:43.386899340Z 2020-02-19 01:02:43.386 [Broker-0-LogStream-1] [Broker-0-zb-actors-0] ERROR io.zeebe.util.actor - Actor failed in phase 'STARTED'. Continue with next job.

I 2020-02-19T01:02:43.386930552Z java.lang.IllegalStateException: Segment not open
    at com.google.common.base.Preconditions.checkState(Preconditions.java:508) ~[guava-28.2-jre.jar:?]
    at io.atomix.storage.journal.JournalSegment.checkOpen(JournalSegment.java:237) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
    at io.atomix.storage.journal.JournalSegment.createReader(JournalSegment.java:213) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
    at io.atomix.storage.journal.SegmentedJournalReader.replaceCurrentSegment(SegmentedJournalReader.java:204) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
    at io.atomix.storage.journal.SegmentedJournalReader.reset(SegmentedJournalReader.java:128) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
    at io.atomix.storage.journal.SegmentedJournalReader.reset(SegmentedJournalReader.java:136) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
    at io.atomix.storage.journal.DelegatingJournalReader.reset(DelegatingJournalReader.java:77) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
    at io.zeebe.logstreams.storage.atomix.AtomixLogStorageReader.findEntry(AtomixLogStorageReader.java:166) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.logstreams.storage.atomix.AtomixLogStorageReader.lookUpApproximateAddress(AtomixLogStorageReader.java:123) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.seek(LogStreamReaderImpl.java:70) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.logstreams.impl.log.LogStreamImpl.lambda$delete$4(LogStreamImpl.java:168) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.util.sched.ActorControl.lambda$call$0(ActorControl.java:150) ~[zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:62) ~[zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:115) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
    at io.zeebe.util.sched.ActorThread.run(ActorThread.java:195) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]

I 2020-02-19T01:02:50.767706133Z 2020-02-19 01:02:50.765 [Broker-0-LogStream-1] [Broker-2-zb-actors-3] DEBUG io.zeebe.logstreams - Delete data from log stream until position '104110014945072' (address: '44612370').

Then it tries to delete the log, which seems to fail for partition 0.

(screenshot: atomix)
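To summarize what the log sequence above appears to show, in pseudo-Java: when the maximum snapshot count is reached, older snapshots are deleted, and the log can then be compacted up to the position of the oldest snapshot that remains. The Snapshot type and purge method below are hypothetical placeholders, not the actual AtomixSnapshotStorage or DbSnapshotStore API.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.function.Consumer;

    // Hypothetical stand-in for a committed snapshot with its log index/position.
    record Snapshot(long index, long position) {}

    final class SnapshotPurger {
      private final int maxSnapshotCount; // e.g. 4, as in the log above

      SnapshotPurger(int maxSnapshotCount) {
        this.maxSnapshotCount = maxSnapshotCount;
      }

      // Deletes the oldest snapshots until only maxSnapshotCount remain and
      // returns the position of the oldest kept snapshot; everything below that
      // position is then safe to delete from the log.
      long purge(List<Snapshot> snapshots, Consumer<Snapshot> deleteSnapshot) {
        final List<Snapshot> sorted = new ArrayList<>(snapshots);
        sorted.sort(Comparator.comparingLong(Snapshot::index));
        while (sorted.size() > maxSnapshotCount) {
          deleteSnapshot.accept(sorted.remove(0));
        }
        return sorted.isEmpty() ? -1 : sorted.get(0).position();
      }
    }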

Zelldon commented 4 years ago

As described in #3988 and https://github.com/zeebe-io/zeebe/issues/3893#issuecomment-598065790, RocksDB snapshotting can cause these drops. Furthermore, we had problems on seek, which we have since fixed.
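
For context on why RocksDB snapshotting is expensive enough to show up in the throughput graph: a Zeebe snapshot is essentially a RocksDB checkpoint, which flushes the memtables and hard-links or copies the SST files, i.e. a burst of disk I/O. A minimal standalone sketch using the public org.rocksdb API (the paths here are made up):

    import org.rocksdb.Checkpoint;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    public class CheckpointExample {
      public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
            RocksDB db = RocksDB.open(options, "/tmp/example-db");
            Checkpoint checkpoint = Checkpoint.create(db)) {
          // Flushes memtables and materializes the checkpoint directory; the
          // target directory must not exist yet.
          checkpoint.createCheckpoint("/tmp/example-db-checkpoint-1");
        }
      }
    }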