Zelldon closed this issue 4 years ago.
It seems that the deletion fails before it eventually succeeds.
I 2020-02-19T01:02:43.386899340Z 2020-02-19 01:02:43.386 [Broker-0-LogStream-1] [Broker-0-zb-actors-0] ERROR io.zeebe.util.actor - Actor failed in phase 'STARTED'. Continue with next job.
I 2020-02-19T01:02:43.386930552Z java.lang.IllegalStateException: Segment not open
at com.google.common.base.Preconditions.checkState(Preconditions.java:508) ~[guava-28.2-jre.jar:?]
at io.atomix.storage.journal.JournalSegment.checkOpen(JournalSegment.java:237) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
at io.atomix.storage.journal.JournalSegment.createReader(JournalSegment.java:213) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
at io.atomix.storage.journal.SegmentedJournalReader.replaceCurrentSegment(SegmentedJournalReader.java:204) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
at io.atomix.storage.journal.SegmentedJournalReader.reset(SegmentedJournalReader.java:128) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
at io.atomix.storage.journal.SegmentedJournalReader.reset(SegmentedJournalReader.java:136) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
at io.atomix.storage.journal.DelegatingJournalReader.reset(DelegatingJournalReader.java:77) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
at io.zeebe.logstreams.storage.atomix.AtomixLogStorageReader.findEntry(AtomixLogStorageReader.java:166) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.logstreams.storage.atomix.AtomixLogStorageReader.lookUpApproximateAddress(AtomixLogStorageReader.java:123) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.seek(LogStreamReaderImpl.java:70) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.logstreams.impl.log.LogStreamImpl.lambda$delete$4(LogStreamImpl.java:168) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.util.sched.ActorControl.lambda$call$0(ActorControl.java:150) ~[zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:62) ~[zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:115) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.util.sched.ActorThread.run(ActorThread.java:195) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
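The stack trace suggests a race: the deletion job resets a journal reader onto a segment that compaction has concurrently closed, and `JournalSegment.checkOpen` then throws. A minimal sketch of that race and one possible mitigation, assuming hypothetical names (`Segment`, `GuardedReader` are illustrative, not the actual Atomix API):

```java
import java.util.Optional;

class Segment {
  private volatile boolean open = true;

  // Compaction closes the segment while a reader may still reference it.
  void close() { open = false; }

  long firstIndex() {
    if (!open) {
      throw new IllegalStateException("Segment not open"); // mirrors the trace
    }
    return 42L;
  }
}

class GuardedReader {
  // Instead of letting the whole actor job fail, treat a concurrently
  // closed segment as "entry not found" so the caller can retry later.
  Optional<Long> seek(Segment segment) {
    try {
      return Optional.of(segment.firstIndex());
    } catch (IllegalStateException e) {
      return Optional.empty();
    }
  }
}

public class SegmentRace {
  public static void main(String[] args) {
    Segment segment = new Segment();
    GuardedReader reader = new GuardedReader();
    System.out.println(reader.seek(segment).isPresent()); // true
    segment.close();                                      // compaction runs
    System.out.println(reader.seek(segment).isPresent()); // false, no exception
  }
}
```

Whether catching the exception or re-checking the segment state under a lock is the right fix depends on Atomix's internal locking, which this sketch does not model.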
OK, I think this is related to the bug where followers do not receive any snapshots: https://github.com/zeebe-io/zeebe/issues/3763
But at some point it somehow gets a snapshot, probably directly from Atomix?
I 2020-02-19T01:02:42.786452257Z 2020-02-19 01:02:42.786 [Broker-2-DeletionService-1] [Broker-2-zb-actors-3] DEBUG io.zeebe.broker.system - The lowest exported position for snapshot /usr/local/zeebe/data/raft-partition/partitions/1/snapshots/44601777-5-1582072356421-104088534587480 is 104110014945072
I 2020-02-19T01:02:43.170791249Z 2020-02-19 01:02:43.168 [] [raft-server-0-raft-partition-partition-1] INFO org.camunda.feel.FeelEngine - Engine created. [value-mapper: CompositeValueMapper(List(org.camunda.feel.spi.JavaValueMapper@378027a4)), function-provider: org.camunda.feel.interpreter.FunctionProvider$EmptyFunctionProvider$@7b94b62b]
I 2020-02-19T01:02:43.172616614Z 2020-02-19 01:02:43.172 [] [raft-server-0-raft-partition-partition-1] DEBUG io.zeebe.broker.clustering.atomix.storage.snapshot.DbSnapshotStore - The last processed position for snapshot /usr/local/zeebe/data/raft-partition/partitions/1/pending/46183932-5-1582074160416 is 107868105519096
I 2020-02-19T01:02:43.179829601Z 2020-02-19 01:02:43.179 [] [raft-server-0-raft-partition-partition-1] INFO io.zeebe.broker.clustering.atomix.storage.snapshot.AtomixSnapshotStorage - Max snapshot count reached (4), purging snapshots older than DbSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/32866203-2-1582060685323-76729593378920, metadata=DbSnapshotMetadata{index=32866203, term=2, timestamp=2020-02-18 09:18:05,323, position=76729593378920}}
I 2020-02-19T01:02:43.179922771Z 2020-02-19 01:02:43.179 [] [raft-server-0-raft-partition-partition-1] DEBUG io.zeebe.broker.clustering.atomix.storage.snapshot.DbSnapshotStore - Deleting snapshot DbSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/32056241-2-1582059784322-74844100258816, metadata=DbSnapshotMetadata{index=32056241, term=2, timestamp=2020-02-18 09:03:04,322, position=74844100258816}}
I 2020-02-19T01:02:43.208208255Z 2020-02-19 01:02:43.207 [] [raft-server-0-raft-partition-partition-1] DEBUG io.zeebe.broker.clustering.atomix.storage.snapshot.AtomixSnapshotStorage - Snapshot DbSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/32056241-2-1582059784322-74844100258816, metadata=DbSnapshotMetadata{index=32056241, term=2, timestamp=2020-02-18 09:03:04,322, position=74844100258816}} removed from store io.zeebe.broker.clustering.atomix.storage.snapshot.DbSnapshotStore@62f0d5ac
I 2020-02-19T01:02:43.208330570Z 2020-02-19 01:02:43.208 [] [raft-server-0-raft-partition-partition-1] DEBUG io.zeebe.broker.clustering.atomix.storage.snapshot.DbSnapshotStore - Search for orphaned snapshots below oldest valid snapshot DbSnapshotMetadata{index=32866203, term=2, timestamp=2020-02-18 09:18:05,323, position=76729593378920}
I 2020-02-19T01:02:43.208674972Z 2020-02-19 01:02:43.208 [] [raft-server-0-raft-partition-partition-1] DEBUG io.zeebe.broker.clustering.atomix.storage.snapshot.DbSnapshotStore - Committed new snapshot DbSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/46183932-5-1582074160416-107868105519096, metadata=DbSnapshotMetadata{index=46183932, term=5, timestamp=2020-02-19 01:02:40,416, position=107868105519096}}
I 2020-02-19T01:02:43.378364337Z 2020-02-19 01:02:43.378 [Broker-0-DeletionService-1] [Broker-0-zb-actors-0] INFO org.camunda.feel.FeelEngine - Engine created. [value-mapper: CompositeValueMapper(List(org.camunda.feel.spi.JavaValueMapper@70fc2f87)), function-provider: org.camunda.feel.interpreter.FunctionProvider$EmptyFunctionProvider$@7b94b62b]
I 2020-02-19T01:02:43.380791449Z 2020-02-19 01:02:43.380 [Broker-0-DeletionService-1] [Broker-0-zb-actors-0] DEBUG io.zeebe.broker.system - The last processed position for snapshot /usr/local/zeebe/data/raft-partition/partitions/1/snapshots/32866203-2-1582060685323-76729593378920 is 76759658684368
I 2020-02-19T01:02:43.382122472Z 2020-02-19 01:02:43.381 [Broker-0-DeletionService-1] [Broker-0-zb-actors-0] DEBUG io.zeebe.broker.system - The lowest exported position for snapshot /usr/local/zeebe/data/raft-partition/partitions/1/snapshots/32866203-2-1582060685323-76729593378920 is 76755368374320
I 2020-02-19T01:02:43.386899340Z 2020-02-19 01:02:43.386 [Broker-0-LogStream-1] [Broker-0-zb-actors-0] ERROR io.zeebe.util.actor - Actor failed in phase 'STARTED'. Continue with next job.
I 2020-02-19T01:02:43.386930552Z java.lang.IllegalStateException: Segment not open
at com.google.common.base.Preconditions.checkState(Preconditions.java:508) ~[guava-28.2-jre.jar:?]
at io.atomix.storage.journal.JournalSegment.checkOpen(JournalSegment.java:237) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
at io.atomix.storage.journal.JournalSegment.createReader(JournalSegment.java:213) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
at io.atomix.storage.journal.SegmentedJournalReader.replaceCurrentSegment(SegmentedJournalReader.java:204) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
at io.atomix.storage.journal.SegmentedJournalReader.reset(SegmentedJournalReader.java:128) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
at io.atomix.storage.journal.SegmentedJournalReader.reset(SegmentedJournalReader.java:136) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
at io.atomix.storage.journal.DelegatingJournalReader.reset(DelegatingJournalReader.java:77) ~[atomix-storage-3.2.0-20200214.153035-125.jar:?]
at io.zeebe.logstreams.storage.atomix.AtomixLogStorageReader.findEntry(AtomixLogStorageReader.java:166) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.logstreams.storage.atomix.AtomixLogStorageReader.lookUpApproximateAddress(AtomixLogStorageReader.java:123) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.seek(LogStreamReaderImpl.java:70) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.logstreams.impl.log.LogStreamImpl.lambda$delete$4(LogStreamImpl.java:168) ~[zeebe-logstreams-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.util.sched.ActorControl.lambda$call$0(ActorControl.java:150) ~[zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:62) ~[zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:115) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
at io.zeebe.util.sched.ActorThread.run(ActorThread.java:195) [zeebe-util-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]
I 2020-02-19T01:02:50.767706133Z 2020-02-19 01:02:50.765 [Broker-0-LogStream-1] [Broker-2-zb-actors-3] DEBUG io.zeebe.logstreams - Delete data from log stream until position '104110014945072' (address: '44612370').
Then it tries to delete the log, which seems to fail for partition 0.
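The log line above ("Delete data from log stream until position ... (address ...)") implies the log stream resolves a position to a segment address and then drops only segments that lie entirely below it. A hedged sketch of that semantics; all names are assumptions for illustration, not the real implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SegmentDeletion {
  // Each segment covers an inclusive range of record positions.
  record Segment(long firstPosition, long lastPosition) {}

  // Removes the oldest segments whose entire range lies below `position`.
  // A segment that still contains live entries at or above it is kept,
  // which is why deletion is coarse-grained and segment-aligned.
  static int deleteUntil(Deque<Segment> segments, long position) {
    int deleted = 0;
    while (!segments.isEmpty() && segments.peekFirst().lastPosition() < position) {
      segments.pollFirst();
      deleted++;
    }
    return deleted;
  }

  public static void main(String[] args) {
    Deque<Segment> segments = new ArrayDeque<>();
    segments.add(new Segment(0, 99));
    segments.add(new Segment(100, 199));
    segments.add(new Segment(200, 299));
    // Only the first segment is fully below position 150.
    System.out.println(deleteUntil(segments, 150));
  }
}
```

Under this model, deletion up to position 104110014945072 would leave the segment containing that position intact, so a failed or delayed deletion only postpones reclaiming whole segments.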
As described in #3988 and https://github.com/zeebe-io/zeebe/issues/3893#issuecomment-598065790, RocksDB snapshotting can cause these drops. Furthermore, we had problems on seek, which we have since fixed.
Describe the bug
We often wondered why we see these periodic throughput drops in our benchmarks.
The drops correlate with log compaction and garbage collection pauses; at least, we see memory usage climb and then drop at the same time.
Disk usage also grows and drops at the same time.
This does not seem to happen every time; it might also be related to higher load.
The heartbeat latency increases, which indicates that the system is briefly on hold due to GC.
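One cheap way to confirm the GC-pause hypothesis independently of heartbeat metrics is a sleep-drift detector: a thread sleeps for a fixed interval and reports when it wakes up much later than scheduled, which usually means a stop-the-world GC (or OS scheduling) pause. A minimal sketch; the interval and any threshold you pick are arbitrary assumptions:

```java
public class PauseDetector {
  // Returns how many milliseconds longer than `intervalMs` the sleep
  // actually took; large drift hints at a stop-the-world pause.
  static long measureDriftMs(long intervalMs) throws InterruptedException {
    long start = System.nanoTime();
    Thread.sleep(intervalMs);
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    return Math.max(0, elapsedMs - intervalMs);
  }

  public static void main(String[] args) throws InterruptedException {
    // Run this in a loop alongside the benchmark and log spikes.
    System.out.println("drift ms: " + measureDriftMs(10));
  }
}
```

Correlating drift spikes from such a detector with the throughput dips would distinguish GC pauses from compaction-induced I/O stalls.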