Open Zelldon opened 1 year ago
We can see that the segment creation is spiking at the same time, and the values are quite high.
The segment creation metrics shown are incorrect (8 years :smile:). But what is interesting is the last written index update time, which also spikes around the same time.
:D Yeah but I guess also the 4 second spike is something which is not optimal :D
But I don't think it is taking 4 seconds. All of the results shown are incorrect. Maybe you could try the workaround from #11814 of adding `by (le) != 0` to the query.
Yes it looks different
But as you also mentioned here https://github.com/camunda/zeebe/issues/11814#issuecomment-1495584042
I'm not 100% sure or convinced whether this is a valid solution, or whether we are missing some data.
Compact metrics
Without the zeros:
Looks like we are missing something :thinking: I don't know. I also realized that we use a different interval here than on the other heatmaps :sweat_smile:
ZDP Triage:
=> We need to try to reproduce it once more to check whether this is really related to the segment creation.
I had a look at the dashboard overview and in my opinion this issue still exists
We can see that periodically (every 30 min) there are spikes in latency, up to 1 sec. For low load with one instance per second, this is not acceptable.
Checking one of these benchmarks we can see the following:
Pods are NOT restarted during this time, throughput is stable.
In the process execution latency tab we can see the spikes as well.
When taking a look at the journal metrics we can see the same spikes.
Furthermore, segment creation is happening at the same time.
@megglos I propose that we have another look at this.
Just a reminder, we already have a working solution for this that we've prototyped - recycling segment files. It worked pretty well, with the drawback that you end up using more disk size consistently. So the "solution" is known here, just a matter of prioritization (unless someone has another solution of course ;))
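To make the recycling idea concrete, here is a minimal sketch. It is not the prototype from the PR and not Zeebe's journal API: the class `SegmentRecycler`, its methods, and the `recycled-N.tmp` naming are all hypothetical. The assumption is that renaming an already allocated file is much cheaper than allocating a fresh one, so compacted segments are parked instead of deleted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of segment file recycling; names are illustrative only. */
final class SegmentRecycler {
  private final Path directory;
  private final long segmentSize;
  private final Deque<Path> parkedFiles = new ArrayDeque<>();
  private int parkedCounter = 0;

  SegmentRecycler(Path directory, long segmentSize) {
    this.directory = directory;
    this.segmentSize = segmentSize;
  }

  /** Called instead of deleting a compacted segment: keep the file around for reuse. */
  synchronized void release(Path segment) {
    try {
      Path parked = directory.resolve("recycled-" + (parkedCounter++) + ".tmp");
      Files.move(segment, parked); // rename within the same directory, no data is copied
      parkedFiles.push(parked);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /**
   * Provides a file of segmentSize bytes at the target path.
   * Returns true if a parked file was reused, false if a new file had to be allocated.
   */
  synchronized boolean acquire(Path target) {
    try {
      Path parked = parkedFiles.poll();
      if (parked != null) {
        // Fast path: the disk blocks are already allocated. The file still contains
        // stale data, so a real journal would have to invalidate the old content.
        Files.move(parked, target);
        return true;
      }
      // Slow path: allocate the file by writing zeros (the expensive part).
      try (FileChannel channel = FileChannel.open(
          target, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
        ByteBuffer zeros = ByteBuffer.allocate(64 * 1024);
        long remaining = segmentSize;
        while (remaining > 0) {
          zeros.clear();
          zeros.limit((int) Math.min(zeros.capacity(), remaining));
          remaining -= channel.write(zeros);
        }
        channel.force(true);
      }
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The drawback mentioned above shows up directly in this sketch: parked files are never deleted, so disk usage stays at its high-water mark.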
ZDP-Planning:
So I stumbled over this again. We sometimes have spikes of up to 500 ms, which I think is too high, especially if we just run one instance per second. I was not able to correlate the spikes with role changes or pod restarts.
To me it still looks related to the segment creation, but yeah, I guess this needs to be investigated further.
I hope I can take a look at it next week (when I'm medic).
I was not able to look into the issue.
@npepinpe is there an issue regarding your comment https://github.com/camunda/zeebe/issues/12311#issuecomment-1612597259 here, or a PR or branch you have somewhere?
There was a PR. The branch was probably deleted since. Let me see if I can pull it up.
Seems like there are too many bugs to see anything useful here. Let's stop that.
Love that :D
Thanks @npepinpe for digging it out
I remember the general approach worked; eventually there was no preallocation occurring. But yeah, the PR was a hack, so it ran into other issues :sweat_smile:
I will not have time to work on it or look at it soon.
Describe the bug
We recently worked on a new workaround of creating segments asynchronously, in order to avoid the impact of segment creation on the process execution latency. It looks like we haven't been successful with this approach.
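For readers unfamiliar with the workaround, the general idea of asynchronous segment creation can be sketched as follows. This is an illustration only, not the actual Zeebe implementation: `AsyncSegmentProvider` and its methods are hypothetical names, and the expensive creation step is passed in as a function.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.LongFunction;

/** Hypothetical sketch: prepare the next segment in the background before it is needed. */
final class AsyncSegmentProvider<S> {
  private final LongFunction<S> createSegment; // the expensive step, e.g. file preallocation
  private final Executor executor;
  private long nextId;
  private CompletableFuture<S> next;

  AsyncSegmentProvider(LongFunction<S> createSegment, Executor executor, long firstId) {
    this.createSegment = createSegment;
    this.executor = executor;
    this.nextId = firstId;
    this.next = prepare(); // start creating the first segment right away
  }

  private CompletableFuture<S> prepare() {
    final long id = nextId++;
    return CompletableFuture.supplyAsync(() -> createSegment.apply(id), executor);
  }

  /**
   * Called by the writer when the current segment is full. Blocks only if the
   * background creation has not finished yet, then schedules the following segment.
   */
  synchronized S nextSegment() {
    S segment = next.join();
    next = prepare();
    return segment;
  }
}
```

Note that such an approach only moves the allocation off the writer thread. The disk I/O still happens, so it may still compete with commits, which would be consistent with the commit latency observation below.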
During today's chaos day we (@npepinpe and I) again observed a major spike in process instance execution time every 6 minutes, up to 1.5 s, whereas the average is 0.06 s.
We can see that the commit latency is also high at the same time, which might be an issue caused by the high I/O.
We first expected that to be related to snapshotting, but snapshots happen much more often.
We can see that the segment creation is spiking at the same time, and the values are quite high.
You might wonder why this is not visible in our benchmarks, and I think the issue is that it is hidden by the higher process execution latency which we have in our benchmark.
BUT the segment creation metrics don't show such big values, so I don't know for sure.
To Reproduce
Create a benchmark with low load, and you will see the behavior.
Expected behavior
Segment creation shouldn't impact the process instance execution in such a way.
Environment: