camunda / camunda

Process Orchestration Framework
https://camunda.com/platform/

Low load shows/highlights segment creation impact on process execution time #12311

Open Zelldon opened 1 year ago

Zelldon commented 1 year ago

Describe the bug

We recently worked on a new workaround of creating segments asynchronously, in order to avoid the impact of segment creation on process execution latency. It looks like we haven't been successful with this approach.
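For context, the rough idea of that workaround looks something like the sketch below. This is a minimal illustration with hypothetical class and method names, not the actual Zeebe/Atomix implementation: the next segment file is prepared on a background thread so that, ideally, the append path never blocks on file creation.

```java
// Hypothetical sketch of the "create segments asynchronously" idea: while the
// writer appends to the current segment, the next segment file is prepared on a
// background executor. Names are illustrative only.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

final class AsyncSegmentAllocator {
  private final Executor allocationExecutor = Executors.newSingleThreadExecutor();
  private final long segmentSizeBytes;
  private CompletableFuture<Path> nextSegment;

  AsyncSegmentAllocator(final long segmentSizeBytes) {
    this.segmentSizeBytes = segmentSizeBytes;
  }

  /** Kick off creation of the next segment file while the current one is still in use. */
  void prepareNextSegment(final Path path) {
    nextSegment =
        CompletableFuture.supplyAsync(
            () -> {
              try (final RandomAccessFile file = new RandomAccessFile(path.toFile(), "rw")) {
                // Pre-allocate the full segment up front so later appends do not pay
                // for file-system block allocation; this is the I/O that causes the
                // spike when it happens synchronously on roll-over.
                file.setLength(segmentSizeBytes);
              } catch (final IOException e) {
                throw new UncheckedIOException(e);
              }
              return path;
            },
            allocationExecutor);
  }

  /** Called on segment roll-over; ideally the file is already fully prepared. */
  Path takeNextSegment() {
    // If allocation has not finished by the time the writer rolls over, the
    // append path still blocks here - one way latency spikes can leak through.
    return nextSegment.join();
  }
}
```

The catch is the slow path in takeNextSegment(): if the background allocation has not finished when the writer rolls over, the writer still blocks, which would be consistent with the spikes described below.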

During today's chaos day, we (@npepinpe and I) again observed a major spike in process instance execution time every 6 minutes, up to 1.5 s, whereas the average is at 0.06 s.

latency

We can see that the commit latency is also high at the same time, which might be an issue because of the high I/O.

commit

We first expected that to be related to snapshotting, but snapshots happen much more often.

snapshot-count

We can see that segment creation is spiking at the same time, and the values are quite high.

segment

You might wonder why this is not visible in our benchmarks; I think it is hidden by the higher process execution latency we have in our benchmarks.

commitmedic

latencymedic

BUT the segment creation metrics don't show such big values, so I don't know for sure.

To Reproduce

Create a benchmark with low load, and you will see the behavior.

Expected behavior

Segment creation shouldn't impact process instance execution in such a way.

Environment:

deepthidevaki commented 1 year ago

We can see that segment creation is spiking at the same time, and the values are quite high.

The segment creation metrics shown are incorrect - 8 yrs :smile: But what's interesting is the last written index update time, which also spikes around the same time.

Zelldon commented 1 year ago

:D Yeah, but I guess the 4-second spike is also something which is not optimal :D

deepthidevaki commented 1 year ago

:D Yeah, but I guess the 4-second spike is also something which is not optimal :D

But I don't think it is taking 4 seconds. The whole results shown are incorrect. Maybe you could try the workaround from #11814 by adding by (le) != 0

Zelldon commented 1 year ago

Yes, it looks different.

segmentcreation

But as you also mentioned here https://github.com/camunda/zeebe/issues/11814#issuecomment-1495584042

I'm not 100 % sure or convinced whether this is a valid solution, or whether we are missing some data.

Compact metrics:

compact

Without the zeros:

compactwithout

Looks like we are missing something :thinking: Idk. I also realized that a different interval is used here than on the other heatmaps :sweat_smile:

megglos commented 1 year ago

ZDP Triage:

=> we need to try to reproduce it once more to revisit whether this is really related to segment creation

Zelldon commented 1 year ago

I had a look at the dashboard overview, and in my opinion this issue still exists.

all

We can see that periodically (every 30 min) there are spikes in latency, up to 1 sec. For low load with one instance per second, this is not acceptable.


Checking one of these benchmarks we can see the following:

Pods are NOT restarted during this time, and throughput is stable.

pod

In the process execution latency tab we can see the spikes as well.

latency

When taking a look at the journal metrics we can see the same spikes.

journal

Furthermore, segment creation is happening at the same time.

segment

@megglos I propose that we have another look at this.

npepinpe commented 1 year ago

Just a reminder: we already have a working solution for this that we've prototyped - recycling segment files. It worked pretty well, with the drawback that you end up consistently using more disk space. So the "solution" is known here, it's just a matter of prioritization (unless someone has another solution, of course ;))
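For illustration, the recycling idea boils down to something like the following sketch (hypothetical names, not the actual prototype): instead of deleting compacted segment files, park them in a free list and hand them back out on roll-over, so no new file has to be created and pre-allocated on the hot path.

```java
// Hypothetical sketch of segment-file recycling. The trade-off from the comment
// above: the pooled files keep occupying disk even though they hold no live data.
// A real implementation would also need to reset the segment header/content of a
// recycled file before reuse, which this sketch omits.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayDeque;
import java.util.Deque;

final class SegmentRecycler {
  private final Deque<Path> freeSegments = new ArrayDeque<>();

  /** Instead of deleting a compacted segment, keep the file around for reuse. */
  synchronized void recycle(final Path compactedSegment) {
    freeSegments.addLast(compactedSegment);
  }

  /**
   * On roll-over, reuse a recycled file if one is available; only fall back to
   * creating (and paying for) a brand-new file when the pool is empty.
   */
  synchronized Path acquire(final Path newSegmentPath) throws IOException {
    final Path recycled = freeSegments.pollFirst();
    if (recycled == null) {
      return Files.createFile(newSegmentPath); // slow path: new allocation
    }
    // Fast path: rename the already-allocated file into place; its blocks are
    // already on disk, so no large synchronous write is needed here.
    return Files.move(recycled, newSegmentPath, StandardCopyOption.ATOMIC_MOVE);
  }
}
```

The steady extra disk usage comes from the pool of parked files that are never deleted, which is the drawback mentioned above.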

megglos commented 1 year ago

ZDP-Planning:

Zelldon commented 1 year ago

So I stumbled over this again. We sometimes have spikes up to 500 ms, which I think is too high, especially if we run just one instance per second. I was not able to correlate the spikes with role changes or pod restarts.

latency

But to me it still looks like it is related to segment creation; I guess this needs to be investigated further.

segments

Zelldon commented 1 year ago

I hope I can take a look at it next week (when I'm medic).

Zelldon commented 1 year ago

I was not able to look into the issue.

Zelldon commented 1 year ago

@npepinpe is there an issue regarding your comment https://github.com/camunda/zeebe/issues/12311#issuecomment-1612597259 here, or a PR or branch you have somewhere?

npepinpe commented 1 year ago

There was a PR. The branch was probably deleted since. Let me see if I can pull it up.

npepinpe commented 1 year ago

https://github.com/camunda/zeebe/pull/11443

Zelldon commented 1 year ago

Seems like there's too many bugs to see something useful here. Let's stop that.

Love that :D

Zelldon commented 1 year ago

Thanks @npepinpe for digging it out

npepinpe commented 1 year ago

I remember the general approach worked; eventually there was no pre-allocation occurring. But yeah, the PR was a hack, so it ran into other issues :sweat_smile:

Zelldon commented 11 months ago

I will not have time to work on or look at this soon.