Open Zelldon opened 3 years ago
With https://github.com/zeebe-io/zeebe/pull/6444 we write an empty deployment record on FULLY_DISTRIBUTED
to reduce the duplication of information (and to avoid reaching the limits so easily). Currently this record is not consumed (only its key is used by the applier), which makes this possible. If we see a need for more information, we can revise/change that.
\cc @npepinpe I would like that you're aware of that
I can just say it again: if we want to fix this, we probably want to do it before 1.0 @npepinpe @saig0
We will postpone this for now:
Guess the wrong issue was referenced :smile:
Operate issue https://jira.camunda.com/browse/OPE-1285
With #6921 we removed the duplicated resources from the deployed processes, which were part of the DeploymentRecord. Now the record contains only the metadata of the processes. We still write the resources for the Deployment CREATED follow-up event.
As written here https://github.com/camunda-cloud/zeebe/pull/6921#issuecomment-831092725 it was not simple/easy to remove this, since we rely on the content. We use the resources on redistribution, which we could avoid if we distributed the ProcessRecord instead, for example. But this might also not be ideal, since it makes the distribution a bit more complex and fragile. For example: one process of a deployment might already be distributed while another is not, so you could already create instances for X but not yet for Y, etc.
In order to resolve this issue, we need to find a way to further reduce the written content and the duplicated data.
I think it is up to you @npepinpe whether we want to tackle this and when. I will remove my assignment for now.
Hint: WorkflowRecord was renamed to ProcessRecord
Raised in context of larger DMN deployments done by a customer here https://camunda.slack.com/archives/C037W9NMATG/p1721311530038739
And support case https://jira.camunda.com/browse/SUPPORT-22559
Input from @remcowesterhoud on the Slack thread
The specific problem described in the issue is actually fixed already. It mentions putting the resources in the Deployment.CREATED event, which we no longer do. We put it in the CREATED event for the resource, which we definitely need, since we store it in the state. It's also on the distribute command, which we also need, as the other partitions require the resource to store it. [..] Maybe some intermediary command before distribution could solve it. We could take the resources out of it and just reference them by key. The processor of that command could then fetch the resource and distribute it to the other partitions
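The intermediary-command idea above can be sketched roughly as follows. This is a hypothetical illustration only: the class and method names (`ResourceState`, `DistributeResourceCommand`, `process`) are invented for this sketch and are not Zeebe APIs. The point is simply that the command carries only a key, and the resource bytes are fetched from local state at distribution time instead of being duplicated in the command itself.

```java
import java.util.HashMap;
import java.util.Map;

public class DistributeByKeySketch {

  /** Stands in for the deployment state on the deploying partition. */
  static final class ResourceState {
    private final Map<Long, byte[]> resourcesByKey = new HashMap<>();

    void put(long key, byte[] resource) {
      resourcesByKey.put(key, resource);
    }

    byte[] get(long key) {
      return resourcesByKey.get(key);
    }
  }

  /** The intermediary command: just a key, no resource payload. */
  record DistributeResourceCommand(long resourceKey) {}

  /**
   * Processes the intermediary command: fetches the resource from local
   * state and returns the payload that would then be sent to the other
   * partitions for distribution.
   */
  static byte[] process(ResourceState state, DistributeResourceCommand command) {
    final byte[] resource = state.get(command.resourceKey());
    if (resource == null) {
      throw new IllegalStateException("unknown resource key " + command.resourceKey());
    }
    return resource;
  }

  public static void main(String[] args) {
    final ResourceState state = new ResourceState();
    state.put(42L, "process.bpmn contents".getBytes());

    final byte[] payload = process(state, new DistributeResourceCommand(42L));
    System.out.println("distributing " + payload.length + " bytes for key 42");
  }
}
```

The trade-off is the one already noted in the thread: distribution becomes per-resource rather than per-deployment, so partial distribution states become observable.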
ZPA Triage:
@camunda/zeebe-distributed-platform do you still see a technical limitation to keeping this within the limit?
Well, you are technically sending that response out, so it would prevent us from responding, I suppose :thinking:
Otherwise it's a performance issue, and possibly a memory issue. If you keep the processing unbounded, since it's entirely in memory, you are more likely to run out of memory. I think the real solution is to support chunked execution or some form of continuation.
> you are sending that response out technically
That's not correct. We only send a single event record back in the response, but the max message size limits all records in the record batch.
But a single record may still be > 4MB no?
> But a single record may still be > 4MB no?
Sure, but in the common cases it's not a single record surpassing the 4MB, it's the combination of all follow-up records. That includes the case above:
Same goes for creating a process instance with almost 4MB of variables:
EDIT: as we move to REST APIs, the response size is also no longer a limitation
> EDIT: as we move to REST APIs, the response size is also no longer a limitation
Arguably, right now we still need to buffer them all in memory, so I suspect larger responses could lead to OOM at times if we can't flush them out quickly enough :shrug:
At any rate, we can argue about the responses, but it remains that having unbounded records/batches will have an impact on performance, since right now we're not set up to chunk or split them and instead have to buffer everything in memory. So I would push back on having no limit at all, as I can guarantee someone will run into OOMs. There are some proposed fixes, though maybe there are simpler solutions I haven't considered.
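The "chunked execution" idea mentioned above could look something like the following minimal sketch: instead of buffering one unbounded record batch, follow-up records are split into consecutive batches that each stay under the configured max message size. This is illustrative only and not how Zeebe's batch handling actually works; `chunk` and its signature are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedBatchSketch {

  /**
   * Splits records into consecutive batches whose summed size never exceeds
   * maxBatchSize. A single record larger than the limit is rejected, since
   * it could never be written at all.
   */
  static List<List<byte[]>> chunk(List<byte[]> records, int maxBatchSize) {
    final List<List<byte[]>> batches = new ArrayList<>();
    List<byte[]> current = new ArrayList<>();
    int currentSize = 0;

    for (final byte[] record : records) {
      if (record.length > maxBatchSize) {
        throw new IllegalArgumentException(
            "single record of " + record.length + " bytes exceeds limit " + maxBatchSize);
      }
      // Flush the current batch when adding this record would overflow it.
      if (currentSize + record.length > maxBatchSize) {
        batches.add(current);
        current = new ArrayList<>();
        currentSize = 0;
      }
      current.add(record);
      currentSize += record.length;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    final List<byte[]> records = List.of(new byte[3], new byte[3], new byte[3]);
    // With a 4-byte limit, each 3-byte record must go into its own batch.
    System.out.println(chunk(records, 4).size()); // prints 3
  }
}
```

The hard part in practice is not the splitting itself but atomicity: today all follow-up records of one command are written as a single atomic batch, so any chunking scheme needs some form of continuation between chunks.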
ZPA triage:
Description
With https://github.com/zeebe-io/zeebe/pull/6289 we introduced a new record called WorkflowRecord, which represents the workflow (with resource, checksum, key, etc.) on the stream. We write this workflow event as a follow-up event on the deployment partition when we process the CREATE Deployment command. The issue is that we currently also write the CREATED Deployment as a follow-up event, which also contains all the workflows and resources. This means we can reach the maxMessageSize much more easily now with big deployments. In order to still support bigger deployments, we should find a way to mitigate this, either by not writing the workflows/resources as part of the Deployment event (this might break Operate) or by something else.
This problem will become more prominent if we reduce the max message size.
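A back-of-the-envelope sketch of the arithmetic: when each resource is written both in its own follow-up record and again inside the Deployment CREATED event, the bytes written for one deployment roughly double, so a deployment that fits comfortably under the limit on its own can blow past it. The numbers below are invented for illustration; the 4MB default corresponds to the broker's maxMessageSize setting.

```java
public class DeploymentSizeSketch {

  static final int MAX_MESSAGE_SIZE = 4 * 1024 * 1024; // default 4MB limit

  /** Total bytes when every resource is written twice (process record + deployment event). */
  static long withDuplication(int resourceCount, int resourceSize) {
    final long processRecords = (long) resourceCount * resourceSize;
    final long deploymentEvent = (long) resourceCount * resourceSize;
    return processRecords + deploymentEvent;
  }

  /** Total bytes when the deployment event carries only per-process metadata. */
  static long withoutDuplication(int resourceCount, int resourceSize, int metadataSize) {
    return (long) resourceCount * resourceSize + (long) resourceCount * metadataSize;
  }

  public static void main(String[] args) {
    // Ten 300KB resources: duplicated they exceed 4MB, deduplicated they fit.
    System.out.println(withDuplication(10, 300_000) > MAX_MESSAGE_SIZE);           // true
    System.out.println(withoutDuplication(10, 300_000, 1_000) > MAX_MESSAGE_SIZE); // false
  }
}
```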
Related support cases: https://jira.camunda.com/browse/SUPPORT-22559