camunda / camunda

Process Orchestration Framework
https://camunda.com/platform/

Refactor internal communication layer #11309

Closed Zelldon closed 2 months ago

Zelldon commented 1 year ago

Sorry if an issue for this already exists; I was not able to find one.

Description

We have recently seen many issues with the Atomix-based internal networking/communication layer, which we use between gateway and broker and between brokers.

For example, in https://github.com/zeebe-io/zeebe-chaos/issues/294 we sent requests over several minutes without detecting that the node was already gone. When trying to fix this issue via https://github.com/camunda/zeebe/pull/11307, the fix turned out to be rather hard to test and reason about (as is the code in general).

This means that right now the networking part is hard to maintain, hard to test, and something of a black hole, since there are no metrics and no good logging.

Ideally we should spend some time refactoring this part of our system so that we can be more confident in our networking. This would include introducing better visibility (logging + metrics), improving maintainability and readability, and, very importantly, reducing the complexity and improving the testability.

We have already thought several times about replacing it with gRPC (brought up initially by @npepinpe). This would be a good opportunity, but it also comes with a lot of costs and risks, of course. We need to discuss this further within the team.

megglos commented 1 year ago

Notes:

npepinpe commented 2 months ago

No capacity or drive at the moment. I think we would need a solid motivation and/or idea. Closing for now.

Very open to a concrete proposal.