

DDD-7 Asynchronous Debezium Embedded Engine #8

Closed (vjuranek closed this issue 7 months ago)

vjuranek commented 11 months ago

https://issues.redhat.com/browse/DBZ-7073

vjuranek commented 11 months ago

In the past we had requests from the community asking whether we are going to support multiple connectors in the Debezium server. Our answer was always "no", we want one connector per server, with the plan to eventually have some control plane over multiple servers. With this DDD I'm thinking about it again and I'm wondering if we should at least keep open the possibility of having multiple connectors under one engine/server. I'm thinking especially about the use case with one DB source connector and the JDBC connector as the sink connector. IMHO this could be a good base for a DB mirroring/migration/backup tool, and not having to connect two independent processes would simplify development and administration, and there would also be performance gains.

For now it would be mostly an implementation detail, but while we are at it, why not do it right away if we agree this is a good decision. As I mentioned above, I'm creating Debezium connector and task abstractions, and it would be natural to encapsulate most of the logic into these objects. E.g. in the engine run() method, instead of calling engine methods like

public void run() {
    // ...
    startConnector(connector);
    createConnectorTasks(connector);
    startTasks(connector);
    // ...
}

it would be something like this:

public void run() {
    // ...
    connector.start();
    connector.createTasks();
    connector.startTasks();
    // ...
}
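
For illustration, the abstraction behind that call sequence might look roughly like the interface below. This is only a sketch; the actual shape of DebeziumSourceConnector (method names, lifecycle, configuration handling) is still to be designed.

// Illustrative sketch only -- the real abstraction would also need to carry
// configuration, offset handling, and lifecycle callbacks.
public interface DebeziumSourceConnector {

    // Start the underlying Kafka Connect source connector.
    void start();

    // Derive task configurations from the connector and create the tasks.
    void createTasks();

    // Run the created tasks, e.g. each on its own executor thread.
    void startTasks();

    // Stop the tasks and the connector, releasing any resources.
    void stop();
}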

However, if we are sure we always want one connector per engine/server, it would be more natural to keep most of the logic in the engine, with DebeziumSourceConnector acting as a small support object (which could eventually be dropped completely).

Personally I'd prefer the first option (most of the logic in the connector, with the possibility of having multiple connectors in the future), with the current implementation supporting only one connector.

WDYT?

jpechane commented 11 months ago

@vjuranek TBH, I am really not a fan of multiple connector support. Please keep in mind that the target of Debezium Server is to be a smaller/simpler version of the runtime, not a Kafka Connect competitor. Your remark about sink connector support is definitely valid, but in that case we need to decide:

1) Do we want to support the JDBC sink connector? The intent already was that the JDBC sink will be slightly refactored and an API/SPI will be made available to make it compatible with Debezium Server. Also, if we'd like to introduce fully parallel processing from the transaction log to the sink write, then we might end up skipping the Kafka Connect API on the source side completely and interacting with the Debezium framework directly. It would also be a bit strange to support JDBC sinks as we do now and then support the JDBC sink in a different way.

2) If we want to support (or not support, but provide the ability for) an arbitrary sink connector, then maybe we should introduce a Debezium Server sink adapter that will fit into the sink implementation and serve as a translator/integrator between Debezium Server sinks and the Kafka Connect sink API. That way we could introduce early support for the JDBC sink and later on implement option 1).
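
As a rough illustration of option 2), such an adapter could implement the existing DebeziumEngine.ChangeConsumer contract and translate the engine's change events into Kafka Connect SinkRecords. The sketch below makes several simplifying assumptions (string-serialized events, no schema conversion, a synthetic offset), and the class name ConnectSinkAdapter is made up for this example:

import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;

// Hypothetical adapter: exposes a Kafka Connect sink (e.g. the JDBC sink task)
// behind the Debezium Server sink contract.
public class ConnectSinkAdapter implements DebeziumEngine.ChangeConsumer<ChangeEvent<String, String>> {

    private final SinkTask sinkTask;
    private long offset = 0; // synthetic offset, as there is no Kafka broker involved

    public ConnectSinkAdapter(SinkTask sinkTask) {
        this.sinkTask = sinkTask;
    }

    @Override
    public void handleBatch(List<ChangeEvent<String, String>> records,
                            DebeziumEngine.RecordCommitter<ChangeEvent<String, String>> committer)
            throws InterruptedException {
        List<SinkRecord> batch = new ArrayList<>(records.size());
        for (ChangeEvent<String, String> record : records) {
            // Schemas are omitted here; a real adapter would reconstruct the
            // Connect schema and struct from the serialized key/value.
            batch.add(new SinkRecord(record.destination(), 0, null, record.key(), null, record.value(), offset++));
        }
        sinkTask.put(batch);
        for (ChangeEvent<String, String> record : records) {
            committer.markProcessed(record);
        }
        committer.markBatchFinished();
    }
}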

vjuranek commented 11 months ago

Thanks @jpechane for the comments. I didn't mention it in my previous comment, but yes, the main drawback I see with this approach is that sooner or later we will end up re-implementing most of the Kafka stuff, so I completely agree on this part. As for JDBC sink support - in general (read: in an undefined future) I'd say yes, IMHO we should support it (the logic behind it is that we should provide the same level of support as for other connectors, and as we provide a runtime for source connectors, we should do the same for the sink). But it would definitely be good to wait for the community first, to see if there is any interest in it at all.

Anyway, I have no strong opinion about that, so no problem with keeping most of the logic in the engine itself. If we later decide to also support sink connectors, it should be quite easy to do the refactoring.

jeremy-l-ford commented 10 months ago

Is there a possibility that the engine could provide both parallelism and ordering? In a manner akin to how Kafka handles its partitions, each message would get hashed by its key. The hashes would be assigned to a queue/thread. For the use case of an event whose key is the table name, this would gain parallelism while maintaining order at the table level.

Perhaps a pluggable impl could determine the hash/queue id (see the sketch below).

Perhaps this makes things overly complex though. Anyways, just a thought.
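
For what it's worth, a minimal sketch of the idea, assuming a fixed number of single-threaded lanes and a pluggable function that maps a record key to a hash (all class and parameter names below are made up for illustration):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.ToIntFunction;

import org.apache.kafka.connect.source.SourceRecord;

// Illustrative dispatcher: records with the same key always land on the same
// single-threaded executor, preserving per-key ordering while distinct keys
// are processed in parallel.
public class KeyHashDispatcher {

    private final ExecutorService[] lanes;
    private final ToIntFunction<Object> hasher; // pluggable hash/queue-id function

    public KeyHashDispatcher(int parallelism, ToIntFunction<Object> hasher) {
        this.lanes = new ExecutorService[parallelism];
        this.hasher = hasher;
        for (int i = 0; i < parallelism; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    public void dispatch(SourceRecord record, Runnable processing) {
        Object key = record.key();
        // Math.floorMod avoids negative indexes for negative hash values.
        int lane = key == null ? 0 : Math.floorMod(hasher.applyAsInt(key), lanes.length);
        lanes[lane].execute(processing);
    }
}

With a default hasher of Object::hashCode and events keyed by table name, this would keep table-level ordering, as described above.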

vjuranek commented 10 months ago

Thanks for your input @jeremy-l-ford, I think this is a very good idea! I'll think more about it, but it shouldn't be that hard to implement.

We probably cannot use it as the default and should still provide total ordering of messages as the default (e.g. when one table references values in another table with a foreign key constraint, an insert into such a table would fail if it arrives sooner than the insert into the referenced table), but it can be another option between running the consumer in a single thread and running the consumer in multiple threads. Basically, what we would need here is to expose an ExecutorService (or, in this case, a set of ExecutorServices created as newSingleThreadExecutor()) to the RecordProcessor. That would be a little bit dangerous, as we would pass the ExecutorService to potentially unknown code (which may either shut down the service or handle exceptions in an unexpected way), but the benefits may outweigh the risks. Or possibly we can expose the ExecutorService only to our own implementations of RecordProcessor, and if we decide in the future to provide RecordProcessor via an SPI, we won't provide the ExecutorService to those implementations.
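
One way to limit that risk, if we ever expose the executor to SPI-provided processors, might be a delegating wrapper whose lifecycle methods are disabled, so foreign code can submit work but cannot shut the engine-owned service down. A sketch (the class name is made up):

import java.util.List;
import java.util.concurrent.AbstractExecutorService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: lets a RecordProcessor submit tasks while keeping the executor
// lifecycle under the engine's control.
public class NonClosableExecutorService extends AbstractExecutorService {

    private final ExecutorService delegate;

    public NonClosableExecutorService(ExecutorService delegate) {
        this.delegate = delegate;
    }

    @Override
    public void execute(Runnable command) {
        delegate.execute(command);
    }

    // Lifecycle methods are blocked; only the engine may stop the executor.
    @Override
    public void shutdown() {
        throw new UnsupportedOperationException("executor lifecycle is owned by the engine");
    }

    @Override
    public List<Runnable> shutdownNow() {
        throw new UnsupportedOperationException("executor lifecycle is owned by the engine");
    }

    @Override
    public boolean isShutdown() {
        return delegate.isShutdown();
    }

    @Override
    public boolean isTerminated() {
        return delegate.isTerminated();
    }

    @Override
    public boolean awaitTermination(long timeout, TimeUnit unit) throws InterruptedException {
        return delegate.awaitTermination(timeout, unit);
    }
}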

Any thoughts on this @jpechane ?