[X] I have searched the issues and found no similar issues.
Multiple table proposal
Background & Motivation
In the CDC scenario, we found that running too many CDC Sources occupies too many database connections, which affects the stability of the database.
For this reason, we want to reduce the number of Sources when synchronizing all tables. The current design has each Source synchronize one table, so we propose that one Source handle multiple tables.
Advantages: fewer database connections and less pressure on the database.
Disadvantage: in SeaTunnel Zeta, multiple tables will share one pipeline, so the granularity of fault tolerance becomes coarser.
Overall Design
Load the CatalogFactory through SPI, as specified in the config file.
Create a Catalog using the CatalogFactory.
Create CatalogTables from the Catalog and the configured options.
If a table does not exist in the sink, create an inferred CatalogTable in the sink.
Fill the obtained CatalogTables into TableFactoryContext, and use them in TableSinkFactory, TableSourceFactory, and TableTransformFactory.
If a Source supports multiple tables, its TableSourceFactory must implement the SupportMultipleTable interface and use the information from multiple CatalogTables to create a MultipleRowType; SeaTunnelSource#getProducedType then returns that MultipleRowType.
Inside the Source, use MultipleRowType to deserialize data into SeaTunnelRow, and add the table name to each SeaTunnelRow.
The engine distributes data according to MultipleRowType and the table name carried by each SeaTunnelRow.
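The step that creates an inferred table in the sink when none exists can be sketched as follows. This is a minimal illustration only: `TableStub`, `InMemoryCatalog`, and `ensureSinkTables` are hypothetical stand-ins invented for this example, not the real Catalog or CatalogTable APIs, and the inferred table simply copies the source columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; the real Catalog and CatalogTable APIs may differ.
class SinkTableSyncSketch {

    /** Minimal table description: a name plus ordered column names. */
    static final class TableStub {
        final String name;
        final List<String> columns;

        TableStub(String name, List<String> columns) {
            this.name = name;
            this.columns = columns;
        }
    }

    /** In-memory catalog keyed by table name. */
    static final class InMemoryCatalog {
        private final Map<String, TableStub> tables = new LinkedHashMap<>();

        boolean tableExists(String name) {
            return tables.containsKey(name);
        }

        void createTable(TableStub table) {
            tables.put(table.name, table);
        }

        TableStub getTable(String name) {
            return tables.get(name);
        }
    }

    /**
     * For every source table, keep the sink table if it already exists;
     * otherwise create one inferred from the source definition.
     * Returns the names of the tables that had to be created.
     */
    static List<String> ensureSinkTables(List<TableStub> sourceTables, InMemoryCatalog sink) {
        List<String> created = new ArrayList<>();
        for (TableStub source : sourceTables) {
            if (!sink.tableExists(source.name)) {
                // Assumption: the inferred sink table mirrors the source columns.
                sink.createTable(new TableStub(source.name, new ArrayList<>(source.columns)));
                created.add(source.name);
            }
        }
        return created;
    }

    public static void main(String[] args) {
        InMemoryCatalog sink = new InMemoryCatalog();
        sink.createTable(new TableStub("orders", Arrays.asList("id", "amount")));
        List<TableStub> source = Arrays.asList(
                new TableStub("orders", Arrays.asList("id", "amount")),
                new TableStub("users", Arrays.asList("id", "name")));
        System.out.println("Created in sink: " + ensureSinkTables(source, sink));
    }
}
```

A real connector would likely also need to map column types between source and sink dialects, which this sketch leaves out.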
Related pseudo-code
// For Source deserialization and Row distribution
public class MultipleRowType implements SeaTunnelDataType<SeaTunnelRow> {
    private final String[] tableNames;
    private final SeaTunnelRowType[] rowTypes;
}
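To make the pairing of table names and row types concrete, here is a minimal self-contained sketch of how a lookup by table name might work. `RowTypeStub`, `MultipleRowTypeStub`, and the `getRowType` accessor are assumptions made for illustration; the proposal only specifies the two parallel arrays.

```java
// Hypothetical stand-ins; only the two parallel arrays come from the proposal.
class MultipleRowTypeSketch {

    /** Stand-in for SeaTunnelRowType: only field names are modelled. */
    static final class RowTypeStub {
        final String[] fieldNames;

        RowTypeStub(String... fieldNames) {
            this.fieldNames = fieldNames;
        }
    }

    /** Pairs each table name with the row type at the same index. */
    static final class MultipleRowTypeStub {
        private final String[] tableNames;
        private final RowTypeStub[] rowTypes;

        MultipleRowTypeStub(String[] tableNames, RowTypeStub[] rowTypes) {
            if (tableNames.length != rowTypes.length) {
                throw new IllegalArgumentException("tableNames and rowTypes must have the same length");
            }
            this.tableNames = tableNames;
            this.rowTypes = rowTypes;
        }

        /** Hypothetical accessor: returns the row type registered for a table. */
        RowTypeStub getRowType(String tableName) {
            for (int i = 0; i < tableNames.length; i++) {
                if (tableNames[i].equals(tableName)) {
                    return rowTypes[i];
                }
            }
            throw new IllegalArgumentException("Unknown table: " + tableName);
        }
    }

    public static void main(String[] args) {
        MultipleRowTypeStub type = new MultipleRowTypeStub(
                new String[] {"db.orders", "db.users"},
                new RowTypeStub[] {new RowTypeStub("id", "amount"), new RowTypeStub("id", "name", "email")});
        System.out.println(String.join(",", type.getRowType("db.users").fieldNames));
    }
}
```

A linear scan is enough for a handful of tables; a map keyed by table name would be the natural choice if a Source handles many tables.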
// Declares that the Source supports multiple tables; the Source itself decides how many tables it handles
import java.util.List;

public interface SupportMultipleTable {
    /**
     * A connector can pick tables and return the accepted and remaining tables.
     */
    Result applyTables(TableFactoryContext context);

    final class Result {
        private final List<CatalogTable> acceptedTables;
        private final List<CatalogTable> remainingTables;

        private Result(
                List<CatalogTable> acceptedTables,
                List<CatalogTable> remainingTables) {
            this.acceptedTables = acceptedTables;
            this.remainingTables = remainingTables;
        }
    }
}
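As an illustration of how a connector might split candidate tables into accepted and remaining ones, the sketch below accepts only the tables of a single database. Everything here is assumed for the example: `TableStub` stands in for CatalogTable, the candidate list is passed directly instead of through TableFactoryContext, and the static factory `Result.of` is hypothetical (the proposal shows only a private constructor).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins; the real method receives a TableFactoryContext.
class ApplyTablesSketch {

    /** Stand-in for CatalogTable: only the database and table name. */
    static final class TableStub {
        final String database;
        final String table;

        TableStub(String database, String table) {
            this.database = database;
            this.table = table;
        }
    }

    /** Mirrors the proposed Result: accepted tables plus the ones left over. */
    static final class Result {
        final List<TableStub> acceptedTables;
        final List<TableStub> remainingTables;

        private Result(List<TableStub> acceptedTables, List<TableStub> remainingTables) {
            this.acceptedTables = acceptedTables;
            this.remainingTables = remainingTables;
        }

        /** Hypothetical factory method, needed because the constructor is private. */
        static Result of(List<TableStub> acceptedTables, List<TableStub> remainingTables) {
            return new Result(acceptedTables, remainingTables);
        }
    }

    /** Example policy: one Source instance accepts every table of one database. */
    static Result applyTables(List<TableStub> candidates, String database) {
        List<TableStub> accepted = new ArrayList<>();
        List<TableStub> remaining = new ArrayList<>();
        for (TableStub candidate : candidates) {
            if (candidate.database.equals(database)) {
                accepted.add(candidate);
            } else {
                remaining.add(candidate);
            }
        }
        return Result.of(accepted, remaining);
    }

    public static void main(String[] args) {
        List<TableStub> candidates = Arrays.asList(
                new TableStub("shop", "orders"),
                new TableStub("shop", "users"),
                new TableStub("audit", "events"));
        Result result = applyTables(candidates, "shop");
        System.out.println(result.acceptedTables.size() + " accepted, "
                + result.remainingTables.size() + " remaining");
    }
}
```

The remaining tables could then be offered to another Source instance, which keeps the decision about how many tables one Source handles inside the connector.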
Adapter
SeaTunnel Zeta
// pseudo-code
public class DistributionTransform implements SeaTunnelTransform<Record<?>> {
    // Use MultipleRowType to distribute records to the corresponding data channels
    private MultipleRowType multiRowType;
}
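The distribution idea can be sketched without any engine dependency: each row carries its table name, and the distributor appends it to the channel registered for that table. `RowStub`, `Distributor`, and the list-backed channels are hypothetical simplifications; the actual Zeta transform would work with Record and the engine's own data channels.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; real channels would be engine queues, not lists.
class RowDistributionSketch {

    /** Stand-in for SeaTunnelRow: the table name plus the field values. */
    static final class RowStub {
        final String tableName;
        final Object[] fields;

        RowStub(String tableName, Object... fields) {
            this.tableName = tableName;
            this.fields = fields;
        }
    }

    /** Routes each row to the channel registered for its table name. */
    static final class Distributor {
        private final Map<String, List<RowStub>> channels = new LinkedHashMap<>();

        Distributor(String... tableNames) {
            for (String tableName : tableNames) {
                channels.put(tableName, new ArrayList<>());
            }
        }

        void emit(RowStub row) {
            List<RowStub> channel = channels.get(row.tableName);
            if (channel == null) {
                // Rows of tables that were never declared indicate a mismatch with the row type.
                throw new IllegalStateException("No channel for table: " + row.tableName);
            }
            channel.add(row);
        }

        List<RowStub> channel(String tableName) {
            return channels.get(tableName);
        }
    }

    public static void main(String[] args) {
        Distributor distributor = new Distributor("db.orders", "db.users");
        distributor.emit(new RowStub("db.orders", 1, 9.99));
        distributor.emit(new RowStub("db.users", 7, "alice"));
        distributor.emit(new RowStub("db.orders", 2, 5.00));
        System.out.println("orders rows: " + distributor.channel("db.orders").size());
    }
}
```

Failing fast on an unknown table name is one possible policy; dropping or logging such rows would be alternatives depending on how strict the pipeline should be.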
Flink
Operator chain: chain the operators so that rows with different structures do not need to be serialized between them
OutputTag & Context#output: use side-output streams to distribute data to the corresponding channels