[Umbrella] InLong Transform feature

Motivation

InLong Transform expands InLong's access and distribution capabilities. On the access side, it adapts to a richer variety of data protocols and reporting scenarios; on the distribution side, it adapts to complex and diverse data distribution scenarios. It improves data quality and collaboration by providing join, aggregation, filtering, grouping, value extraction, sampling, and other computing capabilities that are decoupled from the computing engine. It simplifies the pre-processing that users must do when reporting data and before starting data analysis, lowers the barrier to using data, and lets users focus on the business value of their data.

Scenarios

Data Cleansing: During the data integration process, it is necessary to clean data from different sources to eliminate errors, duplicates, and inconsistencies. Transform capabilities can help companies perform data cleansing more effectively and improve data quality.
Data Fusion: Combining data from different sources for unified analysis and reporting. Transform capabilities can handle data in different formats and structures, enabling data fusion and integration.
Data Standardization: Converting data into a unified standard format for cross-system and cross-platform data analysis. Transform capabilities can help companies achieve data standardization and normalization.
Data Partitioning and Indexing: To improve the performance of data queries and analysis, data needs to be partitioned and indexed. Transform capabilities can dynamically adjust field values for partitioning and indexing, thereby improving the performance of the data warehouse.
Data Aggregation and Calculation: During the data analysis process, data needs to be aggregated and calculated to extract valuable information. Transform capabilities can perform complex data aggregation and calculations, supporting multi-dimensional data analysis.
Data Security and Privacy Protection: During the data integration process, it is essential to ensure data security and privacy. Transform capabilities can implement data de-identification, encryption, and authorization management to protect data security and privacy.
Cross-team Data Sharing: For data security, only filtered subsets of data streams are shared. To decouple data dependencies, data interfaces are agreed upon with collaborating teams, and the merging of multiple streams into the agreed data stream interface is adjusted dynamically.

Feature list

Rich Data Protocols

In addition to CSV and KV, Transform supports standard protocols such as PB (Protocol Buffers), JSON, and Thrift, as well as business-defined custom HTTP packet and TCP packet protocols.
In the collection and distribution stages, Transform is integrated as an SDK that handles protocol processing and data conversion.
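
As a rough illustration of what "protocol processing and data conversion" could look like when embedded as a library, here is a minimal Python sketch. It is not the InLong Transform SDK: the function names (`decode_csv`, `decode_kv`, `transform`), the `|` CSV delimiter, the `k=v&k=v` KV layout, and the field mapping are all assumptions made for the example.

```python
import csv
import io
import json
from urllib.parse import parse_qsl


def decode_csv(line, field_names, delimiter="|"):
    """Decode one delimited record into a dict keyed by field name (hypothetical helper)."""
    row = next(csv.reader(io.StringIO(line), delimiter=delimiter))
    return dict(zip(field_names, row))


def decode_kv(line):
    """Decode one 'k1=v1&k2=v2' record into a dict (assumed KV layout)."""
    return dict(parse_qsl(line, keep_blank_values=True))


def transform(raw, decoder, mapping):
    """Decode a raw record, project/rename fields, and encode the result as JSON.

    mapping is {target_field: source_field}; missing source fields become null.
    """
    record = decoder(raw)
    projected = {target: record.get(source) for target, source in mapping.items()}
    return json.dumps(projected, sort_keys=True)


# The same mapping is applied regardless of the source protocol.
MAPPING = {"user_id": "id", "user_city": "city"}
CSV_FIELDS = ["id", "name", "city"]

csv_out = transform("1|alice|paris", lambda s: decode_csv(s, CSV_FIELDS), MAPPING)
kv_out = transform("id=2&name=bob&city=rome", decode_kv, MAPPING)
```

The point of the sketch is only that the decoding step is pluggable per protocol while the conversion logic stays the same.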

Decoupling from the Computing Engine

Transform uses its own internal flow processing instead of referencing the computing engine's operators, which decouples it from the computing engine.
Data output Writers and aggregation flows are registered with the Transform framework through defined interfaces, so that Transform can adapt to different computing engines.
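
A minimal sketch of this registration idea, in Python for brevity. The `Writer` interface, `ListWriter`, and `TransformPipeline` below are hypothetical names for illustration, not the actual InLong interfaces; the assumption is only that the engine supplies a Writer and the framework never calls engine operators directly.

```python
from abc import ABC, abstractmethod


class Writer(ABC):
    """Output interface an engine-specific adapter would implement (hypothetical)."""

    @abstractmethod
    def write(self, record: dict) -> None:
        ...


class ListWriter(Writer):
    """Stand-in for an engine adapter: collects records in memory."""

    def __init__(self):
        self.records = []

    def write(self, record: dict) -> None:
        self.records.append(record)


class TransformPipeline:
    """Runs plain-function steps internally and hands results to registered writers."""

    def __init__(self):
        self._steps = []
        self._writers = []

    def add_step(self, step):
        # A step maps a record to a new record, or returns None to drop it.
        self._steps.append(step)
        return self

    def register_writer(self, writer: Writer):
        self._writers.append(writer)
        return self

    def process(self, record: dict) -> None:
        for step in self._steps:
            record = step(record)
            if record is None:
                return  # filtered out, nothing is written
        for writer in self._writers:
            writer.write(record)


sink = ListWriter()
pipeline = (
    TransformPipeline()
    .add_step(lambda r: r if r.get("level") == "error" else None)  # filter
    .add_step(lambda r: {**r, "msg": r["msg"].upper()})            # value conversion
    .register_writer(sink)
)
for rec in [{"level": "info", "msg": "ok"}, {"level": "error", "msg": "disk full"}]:
    pipeline.process(rec)
```

Swapping `ListWriter` for an adapter around a Flink or Spark sink would, under these assumptions, leave the pipeline steps unchanged.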

Seamless and Lossless Changes

Transform periodically pulls its configuration from the Manager, so configuration changes take effect seamlessly and without data loss.
This avoids the job restarts that changes to FlinkSQL and SparkSQL would otherwise require.
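
One way to picture the periodic pull is a holder that swaps in a new configuration snapshot while records keep flowing. The Python sketch below is an assumption-laden illustration, not InLong code: `fetch_config` stands in for whatever call retrieves configuration from the Manager, and the `version`/`fields` config shape is invented for the example.

```python
import threading


class ReloadableTransform:
    """Applies the latest pulled configuration without restarting the caller."""

    def __init__(self, fetch_config):
        # fetch_config is a hypothetical callable returning
        # {"version": int, "fields": [field names to keep]}.
        self._fetch_config = fetch_config
        self._config = fetch_config()
        self._stop = threading.Event()

    def refresh(self):
        """Pull once; swap the config reference only if the version changed."""
        new_config = self._fetch_config()
        if new_config["version"] != self._config["version"]:
            self._config = new_config  # single reference assignment
            return True
        return False

    def start(self, interval_seconds):
        """Pull in a background daemon thread every interval_seconds."""
        def loop():
            # Event.wait returns False on timeout, True once stop() is called.
            while not self._stop.wait(interval_seconds):
                self.refresh()

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()

    def process(self, record):
        config = self._config  # take one snapshot per record
        return {name: record.get(name) for name in config["fields"]}
```

Because each record is processed against a single snapshot, a record is handled entirely under the old or the new configuration, never a mix of the two.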

Automatic Scaling

Transform tasks can be scheduled across different computing jobs, which enables seamless and lossless automatic scaling.

Task list

[x] #10093
[x] #10109
[x] #10117
[ ] #10118
[ ] #10119
[ ] #10128
[x] #10129
[x] #10130

InLong Component

Other for not specified component

Are you willing to submit PR?

[X] Yes, I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct
