DTStack / chunjun

A data integration framework
https://dtstack.github.io/chunjun/
Apache License 2.0
3.98k stars 1.69k forks

[Feature][chunjun-core] Supports capturing dirty data from the source and when the source sends it downstream #1901 #1902

Closed david-gao1 closed 2 months ago

david-gao1 commented 3 months ago

… and when the source sends it downstream #1901

Purpose of this pull request

Which issue you fix

Fixes # (issue).

Checklist:

david-gao1 commented 2 months ago

Steps to reproduce:

CREATE TABLE source
(
  `ID` int,
  `FloatColumn` string,
  `BinaryColumn` bytes,
  `VarBinaryColumn` bytes,
  `LongBinaryColumn` bytes
) WITH (
      'connector' = 'xxx-x'
      );
CREATE TABLE sink
(
  `ID` int,
  `FloatColumn` int,
  `BinaryColumn` string,
  `VarBinaryColumn` string,
  `LongBinaryColumn` string
) WITH (
      'connector' = 'stream-x'
      );
insert into sink 
select 
`ID` as `ID`,
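CAST(`FloatColumn` AS int)  as `FloatColumn`, -- e.g. if the source emits a dirty record such as 111aa here, sending it to the downstream operator raises an error, but the dirty record cannot be captured at that point, so the dirty-data manager cannot do its job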
CAST(`BinaryColumn` AS string)  as `BinaryColumn`,
CAST(`VarBinaryColumn` AS string)  as `VarBinaryColumn`,
CAST(`LongBinaryColumn` AS string)  as `LongBinaryColumn`
 from source ;
david-gao1 commented 2 months ago

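This feature mainly strengthens dirty-data capture: it can capture dirty data at the source as well as dirty data that causes an error when the source sends it to a downstream operator. This reduces how often dirty data has to be removed by hand.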
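
To illustrate the idea, here is a minimal, self-contained Java sketch of capturing a record that fails while it is being handed to the downstream operator. It is not the code in this pull request and does not use ChunJun's API: `DirtyCaptureSketch`, `DirtyRecord`, and `CapturingEmitter` are hypothetical names, and `Integer.parseInt` is only a stand-in for the downstream `CAST(... AS int)` that rejects a value such as `111aa`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: none of these types are ChunJun classes.
class DirtyCaptureSketch {

    /** A rejected record together with the error that rejected it. */
    static final class DirtyRecord {
        final String raw;
        final Throwable cause;

        DirtyRecord(String raw, Throwable cause) {
            this.raw = raw;
            this.cause = cause;
        }
    }

    /** Wraps the hand-off to the downstream operator so failures are recorded. */
    static final class CapturingEmitter {
        private final Consumer<String> downstream;
        private final List<DirtyRecord> dirty = new ArrayList<>();

        CapturingEmitter(Consumer<String> downstream) {
            this.downstream = downstream;
        }

        void emit(String record) {
            try {
                // The downstream operator may throw while processing the record.
                downstream.accept(record);
            } catch (RuntimeException e) {
                // Keep the offending record instead of failing the whole job.
                dirty.add(new DirtyRecord(record, e));
            }
        }

        List<DirtyRecord> dirtyRecords() {
            return dirty;
        }
    }

    public static void main(String[] args) {
        List<Integer> sink = new ArrayList<>();
        // Assumption: parseInt stands in for the downstream CAST to int.
        CapturingEmitter emitter =
                new CapturingEmitter(r -> sink.add(Integer.parseInt(r)));

        for (String row : List.of("111", "111aa", "42")) {
            emitter.emit(row);
        }

        System.out.println("clean rows: " + sink);
        for (DirtyRecord d : emitter.dirtyRecords()) {
            System.out.println("dirty row: " + d.raw + " (" + d.cause + ")");
        }
    }
}
```

In this sketch the clean rows still reach the sink, and the failing row is kept alongside its exception, which is the information a dirty-data manager would need in order to report or store it.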