datavane / tis

Supports agile DataOps based on Flink, DataX, Flink-CDC, and Chunjun, with a Web UI
https://tis.pub
Apache License 2.0

[Bug][Tis] Hive2Doris import: the SelectedTab cols structure is passed with type: null, so validation fails and the sync pipeline task cannot be created #391

Open alldatafounder opened 2 days ago

alldatafounder commented 2 days ago

Task description: when running a Hive2Doris import, the Hive table's columns are read back with a null type, so the cols structure passed in the SelectedTab contains type: null; validation therefore fails and the sync pipeline task cannot be created. Versions involved: Doris 2.0.7; Hive client 2.1.1-cdh; Hive server 2.3.2.

(screenshot omitted)

So far I have tried the following:

(screenshot omitted)

Hive table:

(screenshots omitted)

Thoughts on a fix:

Reasoning: the MySQL and Oracle paths both execute the code below to assign the type. The selectedTab structure is read from the temporary XML file, and each column's type is then assigned a DataType based on JDBCTypes.

My guess: the Hive path never completes this step, which causes the problem.

    import java.util.Map;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    // ISelectedTab, CMeta and ColumnMetaData are TIS project types.
    // Copies column metadata (JDBC type, pk flag, comment, nullability)
    // from the datasource onto each selected column, so CMeta.type is never null.
    public static void fillSelectedTabMeta(ISelectedTab tab,
                                           Function<ISelectedTab, Map<String, ColumnMetaData>> tableColsMetaGetter) {
        Map<String, ColumnMetaData> colsMeta = tableColsMetaGetter.apply(tab);
        ColumnMetaData colMeta = null;
        if (colsMeta.size() < 1) {
            throw new IllegalStateException("table:" + tab.getName() + " relevant cols meta can not be empty");
        }
        for (CMeta col : tab.getCols()) {
            colMeta = colsMeta.get(col.getName());
            if (colMeta == null) {
                throw new IllegalStateException("col:" + col.getName() + " can not find relevant 'col' on "
                        + tab.getName() + ", exist keys:[" + colsMeta.keySet().stream().collect(Collectors.joining(",")) + "]");
            }
            col.setPk(colMeta.isPk());
            col.setType(colMeta.getType());
            col.setComment(colMeta.getComment());
            col.setNullable(colMeta.isNullable());
        }
    }
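
If the Hive path indeed skips this step, below is a minimal sketch of what a tableColsMetaGetter for Hive could resolve. It goes through HiveServer2's JDBC driver and java.sql.DatabaseMetaData; the class, the connection URL, and the name-to-type map are hypothetical stand-ins, since constructing TIS's actual ColumnMetaData is project-specific:

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch, not TIS's actual Hive implementation: resolves
    // each column's java.sql.Types constant so that CMeta.setType(...)
    // would no longer receive null. Requires hive-jdbc on the classpath.
    public class HiveColsMetaSketch {

        public static Map<String, Integer> fetchColTypes(String jdbcUrl, String table) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Map<String, Integer> colTypes = new LinkedHashMap<>();
            try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
                DatabaseMetaData meta = conn.getMetaData();
                try (ResultSet rs = meta.getColumns(null, "default", table, null)) {
                    while (rs.next()) {
                        // DATA_TYPE is the java.sql.Types constant for the column.
                        colTypes.put(rs.getString("COLUMN_NAME"), rs.getInt("DATA_TYPE"));
                    }
                }
            }
            return colTypes;
        }
    }

Against the pokes table shown below, fetchColTypes("jdbc:hive2://localhost:10000/default", "pokes") should report something like foo -> INTEGER and bar -> VARCHAR, rather than null.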
baisui1981 commented 2 days ago

DDL of the table in question:

 CREATE TABLE `pokes`(                              
   `foo` int,                                       
   `bar` string)                                    
 ROW FORMAT SERDE                                   
   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  
 STORED AS INPUTFORMAT                              
   'org.apache.hadoop.mapred.TextInputFormat'       
 OUTPUTFORMAT                                       
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
 LOCATION                                           
   'hdfs://namenode:8020/user/hive/warehouse/pokes' 
 TBLPROPERTIES (                                    
   'transient_lastDdlTime'='1730453007');

insert into pokes(foo,bar) values (1,'name1'),(2,'name2'),(3,'name3'),(4,'name4');
alldatafounder commented 2 days ago

The Hive server is 2.3.1, Doris is 2.0.7, and the Hive client is 2.1.1-cdh, the version bundled with TIS. The Hive service is currently brought up via Docker, so it can be spun up quickly to inspect the problem; connect through ports 9083 (metastore) and 10000 (HiveServer2). Link: https://pan.baidu.com/s/1yWRi1sLhZEqJah-YvyUYYA extraction code: q7ew
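
As a quick sanity check against that Docker environment, here is a minimal sketch (the metastore URI and database name are assumptions based on the ports above) that lists the pokes columns directly from the metastore on port 9083. Note that the metastore returns Hive type names as plain strings, which still have to be mapped to JDBC types:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.FieldSchema;

    // Minimal sketch: list pokes' columns via the metastore Thrift port (9083).
    // Requires hive-metastore and its dependencies on the classpath.
    public class MetastoreCheck {
        public static void main(String[] args) throws Exception {
            HiveConf conf = new HiveConf();
            conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://localhost:9083");
            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
            // getFields returns Hive type names ("int", "string"), not JDBC types;
            // an unmapped entry here would surface as type: null in the SelectedTab.
            for (FieldSchema field : client.getFields("default", "pokes")) {
                System.out.println(field.getName() + " -> " + field.getType());
            }
            client.close();
        }
    }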

baisui1981 commented 1 day ago

Create a table stored in the PARQUET file format:

CREATE TABLE customer_transactions (
    transaction_id INT,
    customer_id INT,
    amount DECIMAL(10,2),
    product_code STRING,
    transaction_date TIMESTAMP
)
STORED AS PARQUET
LOCATION                                           
   'hdfs://namenode:8020/user/hive/warehouse/customer_transactions' 
TBLPROPERTIES (
    'parquet.compression'='SNAPPY',          -- compression codec: Snappy
    'parquet.block.size'='134217728',        -- block size: 128 MB
    'parquet.page.size'='1048576',           -- page size: 1 MB
    'parquet.dictionary.enabled'='TRUE',     -- enable dictionary encoding
    'parquet.enable.dictionary'='TRUE',      -- enable dictionary encoding (duplicate key, kept to make sure it takes effect)
    'parquet.write.support'='org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'  -- output format class
);

Seed a few records:

INSERT INTO customer_transactions (transaction_id, customer_id, amount, product_code, transaction_date)
VALUES 
(1, 101, 150.00, 'A123', '2024-01-01 00:00:00'),
(2, 102, 200.50, 'B456', '2024-01-02 00:00:00'),
(3, 103, 75.25, 'C789', '2024-01-03 00:00:00'),
(4, 104, 300.00, 'D101', '2024-01-04 00:00:00'),
(5, 105, 50.00, 'E102', '2024-01-05 00:00:00');

Explanation:

These parameters can be adjusted to your specific needs to optimize storage and query performance.
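
To verify that validation would now see concrete types for this table, here is a minimal sketch (connection URL assumed, hive-jdbc on the classpath) that reads the column types back through HiveServer2 on port 10000:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;

    // Minimal sketch: read back customer_transactions' column types via JDBC.
    public class ParquetTypeCheck {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://localhost:10000/default"; // assumed URL
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT * FROM customer_transactions LIMIT 1")) {
                ResultSetMetaData meta = rs.getMetaData();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    // getColumnType returns a java.sql.Types constant,
                    // e.g. DECIMAL for amount, TIMESTAMP for transaction_date.
                    System.out.println(meta.getColumnName(i) + " -> " + meta.getColumnType(i));
                }
            }
        }
    }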