alibaba / DataX

DataX is the open-source version of Alibaba Cloud DataWorks Data Integration.

DataX 3.0: reading from and writing to HDFS EC paths both fail #2145

Closed. KelvinChi closed this issue 1 week ago.

KelvinChi commented 3 weeks ago

DataX 3.0: exceptions on HDFS EC paths

Reading

The source path is the data directory of an ORC Hive table and had its policy converted with hdfs ec -setPolicy -path "${sourcePath}" -policy RS-6-3-1024k. When DataX reads this path and writes into the target database, the following exception is thrown:

经DataX智能分析,该任务最可能的错误原因是:
com.alibaba.datax.common.exception.DataXException: Code:[HdfsReader-12], Description:[文件类型目前不支持].  - 文件[hdfs://dphadoop/user/hive/warehouse/devdb.db/ads_asc_ro_retain_back_rate_1m_full/pt=20240619/part-00000-05701c0c-e8b6-4e31-990b-3194342198c6-c000]的类型与用户配置的fileType类型不一致,请确认您配置的目录下面所有文件的类型均为[orc]
        at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:30)
        at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.addSourceFileByType(DFSUtil.java:195)
        at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.getHDFSAllFilesNORegex(DFSUtil.java:172)
        at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.getHDFSAllFiles(DFSUtil.java:142)
        at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.getAllFiles(DFSUtil.java:113)
        at com.alibaba.datax.plugin.reader.hdfsreader.HdfsReader$Job.prepare(HdfsReader.java:169)
        at com.alibaba.datax.core.job.JobContainer.prepareJobReader(JobContainer.java:737)
        at com.alibaba.datax.core.job.JobContainer.prepare(JobContainer.java:318)
        at com.alibaba.datax.core.job.JobContainer.start(JobContainer.java:118)
        at com.alibaba.datax.core.Engine.start(Engine.java:93)
        at com.alibaba.datax.core.Engine.entry(Engine.java:175)
        at com.alibaba.datax.core.Engine.main(Engine.java:208)
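
As a quick check of the setup described above, the erasure-coding policy and the files HdfsReader will scan can be inspected with the standard HDFS CLI. This is a minimal sketch, assuming a Hadoop 3.x client on the gateway host; the path is the one from the reader config below:

# Confirm which erasure-coding policy is applied to the source directory
hdfs ec -getPolicy -path /user/hive/warehouse/devdb.db/rate_1m_full/pt=20240619/

# List the ORC part files that HdfsReader will pick up under that directory
hdfs dfs -ls /user/hive/warehouse/devdb.db/rate_1m_full/pt=20240619/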

Writing

The source database is MySQL. When hdfswriter writes into a path whose policy was changed with hdfs ec -setPolicy -path "${targetPath}" -policy RS-6-3-1024k, the following exception occurs:

 - java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
        at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2273)
        at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2238)
        at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2204)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:2285)
        at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:106)
        at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:91)
        at com.alibaba.datax.plugin.writer.hdfswriter.HdfsHelper.orcFileStartWrite(HdfsHelper.java:398)
        at com.alibaba.datax.plugin.writer.hdfswriter.HdfsWriter$Task.startWrite(HdfsWriter.java:437)
        at com.alibaba.datax.core.taskgroup.runner.WriterRunner.run(WriterRunner.java:56)
        at java.lang.Thread.run(Thread.java:745)

        at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:48)
        at com.alibaba.datax.plugin.writer.hdfswriter.HdfsHelper.orcFileStartWrite(HdfsHelper.java:404)
        at com.alibaba.datax.plugin.writer.hdfswriter.HdfsWriter$Task.startWrite(HdfsWriter.java:437)
        at com.alibaba.datax.core.taskgroup.runner.WriterRunner.run(WriterRunner.java:56)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
        at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2273)
        at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2238)
        at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2204)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:2285)
        at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:106)
        at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:91)
        at com.alibaba.datax.plugin.writer.hdfswriter.HdfsHelper.orcFileStartWrite(HdfsHelper.java:398)
        ... 3 more
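
Both failures are consistent with the HDFS client jars bundled in the plugins predating erasure coding (EC only exists in Hadoop 3.x, while the stock hdfsreader/hdfswriter ship Hadoop 2.x era clients), which matches the fix reported at the end of this issue. A simple way to see what the plugins actually bundle (paths follow the install layout used later in this issue):

# List the Hadoop/Hive client jars bundled with the two HDFS plugins
ls /data/datax_20240711/plugin/reader/hdfsreader/libs/ | grep -E 'hadoop-|hive-'
ls /data/datax_20240711/plugin/writer/hdfswriter/libs/ | grep -E 'hadoop-|hive-'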

DataX JSON configuration

Reading

{
    "job": {
        "content": [
            {
                "reader": {
                    "parameter": {
                        "path": "/user/hive/warehouse/devdb.db/rate_1m_full/pt=20240619/",
                        "defaultFS": "hdfs://xxhadoop",
                        "hadoopConfig": {
                            "dfs.client.failover.proxy.provider.dphadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
                            "dfs.namenode.rpc-address.dphadoop.nn1": "xxxxxxxxx:8020",
                            "dfs.namenode.rpc-address.dphadoop.nn2": "xxxxxxxxx:8020",
                            "dfs.ha.namenodes.dphadoop": "nn1,nn2",
                            "dfs.nameservices": "xxhadoop"
                        },
                        "column": [
                            {
                                "index": 0,
                                "name": "data_month",
                                "type": "string"
                            },
                            {
                                "index": 1,
                                "name": "vin",
                                "type": "string"
                            },
                            {
                                "index": 2,
                                "name": "category1_tag",
                                "type": "string"
                            }
                        ],
                        "fileType": "orc",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "\u0001"
                    },
                    "name": "hdfsreader"
                },
                "writer": {
                    "parameter": {
                        "jdbcUrl": "jdbc:mysql://10.64.22.215:9030/",
                        "database": "sqdt",
                        "table": "full_devdb_20240625",
                        "column": [
                            "data_month",
                            "vin",
                            "category1_tag"
                        ],
                        "preSql": [
                            "delete from  sqdt.full_devdb_20240625 where data_month<=DATE_SUB('2024-06-19', INTERVAL 1 Month) or data_month='2024-06-19'"
                        ],
                        "postSql": [
                        ],
                        "username": "starrocks",
                        "password": "Star@alt2P$wd",
                        "loadUrl": [
                            "10.64.22.215:8030",
                            "10.64.22.217:8030",
                            "10.64.22.218:8030"
                        ],
                        "loadProps": {
                            "format": "json",
                            "strip_outer_array": true,
                            "strict_mode": true
                        }
                    },
                    "name": "starrockswriter"
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 5
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        }
    }
}

Writing

{
    "job": {
        "content": [
            {
                "reader": {
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://xx.xx.xx.xx:3306/hivedb?yearIsDateType=false&useUnicode=true&characterEncoding=utf-8"
                                ],
                                "querySql": [
                                    "select PART_ID,FROM_UNIXTIME(CREATE_TIME) CREATE_TIME,PART_NAME from `partitions`"
                                ]
                            }
                        ],
                        "username": "xxxx",
                        "password": "xxxxxxxxx"
                    },
                    "name": "mysqlreader"
                },
                "writer": {
                    "parameter": {
                        "path": "/user/hive/warehouse/devdb.db/test_20240624/pt=20240623",
                        "fileName": "data",
                        "defaultFS": "hdfs://xxhadoop",
                        "hadoopConfig": {
                            "dfs.client.failover.proxy.provider.dphadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
                            "dfs.namenode.rpc-address.dphadoop.nn1": "xxxxxxxxx:8020",
                            "dfs.namenode.rpc-address.dphadoop.nn2": "xxxxxxxxx:8020",
                            "dfs.ha.namenodes.dphadoop": "nn1,nn2",
                            "dfs.nameservices": "xxhadoop",
                            "dfs.namenode.ec.enabled": "true",
                            "dfs.replication": 1
                        },

                        "column": [
                            {
                                "index": 0,
                                "name": "part_id",
                                "type": "string"
                            },
                            {
                                "index": 1,
                                "name": "create_time",
                                "type": "string"
                            },
                            {
                                "index": 2,
                                "name": "part_name",
                                "type": "string"
                            }
                        ],
                        "writeMode": "truncate",
                        "compress": "NONE",
                        "fieldDelimiter": "\u0001",
                        "fileType": "orc",
                        "encoding": "gbk",
                        "hiveJdbcUrl": "jdbc:hive2://xx.xx.xx.xx:8007/devdb",
                        "hivePreSql": "alter table devdb.test_20240624 drop if exists partition (pt='20240623'); alter table devdb.test_20240624 add if not exists partition (pt='20240623');",
                        "hiveUsername": "xxxx",
                        "hivePassword": "xxxxxxxxx"
                    },
                    "name": "hdfswriter"
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 5
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        }
    }
}
KelvinChi commented 1 week ago

Replacing the Hadoop-related jars in the reader/writer plugins with the 3.3.3 versions (and the Hive jars with 2.3.9) solved the problem.

xxxx@xxxx:libs ll hadoop-*
-rw-r--r-- 1 smcv hdfs    63348 Jul 11 14:45 hadoop-aliyun-3.3.3.jar
-rw-r--r-- 1 smcv hdfs    25101 Jul 11 14:45 hadoop-annotations-3.3.3.jar
-rw-r--r-- 1 smcv hdfs   104435 Jul 11 14:45 hadoop-auth-3.3.3.jar
-rw-r--r-- 1 smcv hdfs 30524177 Jul 11 14:48 hadoop-client-runtime-3.3.3.jar
-rw-r--r-- 1 smcv hdfs  4470534 Jul 11 14:45 hadoop-common-3.3.3.jar
-rw-r--r-- 1 smcv hdfs  5500884 Jul 11 14:45 hadoop-hdfs-client-3.3.3.jar
-rw-r--r-- 1 smcv hdfs  1636325 Jul 11 14:45 hadoop-mapreduce-client-core-3.3.3.jar
-rw-r--r-- 1 smcv hdfs  3649778 Jul 11 14:45 hadoop-yarn-api-3.3.3.jar
-rw-r--r-- 1 smcv hdfs  2965761 Jul 11 14:45 hadoop-yarn-common-3.3.3.jar
-rw-r--r-- 1 smcv hdfs   258471 Jul 11 14:45 hadoop-yarn-server-applicationhistoryservice-3.3.3.jar
-rw-r--r-- 1 smcv hdfs  1439997 Jul 11 14:45 hadoop-yarn-server-common-3.3.3.jar
-rw-r--r-- 1 smcv hdfs  2492913 Jul 11 14:45 hadoop-yarn-server-resourcemanager-3.3.3.jar
-rw-r--r-- 1 smcv hdfs    56807 Jul 11 14:45 hadoop-yarn-server-web-proxy-3.3.3.jar

xxxx@xxxx:libs ll hive-*
-rw-r--r-- 1 smcv hdfs    44704 Jul 11 14:45 hive-cli-2.3.9.jar
-rw-r--r-- 1 smcv hdfs   436169 Jul 11 14:45 hive-common-2.3.9.jar
-rw-r--r-- 1 smcv hdfs 45423312 Jul 11 14:45 hive-exec-2.3.9.jar
-rw-r--r-- 1 smcv hdfs   265922 Jul 11 14:45 hive-hcatalog-core-2.3.9.jar
-rw-r--r-- 1 smcv hdfs   116364 Jul 11 14:45 hive-jdbc-2.3.9.jar
-rw-r--r-- 1 smcv hdfs  8195966 Jul 11 14:45 hive-metastore-2.3.9.jar
-rw-r--r-- 1 smcv hdfs   916630 Jul 11 14:45 hive-serde-2.3.9.jar
-rw-r--r-- 1 smcv hdfs   527783 Jul 11 14:45 hive-service-2.3.9.jar
-rw-r--r-- 1 smcv hdfs  1549366 Jul 11 14:45 hive-service-rpc-2.3.9.jar
-rw-r--r-- 1 smcv hdfs    53902 Jul 11 14:45 hive-shims-0.23-2.3.9.jar
-rw-r--r-- 1 smcv hdfs     8786 Jul 11 14:45 hive-shims-2.3.9.jar
-rw-r--r-- 1 smcv hdfs   119936 Jul 11 14:45 hive-shims-common-2.3.9.jar
-rw-r--r-- 1 smcv hdfs    12923 Jul 11 14:45 hive-shims-scheduler-2.3.9.jar

cd "${HADOOP_HOME}"/share/hadoop/ cp client/hadoop-client-runtime-3.3.3.jar /data/datax_20240711/plugin/reader/hdfsreader/libs/ cp hdfs/lib/woodstox-core-5.3.0.jar /data/datax_20240711/plugin/reader/hdfsreader/libs/ cp hdfs/lib/stax2-api-4.2.1.jar /data/datax_20240711/plugin/reader/hdfsreader/libs/ cp hdfs/lib/commons-configuration2-2.1.1.jar /data/datax_20240711/plugin/reader/hdfsreader/libs/ cp hdfs/lib/re2j-1.1.jar /data/datax_20240711/plugin/reader/hdfsreader/libs/ cp client/hadoop-client-runtime-3.3.3.jar /data/datax_20240711/plugin/writer/hdfswriter/libs/ cp hdfs/lib/woodstox-core-5.3.0.jar /data/datax_20240711/plugin/writer/hdfswriter/libs/ cp hdfs/lib/stax2-api-4.2.1.jar /data/datax_20240711/plugin/writer/hdfswriter/libs/ cp hdfs/lib/commons-configuration2-2.1.1.jar /data/datax_20240711/plugin/writer/hdfswriter/libs/ cp hdfs/lib/re2j-1.1.jar /data/datax_20240711/plugin/writer/hdfswriter/libs/