alibaba / DataX

DataX is the open-source version of Alibaba Cloud DataWorks Data Integration.

Bug: with hdfsWriter configured on Windows, the Windows file separator \ causes the database to be deleted by mistake when the temporary directory is removed after writing to HDFS #458

Open maozl opened 4 years ago

maozl commented 4 years ago

The DataX log is as follows:

09:30:26.485 [job-0] INFO  c.a.d.p.w.hdfswriter.HdfsWriter$Job - start rename file [hdfs://hacluster/user/hive/warehouse/ods.db/test_maozl504dceda_c549_49c5_b914_cf6483664813\fileb1d2c12d_4bf2_494c_8e1e_6f4be4fd8b37] to file [hdfs://hacluster/user/hive/warehouse/ods.db/test_maozl\fileb1d2c12d_4bf2_494c_8e1e_6f4be4fd8b37].
09:30:26.524 [job-0] INFO  c.a.d.p.w.hdfswriter.HdfsWriter$Job - finish rename file [hdfs://hacluster/user/hive/warehouse/ods.db/test_maozl504dceda_c549_49c5_b914_cf6483664813\fileb1d2c12d_4bf2_494c_8e1e_6f4be4fd8b37] to file [hdfs://hacluster/user/hive/warehouse/ods.db/test_maozl\fileb1d2c12d_4bf2_494c_8e1e_6f4be4fd8b37].
09:30:26.524 [job-0] INFO  c.a.d.p.w.hdfswriter.HdfsWriter$Job - start delete tmp dir [hdfs://hacluster/user/hive/warehouse/ods.db].
09:30:26.551 [job-0] INFO  c.a.d.p.w.hdfswriter.HdfsWriter$Job - finish delete tmp dir [hdfs://hacluster/user/hive/warehouse/ods.db].

Debugging showed that because the temporary file path contains the Windows file separator \, Hadoop's Path.getParent() resolves the parent of the temporary file to hdfs://hacluster/user/hive/warehouse/ods.db when the temporary files are deleted, and the whole database directory is removed as a result.

Fix: replace the file separator used in HdfsWriter with IOUtils.DIR_SEPARATOR_UNIX.
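A minimal sketch of the idea behind this fix (the variable names storePath and fileName and the sample path are illustrative, not the actual HdfsWriter source): on Windows, IOUtils.DIR_SEPARATOR resolves to File.separatorChar, i.e. '\', which leaks into the HDFS path, while IOUtils.DIR_SEPARATOR_UNIX is always '/'.

```java
import java.util.UUID;
import org.apache.commons.io.IOUtils;

public class HdfsPathSeparatorSketch {
    public static void main(String[] args) {
        String storePath = "hdfs://hacluster/user/hive/warehouse/ods.db/test_maozl";
        String fileName  = "file" + UUID.randomUUID().toString().replace('-', '_');

        // Buggy on Windows: IOUtils.DIR_SEPARATOR == File.separatorChar == '\\'
        String buggy = storePath + IOUtils.DIR_SEPARATOR + fileName;

        // Fixed: an HDFS path must always use '/', regardless of the client OS
        String fixed = storePath + IOUtils.DIR_SEPARATOR_UNIX + fileName;

        System.out.println(buggy);
        System.out.println(fixed);
    }
}
```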

felix-thinkingdata commented 4 years ago

Path.getParent() uses int lastSlash = path.lastIndexOf(47);
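In other words, Path.getParent() only searches for '/' (character code 47), never for '\', so a backslash that slips into the path string is treated as part of the last component. A small sketch reproducing this (the sample path abbreviates the one in the log above and assumes stock org.apache.hadoop.fs.Path behaviour):

```java
import org.apache.hadoop.fs.Path;

public class GetParentDemo {
    public static void main(String[] args) {
        // The backslash is part of the last path component, not a separator
        Path tmpFile = new Path(
            "hdfs://hacluster/user/hive/warehouse/ods.db/test_maozlXXXX\\fileYYYY");
        // getParent() relies on lastIndexOf('/'), so the parent resolves
        // to the database directory itself
        System.out.println(tmpFile.getParent());
        // -> hdfs://hacluster/user/hive/warehouse/ods.db
    }
}
```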

ronnierry commented 2 years ago
  1. In com.alibaba.datax.plugin.writer.hdfswriter.HdfsWriter, replace IOUtils.DIR_SEPARATOR with IOUtils.DIR_SEPARATOR_UNIX.
  2. In the renameFile(HashSet tmpFiles, HashSet endFiles) method of com.alibaba.datax.plugin.writer.hdfswriter.HdfsHelper, add the line dstFile = dstFile.replace("\\", "/"); right after String dstFile = it2.next().toString(); (see the sketch after this list).
  3. Rebuild the plugin and replace the locally installed datax\plugin\writer\hdfswriter\hdfswriter-0.0.1-SNAPSHOT.jar with the jar you just built.
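A hedged sketch of what the patched loop from step 2 could look like; only the added replace() line comes from the steps above, while the surrounding iterator structure, type parameters, and method body are assumed rather than copied from HdfsHelper:

```java
import java.util.HashSet;
import java.util.Iterator;

public class RenameFilePatchSketch {
    static void renameFile(HashSet<String> tmpFiles, HashSet<String> endFiles) {
        Iterator<String> it1 = tmpFiles.iterator();
        Iterator<String> it2 = endFiles.iterator();
        while (it1.hasNext() && it2.hasNext()) {
            String srcFile = it1.next();
            String dstFile = it2.next();
            // Added line: normalize Windows '\' separators so HDFS sees a valid path
            dstFile = dstFile.replace("\\", "/");
            System.out.println("rename " + srcFile + " -> " + dstFile);
            // In HdfsHelper the actual rename is done through the HDFS FileSystem API.
        }
    }
}
```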
zhbdesign commented 2 years ago


Link: https://pan.baidu.com/s/1yDMnuXnn0Y7qH6cGf4M_xA (extraction code: ssup). This is a full DataX package in which this scenario has already been fixed and verified; it is ready to use.