FederatedAI / FATE

An Industrial Grade Federated Learning Framework
Apache License 2.0

Exception when uploading 1,000,000 rows of data #5492

Closed songsong124 closed 2 months ago

songsong124 commented 7 months ago

Version 1.7.2: an exception occurs when uploading 1,000,000 rows of data.

Traceback (most recent call last):
  File "/data/projects/fate/fate/python/fate_arch/storage/hdfs/_table.py", line 77, in _put_all
    writer.write(hdfs_utils.serialize(k, v))
  File "pyarrow/io.pxi", line 283, in pyarrow.lib.NativeFile.write
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: HDFS Write failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pyarrow/io.pxi", line 262, in pyarrow.lib.NativeFile.flush
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: HDFS Flush failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/projects/fate/fateflow/python/fate_flow/worker/task_executor.py", line 195, in run
    cpn_output = run_object.run(cpn_input)
  File "/data/projects/fate/fateflow/python/fate_flow/components/_base.py", line 149, in run
    method(cpn_input)
  File "/data/projects/fate/fateflow/python/fate_flow/components/upload.py", line 203, in _run
    data_table_count = self.save_data_table(job_id, name, namespace, storage_engine, head)
  File "/data/projects/fate/fateflow/python/fate_flow/components/upload.py", line 233, in save_data_table
    self.upload_file(input_file, head, job_id, input_feature_count)
  File "/data/projects/fate/fateflow/python/fate_flow/components/upload.py", line 338, in upload_file
    table.put_all(data)
  File "/data/projects/fate/fate/python/fate_arch/storage/_table.py", line 122, in put_all
    self._put_all(kv_list, **kwargs)
  File "/data/projects/fate/fate/python/fate_arch/storage/hdfs/_table.py", line 79, in _put_all
    counter = counter + 1
  File "pyarrow/io.pxi", line 132, in pyarrow.lib.NativeFile.close
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: HDFS Flush failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port

hust-suwb commented 7 months ago

The Spark engine in version 1.7 appears to have a bug in the data-upload stage. The maintainers fixed it in a later release; I am not sure exactly which version, but 1.11 at least no longer has this problem.

songsong124 commented 7 months ago

Adding the following property to hdfs-site.xml solved the upload problem:

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>

Mine is a single-node Hadoop/Spark deployment; with this setting, all pipelines now run normally.
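To confirm the override is actually picked up, it can help to check the hdfs-site.xml that the client node reads. A small sketch, assuming the standard `HADOOP_CONF_DIR` convention (the `/etc/hadoop/conf` fallback is an assumption; single-node installs often use `$HADOOP_HOME/etc/hadoop` instead):

```python
import os
import xml.etree.ElementTree as ET

def policy_from_hdfs_site(conf_dir):
    """Return the replace-datanode-on-failure policy set in hdfs-site.xml, or None."""
    root = ET.parse(os.path.join(conf_dir, "hdfs-site.xml")).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name") == "dfs.client.block.write.replace-datanode-on-failure.policy":
            return prop.findtext("value")
    return None

if __name__ == "__main__":
    # Assumed default location; adjust to your deployment.
    conf_dir = os.environ.get("HADOOP_CONF_DIR", "/etc/hadoop/conf")
    print(policy_from_hdfs_site(conf_dir))
```

Note that this setting disables datanode replacement on pipeline failure, which is the usual workaround for single-node or very small clusters where no replacement datanode exists.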