alibaba / euler

A distributed graph deep learning framework.
Apache License 2.0
2.89k stars 559 forks

ZooKeeper-related errors always occur during distributed training #203

Open thomasg19930417 opened 4 years ago

thomasg19930417 commented 4 years ago

The error log is below. Any help is appreciated.
E1219 13:43:45.176839 17771 zk_server_monitor.cc:150] ZK error when checking root node: connection loss.

MeliaLin commented 4 years ago

Set up a standalone ZooKeeper: zkServer.sh start con.cfg
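Before restarting training, it can help to confirm that something is actually listening on the ZooKeeper address, since "connection loss" usually means the client could not reach the server. A minimal sketch, assuming ZooKeeper's default client port 2181 on localhost; the helper name `zk_reachable` is made up for illustration and only tests TCP reachability, not ZooKeeper health:

```python
import socket


def zk_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only shows that a process is listening on the port. It does not
    prove that the process is a healthy ZooKeeper server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Calling `zk_reachable("localhost", 2181)` after `zkServer.sh start` should return True if the standalone server came up. If it returns False, check the host and port passed to Euler against the ZooKeeper configuration.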

thomasg19930417 commented 4 years ago

Set up a standalone ZooKeeper: zkServer.sh start con.cfg

local file io factory register
2019-12-20 10:21:38.942975: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:1998, 1 -> localhost:1999}
2019-12-20 10:21:38.943078: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2000, 1 -> localhost:2001}
2019-12-20 10:21:38.954312: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:2000
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1220 10:21:38.963539 1775 remote_graph.cc:91] Initialize RemoteGraph, connect to server monitor: [localhost:2181, /path/for/euler]
WARNING: Logging before InitGoogleLogging() is written to STDERR
E1220 10:21:39.134063 1916 graph_engine.cc:75] no hdfs file io factory register
I1220 10:21:39.134173 1916 graph_service.cc:179] service init finish
E1220 10:21:39.136083 1916 graph_service.cc:157] service error

Hi, I have switched to a standalone ZooKeeper, but now there is a new error. What is causing this? Could you take a look?

thomasg19930417 commented 4 years ago

[image]

MeliaLin commented 4 years ago

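After starting Hadoop, upload the files to HDFS. At build time there is an option in the CMakeLists file where HDFS has to be switched from OFF to ON; this is described on the first page of the official site. After changing it you have to recompile before it takes effect. For euler_zk_path, the default value is fine.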

thomasg19930417 commented 4 years ago

I installed it with pip. Does the default package not support HDFS? Building from source failed with errors, so I did not compile it.

MeliaLin commented 4 years ago

HDFS is not supported by default.

thomasg19930417 commented 4 years ago

Do you have a prebuilt package you could share? Thanks a lot.

thomasg19930417 commented 4 years ago

[image] When I compile it myself I keep hitting this error and have not found a solution. Did you compile on CentOS?

MeliaLin commented 4 years ago

I am on Alibaba Cloud.

thomasg19930417 commented 4 years ago

I see. I am using my own virtual machine. I will keep trying to get it to compile. Thanks for the answers.

thomasg19930417 commented 4 years ago

E1223 11:39:22.739634 7745 graph_builder.cc:75] hdfs://172.17.1.62:8020/tmp/tgdata/ppi/ppi_train.id data error!
E1223 11:39:24.497680 7743 graph_builder.cc:75] hdfs://172.17.1.62:8020/tmp/tgdata/ppi/ppi-id_map.json data error!
E1223 11:39:28.144388 7742 graph_builder.cc:75] hdfs://172.17.1.62:8020/tmp/tgdata/ppi/ppi-G.json data error!
E1223 11:39:31.212034 7744 graph_builder.cc:75] hdfs://172.17.1.62:8020/tmp/tgdata/ppi/ppi_data.json data error!
I1223 11:39:31.212849 7639 graph_builder.cc:127] Each Thread Load Finish! Node Count:0 Edge Count:0
E1223 11:39:31.212896 7639 graph_builder.cc:131] Graph build failed!
I1223 11:39:31.212918 7639 graph_service.cc:179] service init finish
E1223 11:39:31.212937 7639 graph_service.cc:157] service error

thomasg19930417 commented 4 years ago

Reading data from HDFS keeps failing with the errors above. Could you take a look?
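Logs like the one above are easier to triage once the error lines are separated from the informational ones. A minimal sketch, assuming the standard glog line prefix (severity letter, month and day, time, thread id, `file:line]`); the names `GlogEntry`, `parse_glog` and `errors_only` are hypothetical helpers, not part of Euler or glog:

```python
import re
from typing import List, NamedTuple


class GlogEntry(NamedTuple):
    severity: str  # one of I, W, E, F
    source: str    # file:line, for example graph_builder.cc:75
    message: str


# glog prefix: <severity><mmdd> <hh:mm:ss.uuuuuu> <thread id> <file>:<line>]
_GLOG_PREFIX = re.compile(
    r"(?<!\w)([IWEF])\d{4} \d{2}:\d{2}:\d{2}\.\d{6}\s+\d+ ([\w./-]+:\d+)\]\s*"
)


def parse_glog(text: str) -> List[GlogEntry]:
    """Split glog output into entries, even if the newlines were lost."""
    matches = list(_GLOG_PREFIX.finditer(text))
    entries = []
    for i, m in enumerate(matches):
        # The message runs until the next prefix, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        entries.append(GlogEntry(m.group(1), m.group(2), text[m.end():end].strip()))
    return entries


def errors_only(entries: List[GlogEntry]) -> List[GlogEntry]:
    """Keep only ERROR and FATAL entries."""
    return [e for e in entries if e.severity in ("E", "F")]


# Two entries copied from the log in this thread, joined on one line.
sample = (
    "E1223 11:39:28.144388 7742 graph_builder.cc:75] "
    "hdfs://172.17.1.62:8020/tmp/tgdata/ppi/ppi-G.json data error! "
    "I1223 11:39:31.212849 7639 graph_builder.cc:127] "
    "Each Thread Load Finish! Node Count:0 Edge Count:0"
)
for entry in errors_only(parse_glog(sample)):
    print(entry.source, entry.message)
```

In this log every file-level error comes from graph_builder.cc:75 and the final node and edge counts are both 0, which suggests that none of the input files were parsed as graph data.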

MeliaLin commented 4 years ago

https://github.com/alibaba/euler/wiki/%E6%95%B0%E6%8D%AE%E5%87%86%E5%A4%87 In distributed mode, what goes on HDFS is the data partitions (shards).
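To illustrate what "data partitions" means in general, here is a minimal sketch that splits records into shards by id modulo the shard count. It assumes one JSON record per line with an integer `node_id` field; both are assumptions for illustration, and `shard_records` is a hypothetical helper. It does not produce Euler's on-disk format, for which the wiki page linked above is the reference.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_records(lines: Iterable[str], num_shards: int,
                  id_field: str = "node_id") -> Dict[int, List[str]]:
    """Group JSON-lines records into shards by id modulo num_shards.

    The field name "node_id" is an assumption for illustration only.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    shards: Dict[int, List[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        shards[int(record[id_field]) % num_shards].append(line)
    return dict(shards)
```

Each shard would then be converted and uploaded as its own file, so that every graph server loads only its part of the graph rather than the raw, unsplit JSON.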