Closed Sanzo00 closed 2 years ago
In v1.0.1, if client_num < 1, the file-system-based synchronization mode is used, and the error is "[2022-07-31 31:29:34.700907] Invalid file path: root://graphlearn E20220801 07:29:34.700927 2256 env.cc:112] File system not implemented: root://graphlearn F20220801 07:29:34.701110 2256 fs_coordinator.cc:42] Invalid tracker path: root://graphlearn/". This error is strange because the default wheel package supports the local file system; it may be caused by an old address-sync directory that was not cleared. You can try running the gcn example on the master branch, which uses RPC-based address synchronization instead.
Thanks for the reply. So I can solve this problem by setting client_num, because it then switches to RPC mode via gl.set_tracker_mode(0), is that right?
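I also have another question. When I train on two machines, the message "Epoch 59 out of range." is printed after each epoch. Is this a bug in the program?
I tried client_num = 1, and this message does not appear.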
# server 1
python -m torch.distributed.launch --use_env --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=172.30.115.32 --master_port=10086 train.py.bak --client_num=2
# server 2
python -m torch.distributed.launch --use_env --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=172.30.115.32 --master_port=10086 train.py.bak --client_num=2
client_num = 1:
When client_num > 1, multiple clients connect to one server and consume the data on that server asynchronously. So a client's request may reach the server after the server's state is already out of range (the data has been consumed by other clients). This log is not an error or a bug and does not affect normal execution.
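The behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not GraphLearn's implementation: `ToyServer`, `next_batch`, `late_requests`, and `run_epoch` are hypothetical names, and threads stand in for the clients. The first request that finds the epoch drained is treated as the normal end-of-epoch signal; any further request that arrives after that is a "late" one, which is roughly the situation the out-of-range log reports.

```python
import threading


class ToyServer:
    """Hypothetical stand-in for a server handing out one epoch of batches."""

    def __init__(self, num_batches):
        self._remaining = num_batches
        self._exhausted_seen = False
        self.late_requests = 0  # requests that arrived after the epoch was drained
        self._lock = threading.Lock()

    def next_batch(self):
        with self._lock:
            if self._remaining > 0:
                self._remaining -= 1
                return self._remaining
            if self._exhausted_seen:
                # Another client already observed the end of this epoch.
                self.late_requests += 1
            self._exhausted_seen = True
            return None


def run_client(server, consumed):
    """Pull batches until the server reports that the epoch is exhausted."""
    count = 0
    while server.next_batch() is not None:
        count += 1
    consumed.append(count)


def run_epoch(num_batches, num_clients):
    """Run one epoch with the given number of concurrent clients."""
    server = ToyServer(num_batches)
    consumed = []
    threads = [
        threading.Thread(target=run_client, args=(server, consumed))
        for _ in range(num_clients)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return server, consumed
```

In this toy model a single client never produces a late request, while every additional client produces exactly one per epoch, and all batches are still consumed exactly once.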
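Understood. I have one more question I would like to confirm: I observed that the per-epoch loss and the final Test Accuracy differ between servers. Is this because each server is responsible for different vertices and only computes the loss and accuracy for the vertices it is responsible for, while the parameters W that each server ends up with should be identical (torch DDP takes care of synchronizing them in each epoch)?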
server 1:
server 2:
Yes. In the current PyTorch example, we sample with graphlearn, use the model part of PyG directly, and train it with PyTorch DDP.
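Why the per-server losses differ while the parameters stay identical can be seen in a small pure-Python sketch. It is an illustration under assumptions, not the example's code: a one-parameter model `w * x` and the made-up `shards` data stand in for the GCN and the graph partitions, and a plain average of gradients stands in for the all-reduce that DDP performs during each backward pass.

```python
def local_loss_and_grad(w, shard):
    """Mean squared error of the model w * x on one replica's shard, and its gradient."""
    n = len(shard)
    loss = sum((w * x - y) ** 2 for x, y in shard) / n
    grad = sum(2 * (w * x - y) * x for x, y in shard) / n
    return loss, grad


def train(shards, steps=20, lr=0.05):
    """Simulate data-parallel training with one replica per shard."""
    weights = [0.0 for _ in shards]  # every replica starts from the same parameters
    losses = []
    for _ in range(steps):
        results = [local_loss_and_grad(w, s) for w, s in zip(weights, shards)]
        losses = [loss for loss, _ in results]  # each replica's own local loss
        # Averaging the gradients stands in for DDP's all-reduce.
        avg_grad = sum(grad for _, grad in results) / len(results)
        # Every replica applies the same update, so the parameters stay in sync.
        weights = [w - lr * avg_grad for w in weights]
    return weights, losses


# Hypothetical data: each "server" only sees its own samples.
shards = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(3.0, 5.7), (4.0, 8.4)],
]
weights, losses = train(shards)
```

After training, `weights[0]` and `weights[1]` are equal because both replicas applied the same averaged gradient at every step, whereas `losses[0]` and `losses[1]` differ because each is computed only on that replica's shard.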
Understood, thank you for your reply.
I tried to run examples/pytorch/gcn/train.py and ran into the following problem. The error message is as follows:
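I don't know why, but it runs once I add the client_num parameter. As for client_num, the comment says it is used to specify the number of PyTorch dataloader workers? Looking forward to your reply, thanks!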