Training code - Githubissues

wutheringcoo commented 3 years ago

A question: what does this error mean? RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/device.cc:208] ifa != nullptr. Unable to find interface for: [10.16.32.68]. It occurs when I run the training code from GitHub: python projects/SparseRCNN/train_net.py --num-gpus 4 --config-file projects/SparseRCNN/configs/sparsercnn.res50.100pro.3x.yaml --eval-only MODEL.WEIGHTS path/to/model.pth

PeizeSun commented 3 years ago

Hi~ Is this training or testing? The training command does not include --eval-only MODEL.WEIGHTS path/to/model.pth.

wutheringcoo commented 3 years ago

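Hey, the command I use for training is: python projects/SparseRCNN/train_net.py --num-gpus 4 --dist-url tcp://10.16.100.68:61943 --config-file projects/SparseRCNN/configs/sparsercnn.res50.100pro.3x.yaml There is no --eval; sorry, my earlier description was rushed and inaccurate.

What I am trying to do: I run training with the command above on a single server with 4 GPUs, and an error occurs while the config and network parameters are being loaded. The last log line before the crash is: [d2.data.build]: Distribution of instances among all 80 categories:. The output after it is:

[d2.data.build]: Using training sampler TrainingSampler
Traceback (most recent call last):
File "projects/SparseRCNN/train_net.py", line 140, in <module> args=(args,)
File "/data-nbd/wuxc/SparseR-CNN/detectron2/engine/defaults.py", line 284, in __init__ data_loader = self.build_train_loader(cfg)

The final error is: RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/device.cc:208] ifa != nullptr. Unable to find interface for: [10.16.32.68]

So I tried changing the default tcp://ip:port (I also tried the default 127.0.0.1 to reach the local server), but the result was the same error every time. In the end, only by changing the --num-gpus setting could I run without hitting the multi-GPU communication error above. So how can I train on a single machine with 4 GPUs? After that I would like to move on to multi-machine training with 8 GPUs. Thanks!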
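The error text suggests that Gloo was given an address (10.16.32.68) that it could not match to any local network interface. The following is a minimal diagnostic sketch, not a confirmed root cause: it assumes the address comes from resolving the machine's hostname, which this thread does not verify. `can_bind` is a hypothetical helper introduced only for illustration.

```python
import socket


def can_bind(ip: str) -> bool:
    """Return True if a TCP socket can bind to `ip` on this machine.

    Binding normally fails when no local interface owns the address.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind((ip, 0))  # port 0 lets the OS pick any free port
        return True
    except OSError:
        return False


if __name__ == "__main__":
    hostname = socket.gethostname()
    try:
        # Assumption: the failing address is what the hostname resolves to.
        resolved = socket.gethostbyname(hostname)
    except socket.gaierror:
        resolved = None
    print(f"hostname {hostname!r} resolves to {resolved}")
    if resolved is not None:
        print("bindable locally:", can_bind(resolved))
```

If the resolved address is not bindable, the hostname mapping (for example in /etc/hosts) or the interface selection is a likely place to look.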

iFighting commented 3 years ago

try tcp://127.0.0.1:50150?

wutheringcoo commented 3 years ago

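Initial investigation suggests the problem is related to the NVIDIA NCCL and Gloo settings. Training runs now, but I hit: FileNotFoundError: [Errno 2] No such file or directory: 'datasets/coco/train2017/000000537304.jpg'. What is going on with this image? It is not in COCO at all.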
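A FileNotFoundError like this usually means the annotation file lists an image that is not present in the image directory. Here is a small sketch for listing such mismatches, assuming a COCO-style JSON with a top-level "images" list whose entries carry a "file_name" field. The paths under `__main__` follow the usual datasets/coco layout and are assumptions; `find_missing_images` is a hypothetical helper.

```python
import json
import os


def find_missing_images(annotation_file: str, image_dir: str) -> list:
    """Return file names listed in a COCO-style annotation file but absent from image_dir."""
    with open(annotation_file, "r", encoding="utf-8") as f:
        images = json.load(f)["images"]
    return [
        img["file_name"]
        for img in images
        if not os.path.isfile(os.path.join(image_dir, img["file_name"]))
    ]


if __name__ == "__main__":
    # Assumed paths; adjust them to your own dataset location.
    ann = "datasets/coco/annotations/instances_train2017.json"
    img_dir = "datasets/coco/train2017"
    if os.path.isfile(ann):
        missing = find_missing_images(ann, img_dir)
        print(f"{len(missing)} image(s) missing")
        for name in missing[:10]:
            print(name)
```

A non-empty result would point to an incomplete download or to an annotation file that does not match the image folder.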

wutheringcoo commented 3 years ago

Closing this issue; it is resolved.

1061136002 commented 3 years ago

We are hitting a similar error. How did you solve it?

wutheringcoo commented 3 years ago

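Set the network interface used by the communication backend through an environment variable: open ~/.bashrc (vi ~/.bashrc) and add export GLOO_SOCKET_IFNAME=enp97s0f1,enp218s0f0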
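The same setting can also be applied from Python before the distributed backend starts. This is a sketch under assumptions: the interface names are the ones reported in this thread and will differ on other machines, and `pick_interfaces` is a hypothetical helper that keeps only names that actually exist locally.

```python
import os
import socket


def pick_interfaces(preferred, available):
    """Keep the preferred interface names that exist, comma-joined in the given order."""
    present = set(available)
    return ",".join(name for name in preferred if name in present)


if __name__ == "__main__":
    # Interface names taken from this thread; replace with your own.
    preferred = ["enp97s0f1", "enp218s0f0"]
    available = [name for _, name in socket.if_nameindex()]
    print("interfaces on this machine:", available)
    chosen = pick_interfaces(preferred, available)
    if chosen:
        # Must be set before torch.distributed initializes the Gloo backend.
        os.environ["GLOO_SOCKET_IFNAME"] = chosen
        print("GLOO_SOCKET_IFNAME =", chosen)
```

Run on its own, this only prints the selection. To take effect for training, the assignment would need to run at the top of the training script, or the variable can be exported in the shell as described above.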

PeizeSun / SparseR-CNN

Training code #28