Dear author, thank you very much for your excellent work on this project. When I train my own SGDet model, I encounter two errors during the validation phase.
No.1 is as follows:
Traceback (most recent call last):
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 994, in <module>
    main()
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 973, in main
    model, best_checkpoint = train(
                             ^^^^^^
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 704, in train
    run_val(cfg, model, val_data_loaders, args['distributed'], logger, device=device)
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 843, in run_val
    if len(dataset_result) == 1:
       ^^^^^^^^^^^^^^^^^^^
TypeError: object of type 'float' has no len()

Traceback (most recent call last):
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 994, in <module>
    main()
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 973, in main
    model, best_checkpoint = train(
                             ^^^^^^
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 704, in train
    run_val(cfg, model, val_data_loaders, args['distributed'], logger, device=device)
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 848, in run_val
    dataset_result[k1][k2] = torch.distributed.all_reduce(torch.tensor(np.mean(v2)).to(device).unsqueeze(0)).item() / torch.distributed.get_world_size()
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'item'
No.2 is as follows:
Traceback (most recent call last):
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 1514, in <module>
    main()
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 1493, in main
    model, best_checkpoint = train(
                             ^^^^^^
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 1253, in train
    val_result = run_val(cfg, model, val_data_loaders, args['distributed'], logger)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 1363, in run_val
    if len(dataset_result) == 1:
       ^^^^^^^^^^^^^^^^^^^
TypeError: object of type 'float' has no len()

Traceback (most recent call last):
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 1514, in <module>
    main()
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 1493, in main
    model, best_checkpoint = train(
                             ^^^^^^
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 1253, in train
    val_result = run_val(cfg, model, val_data_loaders, args['distributed'], logger)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Dpan/wyc/realtime_rwsg/SGG-Benchmark/tools/relation_train_net.py", line 1368, in run_val
    dataset_result[k1][k2] = torch.distributed.all_reduce(torch.tensor(np.mean(v2)).to(device).unsqueeze(0)).item() / torch.distributed.get_world_size()
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Dpan/wyc/anaconda3/envs/rtrw_sg/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Dpan/wyc/anaconda3/envs/rtrw_sg/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1992, in all_reduce
    work = group.allreduce([tensor], opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No backend type associated with device type cpu
Could you tell me how to solve them? Thank you very much!
Yes, there may be some issues when training with multiple GPUs. I will investigate that another day; in the meantime, try running the training on a single GPU, which should work.
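For anyone who needs multi-GPU validation before a fix lands: the `AttributeError` and `RuntimeError` above are consistent with two documented behaviours of `torch.distributed.all_reduce`. It reduces the tensor in place and returns `None` when `async_op=False`, and the tensor has to live on a device the active backend supports (a CUDA device for NCCL). Below is a minimal sketch of a scalar-averaging helper written under those assumptions. The name `reduce_mean_scalar` is hypothetical and not part of SGG-Benchmark, and the sketch does not address the separate `TypeError` raised by `len(dataset_result)`.

```python
import torch
import torch.distributed as dist


def reduce_mean_scalar(value, device):
    """Average a Python scalar across all ranks (hypothetical helper)."""
    # Single-process run, or process group not initialised: nothing to reduce.
    if not (dist.is_available() and dist.is_initialized()):
        return float(value)

    # Create the tensor directly on the target device. With the NCCL
    # backend a CPU tensor triggers
    # "No backend type associated with device type cpu".
    t = torch.tensor([float(value)], dtype=torch.float32, device=device)

    # all_reduce sums in place and returns None when async_op=False,
    # so the result is read from `t`, not from the return value.
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return t.item() / dist.get_world_size()
```

Inside `run_val`, the failing assignment could then become something like `dataset_result[k1][k2] = reduce_mean_scalar(np.mean(v2), device)`, assuming `device` is the CUDA device of the current rank.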