An error is thrown when running run_dlrm_ubench_train_allreduce.sh

liligwu commented 2 years ago

The following command fails with the traceback below:

mpirun --allow-run-as-root -np 8 -N 8 --bind-to none ./run_dlrm_ubench_train_allreduce.sh -c xxxx

Traceback (most recent call last):
  File "dlrm/ubench/dlrm_ubench_comms_driver.py", line 133, in <module>
    main()
  File "dlrm/ubench/dlrm_ubench_comms_driver.py", line 106, in main
    comms_main()
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1208, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1161, in runBench
    backendObj.benchmark_comms()
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/pytorch_dist_backend.py", line 659, in benchmark_comms
    self.commsParams.benchTime(index, self.commsParams, self)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1128, in benchTime
    self.reportBenchTime(
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 853, in reportBenchTime
    self.reportBenchTimeColl(commsParams, results, tensorList)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 860, in reportBenchTimeColl
    latencyAcrossRanks = np.array(tensorList)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 723, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

nrsatish commented 2 years ago

@samiwilf can you take a look?

samiwilf commented 2 years ago

This issue will be resolved by https://github.com/facebookresearch/FAMBench/pull/68
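
For anyone hitting this before the fix lands: the traceback shows np.array(tensorList) being called on tensors that live on cuda:0, and NumPy can only read host memory. Below is a minimal sketch of the usual workaround, which copies each tensor to the CPU before converting. It assumes tensorList is a list of torch tensors; the helper name latencies_to_numpy and the sample data are hypothetical, and this is not necessarily the change made in the linked pull request.

```python
import numpy as np
import torch


def latencies_to_numpy(tensor_list):
    """Copy each tensor to host memory, then stack the results into one NumPy array.

    Hypothetical helper for illustration; not part of FAMBench or PARAM.
    """
    # .detach() drops autograd history; .cpu() copies to host memory
    # and returns the tensor unchanged if it is already on the CPU.
    return np.array([t.detach().cpu().numpy() for t in tensor_list])


# Stand-in for the per-rank latency tensors seen in the traceback (made-up values).
# Falls back to the CPU so the sketch also runs on machines without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
tensor_list = [torch.tensor([1.0, 2.0], device=device) for _ in range(8)]

latency_across_ranks = latencies_to_numpy(tensor_list)
print(latency_across_ranks.shape)
```

If tensorList turns out to be a single tensor rather than a list, tensorList.cpu().numpy() should be enough.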

An error is thrown when running run_dlrm_ubench_train_allreduce.sh #61