Haiyang-W / DSVT

[CVPR2023] Official Implementation of "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets"
https://arxiv.org/abs/2301.06051
Apache License 2.0

terminate called after throwing an instance of 'std::bad_alloc' #14

Closed YoushaaMurhij closed 1 year ago

YoushaaMurhij commented 1 year ago

Hi! I tried to train the model on four Tesla V100s but ran into this problem. Could you suggest how to fix it? Thanks!

python3 -m torch.distributed.launch \
 --nproc_per_node=4 \
 --rdzv_endpoint=localhost:14430 train.py \
 --launcher pytorch \
 --cfg_file ./cfgs/dsvt_models/dsvt_plain_D512e.yaml \
 --sync_bn --logger_iter_interval 500
/home/user/.local/lib/python3.10/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
terminate called after throwing an instance of 'std::bad_alloc'
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
  what():  std::bad_alloc
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 36) of binary: /usr/bin/python3
scripts/dist_train.sh: line 17:    26 Segmentation fault      (core dumped) python3 -m torch.distributed.launch --nproc_per_node=${NGPUS} --rdzv_endpoint=localhost:${PORT} train.py --launcher pytorch ${PY_ARGS}
chenshi3 commented 1 year ago

The issue might be caused by the installed version of the torch_scatter package. Try cleaning the pip cache and then carefully reinstalling torch_scatter; specifically, you may need version 2.0.9 or 2.0.10.
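Before retraining, it can help to confirm which torch_scatter version is actually installed in the environment. A minimal sketch using only the standard library; the `installed_version` helper name is illustrative, and the known-good versions (2.0.9 / 2.0.10) are the ones reported in this thread:

```python
# Sanity-check the installed torch_scatter version against the
# versions reported to work in this issue (2.0.9 / 2.0.10).
from importlib import metadata

KNOWN_GOOD = ("2.0.9", "2.0.10")  # versions suggested in this thread

def installed_version(pkg):
    """Return pkg's installed version string, or None if it is not installed."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None

if __name__ == "__main__":
    v = installed_version("torch_scatter")
    if v is None:
        print("torch_scatter is not installed")
    elif v not in KNOWN_GOOD:
        print(f"torch_scatter {v} installed; the thread reports {KNOWN_GOOD} work")
    else:
        print(f"torch_scatter {v} matches a known-good version")
```

If the version differs, `pip cache purge` followed by a pinned reinstall (e.g. `pip install torch_scatter==2.0.9`) matches the fix described above; make sure the wheel matches your PyTorch and CUDA versions.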

YoushaaMurhij commented 1 year ago

Thanks for your quick response! It works now after switching torch_scatter to version 2.0.9.