Sense-GVT / Fast-BEV

Fast-BEV: A Fast and Strong Bird’s-Eye View Perception Baseline

how to train and test? #9

Closed Rex-LK closed 1 year ago

ymlab commented 1 year ago

This script can help: https://github.com/Sense-GVT/Fast-BEV/blob/dev/tools/fastbev_run.sh

hly2990 commented 1 year ago

Could you provide an example of how to start training?

Rex-LK commented 1 year ago

I tried two ways of training:

1. `sh tools/fastbev_run.sh` (with `slurm_train $PARTITION 1 fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4`), which fails with:

```
MMDET3D: /data/rex/BEV/Fast-BEV-dev
SRUN_ARGS: -s
basename: missing operand
Try 'basename --help' for more information.
tools/fastbev_run.sh: 16: arithmetic expression: division by zero: "fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4<8?fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4:8"
```

2. `python tools/train.py configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py`, which fails with:

```
File "/home/snk/anaconda3/envs/fastbev/lib/python3.8/site-packages/nuscenes/nuscenes.py", line 225, in getind
  return self._token2ind[table_name][token]
KeyError: '442c7729b9d0455ca75978f1a7fdab3a'
```

ymlab commented 1 year ago

The training script provided by this repo is based on slurm. You can see that two parameters need to be passed to the training script, namely PARTITION and QUOTATYPE (https://github.com/Sense-GVT/Fast-BEV/blob/dev/tools/fastbev_run.sh#L93), which specify the training partition and the resource allocation type, respectively. Maybe at some point I can provide a dist-based training script (https://github.com/open-mmlab/mmdetection/tree/master/tools).

As an example, my usual command to start training is `sh ./tools/fastbev_run.sh Test reserved`.
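
Until then, a minimal sketch of what such a dist-based launcher could look like is below, modeled on mmdetection's `tools/dist_train.sh`; Fast-BEV does not ship this file, so the file name and argument handling are assumptions:

```bash
#!/usr/bin/env bash
# dist_train.sh (sketch, not part of Fast-BEV): single-node multi-GPU training,
# modeled on mmdetection's tools/dist_train.sh.
CONFIG=$1            # e.g. configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py
GPUS=$2              # number of GPUs on this machine
PORT=${PORT:-29500}

# torch.distributed.launch spawns $GPUS workers and sets RANK/LOCAL_RANK/
# WORLD_SIZE for each of them, which `--launcher pytorch` relies on.
python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    tools/train.py $CONFIG --launcher pytorch ${@:3}
```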

ymlab commented 1 year ago

> I tried two ways of training: 1. `sh tools/fastbev_run.sh` … 2. `python tools/train.py configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py` … `KeyError: '442c7729b9d0455ca75978f1a7fdab3a'`

I am not sure about the specific reason for the second error.
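
A common cause of this kind of nuScenes `KeyError` (an assumption, not confirmed in this thread) is a mismatch between the generated info `.pkl` files and the nuScenes version/split actually on disk, so a token baked into the pickles does not exist in the loaded tables. A quick check, with version and dataroot as example values to adjust:

```bash
# Sketch: verify whether the failing token exists in the nuScenes tables you
# actually load (v1.0-trainval and data/nuscenes are example values).
python -c "
from nuscenes.nuscenes import NuScenes
nusc = NuScenes(version='v1.0-trainval', dataroot='data/nuscenes', verbose=False)
tok = '442c7729b9d0455ca75978f1a7fdab3a'
# _token2ind maps table name -> {token: index}; an absent token reproduces the error.
print([t for t, m in nusc._token2ind.items() if tok in m] or 'token not found')
"
```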

hly2990 commented 1 year ago

Thank you for your answer. However, when running the PyTorch code on multiple GPUs in a single server, my program stopped responding for a long time with the setting `os.environ['WORLD_SIZE'] = '2'`. Could you provide a code example for running on a single server rather than a cluster? Thanks a lot!

Ignite616 commented 1 year ago

I am also running on multiple GPUs in a single server. When I run `python tools/train.py configs/fastbev/exp/paper/fastbev_m1_r18_s320x880_v200x200x4_c192_d2_f4.py --launcher pytorch`, I get an error. Can you tell me how to set the rank, or how to do single-machine multi-GPU training?

```
File "/home/shunan_zhl/anaconda3/envs/huofeng1/lib/python3.8/site-packages/mmcv/runner/dist_utils.py", line 18, in init_dist
  _init_dist_pytorch(backend, **kwargs)
File "/home/shunan_zhl/anaconda3/envs/huofeng1/lib/python3.8/site-packages/mmcv/runner/dist_utils.py", line 29, in _init_dist_pytorch
  rank = int(os.environ['RANK'])
File "/home/shunan_zhl/anaconda3/envs/huofeng1/lib/python3.8/os.py", line 675, in __getitem__
  raise KeyError(key) from None
KeyError: 'RANK'
```
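
For context: with `--launcher pytorch`, mmcv's `_init_dist_pytorch` reads the rank from the environment, and `torch.distributed.launch`/`torchrun` are what normally set those variables for each worker, so invoking `tools/train.py` with a bare `python` leaves `RANK` unset. One possible workaround (a sketch, assuming you really want a single worker process) is to provide the variables manually:

```bash
# Sketch: manually supply what torch.distributed.launch would have set
# (single-worker values; 29500 is an arbitrary free port).
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29500
export WORLD_SIZE=1
export RANK=0
export LOCAL_RANK=0
python tools/train.py configs/fastbev/exp/paper/fastbev_m1_r18_s320x880_v200x200x4_c192_d2_f4.py --launcher pytorch
```
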
ymlab commented 1 year ago

> I am also running on multiple GPUs in a single server. When I run `python tools/train.py … --launcher pytorch`, I get an error. … `KeyError: 'RANK'`

This might help: https://github.com/Sense-GVT/Fast-BEV/issues/18

hly2990 commented 1 year ago

> I am also running on multiple GPUs in a single server. When I run `python tools/train.py … --launcher pytorch`, I get an error. … `KeyError: 'RANK'`

> This might help: #18

I can run successfully with the command `CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train.py ./configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py --work-dir=./work_dirs/my/exp/ --launcher="pytorch" --gpus 4`. Maybe you can give it a try.
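
On newer PyTorch versions (roughly 1.10+), `torch.distributed.launch` is deprecated in favor of `torchrun`; an equivalent invocation would presumably be the sketch below (untested assumption; `torchrun` sets `RANK`/`LOCAL_RANK` itself):

```bash
# Sketch: torchrun equivalent of the command above (assumes PyTorch >= 1.10).
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    tools/train.py ./configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py \
    --work-dir=./work_dirs/my/exp/ --launcher pytorch
```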

Ignite616 commented 1 year ago

> I am also running on multiple GPUs in a single server. When I run `python tools/train.py … --launcher pytorch`, I get an error. … `KeyError: 'RANK'`

> This might help: #18

First of all, thank you for your reply. Referring to #18, I use the 4th and 5th GPU cards for training. Two GPUs are currently in use, but computation only happens on the 5th GPU. How can I use both GPUs for computation? (screenshot)
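
One thing worth checking here (not from the thread): with `CUDA_VISIBLE_DEVICES=4,5`, the process sees exactly two devices, renumbered as `cuda:0` and `cuda:1`, so each worker should bind to its `LOCAL_RANK` (0 or 1) rather than the physical indices 4 and 5; code that hard-codes a device index can push all computation onto one card. A quick sanity check:

```bash
# Sketch: confirm that only the two selected GPUs are visible, re-indexed 0 and 1.
CUDA_VISIBLE_DEVICES=4,5 python -c "import torch; print(torch.cuda.device_count())"  # expect: 2
```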

Ignite616 commented 1 year ago

I also hit an error in testing. I have successfully used the m1 model for visualization on a single GPU. To see the best results of the Fast-BEV model, I intended to use the m5 model for visualization, but the memory of a single GPU is not enough to support it. So, referring to mmdetection/tools/dist_test.sh, I use two GPUs for visualization; my dist_test.sh is as follows (screenshot). I run `bash tools/dist_test.sh configs/fastbev/exp/paper/fastbev_m5_r50_s512x1408_v250x250x6_c256_d6_f4.py download_model/m5_epoch_20.pth 2` and then get an error (screenshot).

How can I solve it? (24 = 4×6)
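
For reference, mmdetection's `tools/dist_test.sh` looks roughly like the sketch below; adapting it to Fast-BEV's `tools/test.py` is an assumption:

```bash
#!/usr/bin/env bash
# dist_test.sh (sketch, modeled on mmdetection's tools/dist_test.sh).
CONFIG=$1            # e.g. configs/fastbev/exp/paper/fastbev_m5_r50_s512x1408_v250x250x6_c256_d6_f4.py
CHECKPOINT=$2        # e.g. download_model/m5_epoch_20.pth
GPUS=$3              # e.g. 2
PORT=${PORT:-29500}

python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    tools/test.py $CONFIG $CHECKPOINT --launcher pytorch ${@:4}
```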

Ignite616 commented 1 year ago

> I am also running on multiple GPUs in a single server. When I run `python tools/train.py … --launcher pytorch`, I get an error. … `KeyError: 'RANK'`

> This might help: #18

> I can run successfully with the command `CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train.py … --launcher="pytorch" --gpus 4`. Maybe you can give it a try.

Thank you for the tip. I ran the command `CUDA_VISIBLE_DEVICES=4,5 python -m torch.distributed.launch --nproc_per_node=2 tools/train.py configs/fastbev/exp/paper/fastbev_m5_r50_s512x1408_v250x250x6_c256_d6_f4.py --launcher="pytorch" --gpus 2`, but I received an error:

```
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
==================================================
tools/train.py FAILED
==================================================
Root Cause:
[0]:
  time: 2023-03-05_20:42:07
  rank: 1 (local_rank: 1)
  exitcode: -11 (pid: 43458)
  error_file: <N/A>
  msg: "Signal 11 (SIGSEGV) received by PID 43458"

Other Failures:
  <NO_OTHER_FAILURES>
==================================================
```

Is this because I didn't start from GPU 0? How should I use the 4th and 5th GPUs? I don't have permission to use GPU 0.

sinsin1998 commented 8 months ago

> I am also running on multiple GPUs in a single server. When I run `python tools/train.py … --launcher pytorch`, I get an error. … `KeyError: 'RANK'`

> This might help: #18

> I can run successfully with the command `CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train.py … --launcher="pytorch" --gpus 4`. Maybe you can give it a try.

Hello, I tried to run this code on a 3090, but it is really slow, and I think there may be something wrong with the dataloader. Could you tell me what device you used and how long it took to train this model for 20 epochs?