Closed LZ-CH closed 3 months ago
We have not tried mixed-precision training with our code. Mixed precision typically makes training less stable, so the hyperparameters usually need to be re-tuned. You could try reducing the learning rate and adjusting other related code to make training more stable.
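For anyone trying this, here is a rough sketch of what such an adjustment might look like in an MMEngine-style config. It assumes the training script follows the usual MMEngine convention, where `--amp` switches the optimizer wrapper to `AmpOptimWrapper` with dynamic loss scaling. The optimizer type, `base_lr`, `lr_scale`, weight decay and clipping threshold are illustrative placeholders, not values taken from the EmbodiedScan config, and this has not been verified to remove the NaN loss.

```python
# Sketch of an optimizer-wrapper override for mixed-precision training.
# ASSUMPTION: the project uses MMEngine-style configs (plain Python dicts).
# All numeric values below are hypothetical placeholders for illustration.

base_lr = 5e-4   # hypothetical original learning rate
lr_scale = 0.1   # hypothetical reduction factor to try first

optim_wrapper = dict(
    # MMEngine's AMP wrapper; this is what `--amp` normally selects.
    type='AmpOptimWrapper',
    # Dynamic loss scaling lowers the scale automatically after an overflow.
    loss_scale='dynamic',
    optimizer=dict(
        type='AdamW',
        lr=base_lr * lr_scale,  # reduced learning rate, as suggested above
        weight_decay=5e-4,
    ),
    # Gradient clipping limits the update size when gradients spike.
    clip_grad=dict(max_norm=10, norm_type=2),
)
```

If the loss still becomes NaN with a lower learning rate, the cause may be an operation that is not numerically safe in half precision rather than the optimizer settings.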
Closing due to inactivity. Please feel free to reopen this issue if you have any further questions.
Prerequisite
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmdetection3d
Environment
python 3.9 torch 1.11.0 torchaudio 0.11.0 torchvision 0.12.0
Reproduces the problem - code sample
From EmbodiedScan/embodiedscan/structures/bbox_3d/euler_box3d.py:line 108:
Reproduces the problem - command or script
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train.py configs/grounding/mv-grounding_8xb12_embodiedscan-vg-9dof.py --work-dir=work_dirs/mv-grounding --launcher="pytorch" --amp
Reproduces the problem - error message
Additional information
When I apply the following workaround, the code runs without the error, but the loss becomes NaN during training.
log: