MedARC-AI / MindEyeV2


some problems when running the code with multi-gpus #31

Open zzzt9 opened 2 months ago

zzzt9 commented 2 months ago

I configured accelerate to use DeepSpeed stage 2 (with CPU offloading) via "accelerate config", then launched training with:

accelerate launch --config_file /root/.cache/huggingface/accelerate/default_config.yaml --debug Train.py --model_name=final_multisubject_subj01 --subj=1 --max_lr=3e-4 --mixup_pct=.33 --num_epochs=150 --use_prior --prior_scale=30 --clip_scale=1 --blurry_recon --blur_scale=.5 --n_blocks=4 --hidden_dim=4096 --num_sessions=40 --batch_size=21

However, training hangs at accelerator.backward(loss) without printing any logs.
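For reference, here is a minimal sketch (a toy model and dataset I made up, not the repo's actual Train.py) of how I understand the training step to be structured, just to show where accelerator.backward(loss) sits in the loop. With DeepSpeed ZeRO stage 2 plus CPU offloading, a silent hang on this call looks to me like a collective-communication deadlock, e.g. ranks taking different code paths before the backward:

```python
# Minimal sketch only: toy model and data, not MindEyeV2's Train.py.
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the DeepSpeed settings written by `accelerate config`

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
dataset = torch.utils.data.TensorDataset(torch.randn(84, 16), torch.randn(84, 1))
train_dl = torch.utils.data.DataLoader(dataset, batch_size=21)

model, optimizer, train_dl = accelerator.prepare(model, optimizer, train_dl)

for x, y in train_dl:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # the hang I see happens on this call
    optimizer.step()
```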

In addition, when I tried to train a multi-subject model, another issue appeared; the traceback is below:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/root/autodl-tmp/MindEyeV2/src/Train.py", line 736, in <module>
[rank1]:     for behav0, past_behav0, future_behav0, old_behav0 in train_dl: 
[rank1]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/lib/python3.12/site-packages/accelerate/data_loader.py", line 687, in __iter__
[rank1]:     batch = broadcast(batch, from_process=0)
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/lib/python3.12/site-packages/accelerate/utils/operations.py", line 392, in wrapper
[rank1]:     raise DistributedOperationException(
[rank1]: accelerate.utils.operations.DistributedOperationException: Cannot apply desired operation due to shape mismatches. All shapes across devices must be valid.

[rank1]: Operation: `accelerate.utils.operations.broadcast`
[rank1]: Input shapes:
[rank1]:   - Process 0: TensorInformation(shape=torch.Size([6, 1, 17]), dtype=torch.float64)
[rank1]:   - Process 1: TensorInformation(shape=torch.Size([6, 15, 17]), dtype=torch.float64)
[rank1]:   - Process 2: TensorInformation(shape=torch.Size([6, 15, 17]), dtype=torch.float64)
[rank1]:   - Process 3: TensorInformation(shape=torch.Size([6, 3, 17]), dtype=torch.float64)
[rank1]:   - Process 4: [[6, 1, 17], [6, 15, 17], [6, 15, 17], [6, 3, 17]]
W0819 21:27:29.179000 140245539309376 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 11424 closing signal SIGTERM
E0819 21:27:32.547000 140245539309376 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 11425) of binary: /root/miniconda3/bin/python
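My understanding (an assumption on my part, not something from the repo) is that this traceback comes from Accelerate's batch dispatching, which broadcasts each batch from process 0 and therefore needs every rank to see tensors of identical shape. A hedged sketch of two possible workarounds, using placeholder data rather than the real behav tensors:

```python
# Hedged sketch with placeholder data, not the repo's actual dataloader setup.
import torch
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration  # available in recent accelerate versions

# Workaround 1: make every batch the same shape by dropping the ragged remainder.
dataset = torch.utils.data.TensorDataset(torch.randn(100, 17))
train_dl = torch.utils.data.DataLoader(dataset, batch_size=21, drop_last=True)

# Workaround 2: disable dispatch_batches so each rank iterates its own shard
# instead of having batches broadcast from process 0. Older accelerate versions
# take dispatch_batches=False directly in the Accelerator constructor.
accelerator = Accelerator(
    dataloader_config=DataLoaderConfiguration(dispatch_batches=False)
)
train_dl = accelerator.prepare(train_dl)
```

I am not sure which of these (if either) is appropriate for the MindEyeV2 data pipeline, so please correct me if the intended fix is different.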

I wonder whether you ran the Python file converted directly from Train.ipynb. Have you encountered the same issues? Hoping for your reply.