Open shencuifeng opened 1 year ago
I start training with the command `python main.py --base configs/autoencoder/vqmodel1.yaml -t --gpus 4,5`. Everything seems to run fine and the steps per epoch are halved, but only one GPU is actually in use and only one process is started. How can I solve this problem?
Have you found a solution yet? I am facing the same issue.
You can run it like this:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m torch.distributed.launch --nproc_per_node=6 main.py --base your_config.yaml -t --gpus 0,1,2,3,4,5
`--nproc_per_node=x`, where x is the number of GPUs.
Hello, I tried to use multiple GPUs but got the following error: "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 3 has a total capacty of 47.54 GiB of which 86.00 MiB is free. Process 75509 has 3.68 GiB memory in use. Process 10579 has 4.12 GiB memory in use. Process 37389 has 35.22 GiB memory in use. Including non-PyTorch memory, this process has 4.43 GiB memory in use. Of the allocated memory 3.60 GiB is allocated by PyTorch, and 57.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF" However, when I used only 1 GPU it worked fine. Here is my command: `CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --base models/ldm/fastMRI/config.yaml -t --gpus 0,1,2,3,`
@Jiayue2000 like @newbie2niubility said, you should replace `python main.py ...` with `python -m torch.distributed.launch --nproc_per_node=N_GPU main.py ...`. You can also use `torchrun --nproc-per-node=N_GPU main.py ...` as in https://pytorch.org/docs/stable/elastic/run.html
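For example, with the config from your command above (paths and GPU count are just illustrative), a full launch could look like this; the allocator setting only follows the fragmentation hint printed in your OOM message and is not a guaranteed fix:

```shell
# Optional, per the hint in the OOM message about max_split_size_mb.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# 4 visible GPUs; --nproc-per-node must match the number of GPUs passed to --gpus.
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc-per-node=4 main.py \
    --base models/ldm/fastMRI/config.yaml -t --gpus 0,1,2,3,
```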
It works! Thank you very much! Meanwhile, I still have a question about fine-tuning the LDM. I referred to other people's work, which adds the line `model.load_state_dict(torch.load('models/ldm/fastMRI/new_model.ckpt')['state_dict'])` below `model = instantiate_from_config(config.model)`, because when I tried to use --resume I got the following error: "Trying to restore training state but checkpoint contains only the model". Did you have the same problem?
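Concretely, the change I tried looks roughly like this (the checkpoint and config paths are from my setup; `strict=False` is only there to surface mismatched keys instead of failing):

```python
import torch
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config  # helper used by main.py in this repo

# Build the model from the training config, then load only the weights
# (the checkpoint has no optimizer/trainer state, so --resume fails on it).
config = OmegaConf.load('models/ldm/fastMRI/config.yaml')
model = instantiate_from_config(config.model)
state = torch.load('models/ldm/fastMRI/new_model.ckpt', map_location='cpu')
missing, unexpected = model.load_state_dict(state['state_dict'], strict=False)
print(f'missing keys: {len(missing)}, unexpected keys: {len(unexpected)}')
```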
@Jiayue2000 When resuming training from a previous checkpoint, you generally do not need to add an extra line to load the model weights. Following pytorch-lightning's documentation (https://pytorch-lightning.readthedocs.io/en/0.8.5/trainer.html), it should be straightforward to load any checkpoint trained with this framework by appending the `--resume_from_checkpoint <ckpt>` argument.
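For example (the checkpoint path is hypothetical; point it at the `last.ckpt` your previous run wrote under its log directory):

```shell
# <run_name> is a placeholder for the log directory of the earlier run.
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py \
    --base models/ldm/fastMRI/config.yaml -t --gpus 0,1,2,3, \
    --resume_from_checkpoint logs/<run_name>/checkpoints/last.ckpt
```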
I understand! I really appreciate your help! When I switched to my own dataset, I found that the progressive outputs in the logged image files do not look like my data. In particular, "samples_gs" and "samples_x0" are natural images, while my data is MRI medical images. Is that normal at the beginning of fine-tuning? Because I want to fine-tune the LDM for medical image reconstruction, I tried to fine-tune only the LDM while keeping the pre-trained autoencoder as the first stage. I don't know if that is feasible.
@Jiayue2000 I don't think this problem is related to the LDM model itself but to your training/fine-tuning setup, so maybe we should not continue this discussion in this thread. From my understanding, the actual image generation depends largely on the diffusion network rather than the autoencoder. I would double-check your dataset setup, and note that LDM is trained on 2D images and thus does not naturally support MRI data if it comes in its original 3D format.
I think if you have enough data you should train from scratch; fine-tuning may not work well.