AndyCao1125 / SDDPM

[WACV 2024] Spiking Denoising Diffusion Probabilistic Models

Training parameters and time & cifar10.train.npz #4

Closed parkjjoe closed 2 months ago

parkjjoe commented 6 months ago

Dear Authors,

Thank you for your code.

  1. I would like to reproduce the results presented in your paper. What parameters were used to train SDDPM_CIFAR10.pt, e.g., batch_size, learning rate (lr), and total_steps? I would also like to know how long it took to train SDDPM_CIFAR10.pt. Is it really true that it took 32 days on an A100?

  2. I ran the following command with 2 GPUs, but training stalled at 0% and no time estimate was displayed:

     CUDA_VISIBLE_DEVICES=2,3 python main_SDDPM.py \
       --train \
       --dataset='cifar10' \
       --beta_1=1e-4 --beta_T=0.02 \
       --img_size=32 --timestep=4 --img_ch=3 \
       --parallel=True --sample_step=0 \
       --total_steps=50001 \
       --logdir='./logs' \
       --wandb

     The output looks like this:

    Training from scratch
    epsilon
    Model params: 63.61 M
    0%|                          | 0/50001 [00:00<?, ?it/s]

Are there other ways to use GPUs in parallel?

  3. How can I get cifar10.train.npz? I would also like to create cifar10.test.npz with the same preprocessing method.
parkjjoe commented 6 months ago

I think I solved the second problem by setting os.environ["NCCL_P2P_DISABLE"] = "1".
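
For anyone hitting the same stall, a minimal sketch of where this workaround goes (the environment variable must be set before any CUDA/NCCL context is created, i.e. before torch initializes the GPUs):

    import os

    # Disable NCCL peer-to-peer transfers; this works around multi-GPU
    # hangs on systems where P2P communication misbehaves. It must run
    # before any CUDA/NCCL initialization.
    os.environ["NCCL_P2P_DISABLE"] = "1"

    import torch  # safe to import and use CUDA after the variable is set

The same effect can be had from the shell by prefixing the training command with NCCL_P2P_DISABLE=1.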

AndyCao1125 commented 5 months ago

Thanks for your questions.

what parameters were used to train the SDDPM_CIFAR10.pt, and how long did the training take?

We trained SDDPM_CIFAR10.pt using the default settings specified in main.sh. The total training cost was approximately 32 A100-days: we spent about 3 to 4 days of wall-clock time training this model on 8 A100 40GB GPUs. To potentially speed up the process, you might consider using Hugging Face Accelerate, as sketched below.
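A minimal sketch of a multi-GPU training loop under Accelerate (the toy model, data, and hyperparameters are placeholders, not this repo's actual spiking UNet or settings); launch it across GPUs with accelerate launch script.py:

    import torch
    from torch import nn, optim
    from torch.utils.data import DataLoader, TensorDataset
    from accelerate import Accelerator

    # Accelerator handles device placement and multi-GPU data parallelism.
    accelerator = Accelerator()

    # Placeholder denoiser and random data standing in for the spiking
    # UNet and the CIFAR-10 loader.
    model = nn.Linear(32 * 32 * 3, 32 * 32 * 3)
    optimizer = optim.Adam(model.parameters(), lr=2e-4)
    loader = DataLoader(TensorDataset(torch.randn(64, 32 * 32 * 3)), batch_size=16)

    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    for (x,) in loader:
        optimizer.zero_grad()
        loss = ((model(x) - x) ** 2).mean()  # placeholder objective
        accelerator.backward(loss)
        optimizer.step()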

How can I get cifar10.train.npz?

This file contains pre-calculated FID statistics for the CIFAR10 dataset, originally sourced from pytorch-ddpm. If you're interested in generating the statistics for any dataset, the clean-fid library is a recommended tool; see the sketch below.
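
A minimal sketch with clean-fid (the folder paths are hypothetical, and clean-fid caches its own statistics rather than writing a pytorch-ddpm-style .npz file):

    from cleanfid import fid

    # Pre-compute Inception statistics for a folder of images, e.g. the
    # CIFAR-10 test split exported as PNG files (path is hypothetical).
    fid.make_custom_stats("cifar10-test", "path/to/cifar10_test_pngs", mode="clean")

    # Score a folder of generated samples against the cached statistics.
    score = fid.compute_fid(
        "path/to/generated_samples",
        dataset_name="cifar10-test",
        mode="clean",
        dataset_split="custom",
    )
    print(score)

Note that clean-fid also ships pre-computed CIFAR10 statistics (dataset_name="cifar10", dataset_split="train" or "test"), so for CIFAR10 itself the custom step may be unnecessary.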