VinAIResearch / LFM

Official PyTorch implementation of the paper: Flow Matching in Latent Space
GNU Affero General Public License v3.0

Is training stable? #3

Open kleinzcy opened 9 months ago

kleinzcy commented 9 months ago

Hi, authors:

Thanks for your work and code. I ran your code on 2 A100 GPUs, but the result I get is ~7, which is well short of the reported 5.26 on CelebA 256x256. This makes me curious about the stability of training. Do the results vary a lot across runs?

hao-pt commented 8 months ago

Training is relatively stable and consistent from one experiment to the next. It should not vary as much as your results suggest. Could you share your training hyperparameters and the details of the checkpoint you used for evaluation, such as which epoch it was saved at? One common practice is to enable the --use_ema argument during training to mitigate large oscillations in model performance.
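To illustrate why an exponential moving average (EMA) of the weights damps oscillation, here is a minimal sketch in plain Python. It is not the repository's --use_ema implementation: the function name `ema_update`, the decay value, and the toy scalar "weight" are all assumptions made for illustration only.

```python
def ema_update(ema_params, model_params, decay=0.999):
    """Blend the current weights into the running average, in place.

    Hypothetical helper: both arguments are dicts mapping a parameter
    name to a float, standing in for real model tensors.
    """
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


# Toy example: a single "weight" that jumps between 0.5 and 1.5 every step,
# mimicking a model whose raw checkpoints oscillate around a good value (1.0).
ema = {"w": 1.0}
raw_values = []
for step in range(1000):
    w = 1.0 + (0.5 if step % 2 == 0 else -0.5)
    raw_values.append(w)
    ema_update(ema, {"w": w}, decay=0.99)

print("last raw weight:", raw_values[-1])
print("EMA weight:", ema["w"])
```

The raw weight ends on one of the two extremes, while the averaged weight stays close to 1.0. Evaluating the averaged copy is what makes the result less sensitive to which epoch's checkpoint is picked.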

kleinzcy commented 8 months ago

Thanks for your reply. The script I use is as follows:

accelerate launch --main_process_port 33996 --num_processes 2 --exp celeb_f8_dit_g2 \
    --dataset celeba_256 --datadir celeba_hq/celeba-lmdb \
    --batch_size 32 --num_epoch 500 \
    --image_size 256 --f 8 --num_in_channels 4 --num_out_channels 4 \
    --nf 256 --ch_mult 1 2 3 4 --attn_resolution 16 8 4 --num_res_blocks 2 \
    --lr 2e-4 --scale_factor 0.18215 --no_lr_decay \
    --model_type DiT-L/2 --num_classes 1 --label_dropout 0. \
    --save_content --save_content_every 10 

I evaluated the checkpoints from epochs 474 and 500. I will try enabling --use_ema.
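When comparing hyperparameters with a reference run, the effective batch size is worth checking. The sketch below assumes that --batch_size is applied per process, which is the default behaviour for a dataloader prepared by Accelerate but has not been confirmed for this script. The helper name and the gradient accumulation parameter are hypothetical.

```python
def effective_batch_size(per_process_batch, num_processes, grad_accum_steps=1):
    """Samples contributing to one optimizer step.

    Assumes the batch size is per process (an assumption, not verified
    against the training script).
    """
    return per_process_batch * num_processes * grad_accum_steps


# Values taken from the command above: --batch_size 32 --num_processes 2
mine = effective_batch_size(per_process_batch=32, num_processes=2)
print("effective batch size:", mine)
```

If the reference run used a different number of GPUs with the same --batch_size, its effective batch size would differ, and the learning rate may need adjusting accordingly.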