drboog / Lafite

Code for paper LAFITE: Towards Language-Free Training for Text-to-Image Generation (CVPR 2022)
MIT License

Reproducibility with ground-truth pairs #4

Closed jnhwkim closed 2 years ago

jnhwkim commented 2 years ago

I am trying to reproduce "Training with ground-truth pairs" using the command line:

python train.py --gpus=4 --outdir=./outputs/ --temp=0.5 --itd=10 --itc=20 --gamma=10 --data=./datasets/COCO2014_train_CLIP_ViTB32.zip --test_data=./datasets/COCO2014_val_CLIP_ViTB32.zip --mixing_prob=0.0

with the downloaded CLIP features. I didn't touch any hyperparameter. I resumed the training with --resume path/to/pkl.

Unfortunately, I had to resume training several times, and I notice the run is unlikely to reach FID = 8.6 (the pretrained model's score) by 25,000 kimg. (I checked the pretrained model itself, and it is fine.)

- 3,024 kimg: FID 17.425 (resumed at this point)
- 3,024 (previously) + 3,024 (resumed log) = 6,048 kimg: FID 14.27
- 3,024 (previously) + 6,048 (resumed log) = 9,072 kimg: FID 12.65
- 3,024 (previously) + 9,072 (resumed log) = 12,096 kimg: FID 12.15
- 3,024 (previously) + 14,515 (resumed log) = 17,539 kimg: FID 11.96 (resumed at this point)
- 17,539 (previous total) + 0 (resumed log) = 17,539 kimg: FID 11.93 (confirmed correctly resumed)
- 17,539 (previous total) + 8,064 (resumed log) = 25,603 kimg: FID 11.55

So I am wondering whether the resuming mechanism may hurt optimization integrity, for example by resetting the optimizer states. Do you have any suggestions for this situation?

By the way, if you could share your training log, that would be helpful.
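For context, here is a toy, purely illustrative sketch of the concern being raised (all names are hypothetical; this is not Lafite's actual checkpoint code): a resume that restores only network weights, but not the optimizer's running moments, perturbs optimization after each restart.

```python
# Hypothetical sketch (not Lafite's code) of why a resume can differ from
# uninterrupted training: if the checkpoint restores only the weights,
# Adam-style running moments restart from zero after every resume.
def save_checkpoint(model_state, opt_state, kimg):
    # A complete checkpoint carries optimizer state alongside the weights.
    return {"model": dict(model_state), "opt": dict(opt_state), "kimg": kimg}

def resume_full(ckpt):
    # Restoring both keeps optimization exactly where it left off.
    return ckpt["model"], ckpt["opt"], ckpt["kimg"]

def resume_weights_only(ckpt):
    # Restoring weights only resets the moments -> a transient after restart.
    return ckpt["model"], {"m": 0.0, "v": 0.0}, ckpt["kimg"]

ckpt = save_checkpoint({"w": 1.0}, {"m": 0.5, "v": 0.25}, 3024)
_, opt_full, _ = resume_full(ckpt)
_, opt_reset, _ = resume_weights_only(ckpt)
```

If the FID curves after a resume briefly stall or regress, a weights-only restore of this kind would be one plausible explanation.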

drboog commented 2 years ago

That's interesting; I also got FID around 17 at 3,000 kimg. But I don't think resumed training is harmful, because I have also run experiments with the resuming mechanism (see the logs below, which are not from the pre-trained model we provided but show similar results).

I first trained the model for around 9,000 kimg (4 GPUs; fid50k_full per snapshot):

- network-snapshot-000000: 292.73
- network-snapshot-001008: 33.97
- network-snapshot-002016: 20.74
- network-snapshot-003024: 16.80
- network-snapshot-004032: 14.67
- network-snapshot-005040: 13.42
- network-snapshot-006048: 12.30
- network-snapshot-007056: 11.63
- network-snapshot-008064: 11.20
- network-snapshot-009072: 10.90

Then I resumed training for around 1,000 kimg (8 GPUs):

- network-snapshot-001008: 10.36

Then I resumed training for another 16,000 kimg (8 GPUs):

- network-snapshot-000000: 10.19
- network-snapshot-002048: 9.71
- network-snapshot-004096: 9.30
- network-snapshot-006144: 9.10
- network-snapshot-008192: 8.96
- network-snapshot-010240: 8.64
- network-snapshot-012288: 8.69
- network-snapshot-014336: 8.79
- network-snapshot-016384: 8.31

drboog commented 2 years ago

Are you using the revised contrastive loss, i.e. computing it across multi-GPUs?
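For readers of this thread, a minimal sketch of what computing the contrastive loss "across multi-GPUs" refers to (illustrative only; function names are mine, and a real implementation would use torch.distributed.all_gather). Gathering features from all ranks before forming the similarity matrix enlarges the negative pool from the per-GPU batch to the global batch.

```python
import numpy as np

# Hypothetical sketch: each rank holds a local batch of image/text features;
# gathering them first means every sample sees global-batch-minus-one
# negatives instead of local-batch-minus-one.
def contrastive_logits(img_feats, txt_feats, temp=0.5):
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    return img @ txt.T / temp  # (N, N) temperature-scaled cosine similarities

# Simulate 2 "GPUs" holding 2 samples each; all_gather = concatenation here.
rank0 = np.random.RandomState(0).randn(2, 4)
rank1 = np.random.RandomState(1).randn(2, 4)
gathered = np.concatenate([rank0, rank1])  # global batch of 4
logits = contrastive_logits(gathered, gathered)
```

With 4 GPUs, whether the loss sees per-GPU or gathered similarities changes the effective number of negatives, which can noticeably shift results for the same hyper-parameters.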

jnhwkim commented 2 years ago

@drboog In this reproduction check, I didn't touch any code. I just cloned it (36a717a1501f5511d0250077a56002f0937b184a) and ran it.

drboog commented 2 years ago

That's weird. I believe the code on GitHub is consistent with the code on my local machine, although I have added some new functions locally.

drboog commented 2 years ago

Can you try other hyper-parameter settings, e.g. itd=10, itc=10? Meanwhile, I will clone this version from GitHub to my local machine to see whether I can find any problems.

drboog commented 2 years ago

@jnhwkim

https://github.com/drboog/Lafite/blob/180dd9b7c0e876a964e5cc7dbf4d8183d0c08b4d/training/loss.py#L149

I think lam=0., temp=0.5, itc=10 were tuned based on a previous, incorrect implementation, in which I applied an exponent by mistake (see the link). I believe the correct implementation should lead to better results; however, you would have to re-tune all the hyper-parameters.
If you only want to reproduce the reported results, you can use the incorrect implementation and try

python train.py --gpus=4 --outdir=./outputs/ --temp=0.5 --itd=5 --itc=10 --gamma=10 --mirror=1 --data=./datasets/COCO2014_train_CLIP_ViTB32.zip --test_data=./datasets/COCO2014_val_CLIP_ViTB32.zip --mixing_prob=0.0

or

python train.py --gpus=4 --outdir=./outputs/ --temp=0.5 --itd=10 --itc=10 --gamma=10 --mirror=1 --data=./datasets/COCO2014_train_CLIP_ViTB32.zip --test_data=./datasets/COCO2014_val_CLIP_ViTB32.zip --mixing_prob=0.0

Please let me know if this works.
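To make the difference between the two variants concrete, here is a minimal stdlib sketch (function names are mine; in loss.py the operation is on a torch tensor). The point is that the released hyper-parameters were tuned against the exponentiated similarity, which sharpens the gap between similar and dissimilar pairs non-linearly, so a plain linear rescaling needs a re-tuned temperature.

```python
import math

# Hypothetical sketch of the two similarity scalings discussed above.
def sim_incorrect(sim, temp=0.5):
    # the "incorrect" version: sim = torch.exp(sim/temp) in loss.py
    return math.exp(sim / temp)

def sim_correct(sim, temp=0.5):
    # the corrected version: sim = sim/temp
    return sim / temp

# The exponent amplifies the contrast between two similarity values
# more strongly than the linear scaling does:
ratio_exp = sim_incorrect(1.0) / sim_incorrect(0.5)   # e^2 / e^1 = e
ratio_lin = sim_correct(1.0) / sim_correct(0.5)       # 2.0 / 1.0 = 2
```

This is why temp=0.5 tuned for the exponent version is not expected to transfer directly to the linear version.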

drboog commented 2 years ago

I trained a model from scratch with this implementation and reached an FID of 12 at 7,000 kimg, so I think it should work.

jnhwkim commented 2 years ago

@drboog Ouch! I think I missed that bug when I checked the code.

drboog commented 2 years ago

I don't think you missed the bug; I had corrected the code when I uploaded it, so the code on GitHub read "sim = sim/temp" until today. But since the incorrect version reproduces the results more easily, I reverted it and added comments today. In summary, if you only want to reproduce the results, I suggest using "sim = torch.exp(sim/temp)" with temp=0.5, itd=5 (or 10), itc=10.

The temp=0.5, itd=10, itc=20 setting comes from a different experiment I am currently running, in which I used the correct implementation (sim = sim/temp). I thought it would work well here too; however, those do not seem to be the right hyper-parameters for training from scratch. With the correct version, I think you have to tune the temperature rather than reuse temp=0.5.

The mirror option does not influence the CLIP features; I think it just shows more samples to the discriminator. But anyway, it may not be very important here.

jnhwkim commented 2 years ago

@drboog What do you mean by "it can show more samples to the discriminator"? My understanding is that the --mirror option does nothing in this implementation when we use the CLIP features.

Do you have a plan to provide the updated hyperparameters for the correct implementation?

drboog commented 2 years ago

Yes, mirror has nothing to do with the CLIP features. But we will feed flipped images to the discriminator, even though we only have features computed before the flip. Let X, T be the image and text; we feed both (X, X_feature, T_feature) and (flipped X, X_feature, T_feature). Whether (flipped X, X_feature, T_feature) is helpful or harmful, I'm not sure.
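The behaviour described above can be sketched as follows (illustrative only; names are mine, not the repo's): the discriminator may receive a horizontally flipped image while the cached CLIP features, computed on the unflipped image, are reused as-is.

```python
import numpy as np

# Hypothetical sketch of the --mirror behaviour: flip the image, keep the
# pre-computed features untouched (they are never recomputed on the flip).
def mirror_sample(image, img_feature, txt_feature, flip):
    x = image[:, ::-1] if flip else image  # horizontal flip of an H x W image
    return x, img_feature, txt_feature     # features stay paired with the flip

img = np.arange(6).reshape(2, 3)  # stand-in "image"
flipped, f_img, f_txt = mirror_sample(img, "X_feature", "T_feature", flip=True)
```

So the discriminator sees twice as many image variants, but half of them carry features that describe the unflipped view.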

Currently I don't have enough time or resources to re-run all the training-from-scratch experiments with the correct implementation.

jnhwkim commented 2 years ago

I confirmed that the current 510ba69 commit reproduces the FID score for the standard setting. At 26,610 kimg I got 8.45, and an additional 8,265 kimg brought it down to 8.05.

I suggest updating your paper to mention the exponential sharpening in the contrastive losses. Without it, hyper-parameter tuning is not straightforward and the results are hard to reproduce.

Anyway, I consider this solved, so I am closing the issue.

drboog commented 2 years ago

Glad to hear that you have reproduced it.