Train on FashionTryOn

Yuichi-Sasaki commented 1 year ago

DGX4 ~/projects/fashion_diffusion/diffusers/examples/fashion_to_image 492d4ef

GPU0: train the VisionEncoder

python train_fashion_to_image.py --train_data_dir=/shared/datasets/datasets/fashion/FashionTryOn/v2.0 --use_ema --resolution=224 --center_crop --gradient_accumulation_steps=1 --num_train_epochs=1000 --learning_rate=1e-05 --max_grad_norm=1 --lr_scheduler="constant" --lr_warmup_steps=0 --train_batch_size=48 --train_vision_encoder --output_dir="output/2022-12-28_res224_batch48_lr1em5_trainvisionencoder"

GPU1: do not train the VisionEncoder

python train_fashion_to_image.py --train_data_dir=/shared/datasets/datasets/fashion/FashionTryOn/v2.0 --use_ema --resolution=224 --center_crop --gradient_accumulation_steps=1 --num_train_epochs=1000 --learning_rate=1e-05 --max_grad_norm=1 --lr_scheduler="constant" --lr_warmup_steps=0 --train_batch_size=48 --output_dir="output/2022-12-28_res224_batch48_lr1em5"

Tensorboard: http://10.0.0.34:6006/?darkMode=true#images

Yuichi-Sasaki commented 1 year ago

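I noticed that in the runs above, the evaluation images were picked from the training data. In the worst case, that lets the model ignore the seed and reproduce the ground-truth image using only the input vision features as a cue. So I switched the evaluation images to Shimamura product images plus images sampled from the test data, and restarted training. All other parameters are unchanged.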

DGX3 ~/projects/fashion_diffusion/diffusers/examples/fashion_to_image https://github.com/Yuichi-Sasaki/fasion_to_image/commit/4c8b780ca1bbc0b3cde5d092822f5630e20512bb

GPU0: train the VisionEncoder

python train_fashion_to_image.py --train_data_dir=/shared/datasets/datasets/fashion/FashionTryOn/v2.0 --use_ema --resolution=224 --center_crop --gradient_accumulation_steps=1 --num_train_epochs=1000 --learning_rate=1e-05 --max_grad_norm=1 --lr_scheduler="constant" --lr_warmup_steps=0 --train_batch_size=48 --train_vision_encoder --output_dir="output/2022-12-29_res224_batch48_lr1em5_trainvisionencoder"

GPU1: do not train the VisionEncoder

python train_fashion_to_image.py --train_data_dir=/shared/datasets/datasets/fashion/FashionTryOn/v2.0 --use_ema --resolution=224 --center_crop --gradient_accumulation_steps=1 --num_train_epochs=1000 --learning_rate=1e-05 --max_grad_norm=1 --lr_scheduler="constant" --lr_warmup_steps=0 --train_batch_size=48 --output_dir="output/2022-12-29_res224_batch48_lr1em5"

Tensorboard: http://10.0.0.33:6006/?darkMode=true#images

Yuichi-Sasaki commented 1 year ago

The DGX4 run has clearly lost diversity, which shows it is only reflecting the training data. So I decided to use the DGX3 evaluation data from here on.

The attached image is from epoch 190 (the run with the VisionEncoder frozen).

Yuichi-Sasaki commented 1 year ago

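Also, even on DGX3, the test data is contaminated with train data (this comes from the original dataset itself), so in the end only the results on the Shimamura domain can be trusted.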
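One way to estimate how much of the test split leaks from train is to compare file contents by hash. The following is a minimal sketch, not the check used in this issue: the helper names and the directory layout are hypothetical, and it only catches byte-identical files, so re-encoded or resized copies of the same photo would go undetected.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_contaminated(train_dir: Path, test_dir: Path) -> list:
    """List test files whose bytes exactly match some training file.

    Hypothetical helper: assumes each split lives under its own directory.
    """
    train_hashes = {
        file_digest(p) for p in train_dir.rglob("*") if p.is_file()
    }
    return sorted(
        p
        for p in test_dir.rglob("*")
        if p.is_file() and file_digest(p) in train_hashes
    )
```

Calling `find_contaminated(Path("train"), Path("test"))` returns the overlapping test files; an empty list only rules out exact duplicates, not near-duplicates.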

Yuichi-Sasaki commented 1 year ago

As described at https://neuralpocket.atlassian.net/wiki/spaces/NP/pages/374439937/Diffusion+Model+VITON#2023-01-04%E6%99%82%E7%82%B9%E3%81%AE%E7%B5%8C%E9%81%8E , performance on out-of-domain inputs is poor.

To address this, I made two changes:

Expanded the training data
Applied augmentation to the flat-lay (hiraoki) garment images

(fashion_diffusion) y_sasaki@DGX0004:~/projects/fashion_diffusion/diffusers/examples/fashion_to_image$ python train_fashion_to_image.py --train_data_dir /shared/datasets/datasets/fashion/FashionTryOn/v2.0 /shared/datasets/datasets/fashion/lookbook/v1.0 /shared/datasets/datasets/fashion/MPV/v1.0 /shared/datasets/datasets/fashion/VITON-VR/v1.0 --use_ema --resolution=224 --gradient_accumulation_steps=1 --num_train_epochs=1000 --learning_rate=1e-05 --max_grad_norm=1 --lr_scheduler="constant" --lr_warmup_steps=0 --train_batch_size=48 --output_dir="output/2023-01-04_res224_batch48_lr1em5_trainvisionencoder" --random_aug_hiraoki

The run was launched with the command above.

Judging by eye, training the vision encoder or not makes little difference to performance, so I will keep the vision encoder frozen.

The number of images went from 42394 to 89711.

Yuichi-Sasaki commented 1 year ago

After launching the above, I noticed a bug and stopped the run. The direct fix is here: https://github.com/Yuichi-Sasaki/fasion_to_image/commit/a62c4a119f7b1a02bf7f91022f8fafb5f6e7eacf#diff-bf7007eec38142ad895cd6469da64a77161d08c087d33abb60b2181f4238591eL489

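In other words, the unet was being trained but the converter was not. The information coming from the vision encoder side was presumably close to random. At the very least, the latter half of the unet was presumably being relearned from scratch.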

As part of the fix, I took the opportunity to make it possible to set, for every module, whether it is trained.
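A per-module switch like this could be sketched as follows. The flag names `--train_unet`, `--train_converter`, `--train_vision_encoder` and `--train_data_dir` are the ones that appear in the commands in this thread; everything else (the use of `argparse`, the `MODULE_NAMES` tuple, the `trainable_modules` helper) is an assumption for illustration, not the code in the repository.

```python
import argparse

# Module names taken from the flags used in this thread.
MODULE_NAMES = ("unet", "converter", "vision_encoder")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # nargs="+" lets --train_data_dir accept one or more dataset roots.
    parser.add_argument("--train_data_dir", nargs="+", required=True)
    for name in MODULE_NAMES:
        # Each module stays frozen unless its flag is passed.
        parser.add_argument(f"--train_{name}", action="store_true")
    return parser.parse_args(argv)


def trainable_modules(args):
    """Return the names of the modules whose training flag was set."""
    return [name for name in MODULE_NAMES if getattr(args, f"train_{name}")]
```

In a PyTorch training script, the returned names would typically drive both `module.requires_grad_(...)` on each module and the list of parameters handed to the optimizer. Deriving both from the same list avoids the situation where a module is meant to be trained but its parameters never reach the optimizer.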

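If the fix is correct, then in theory training only the converter should be enough to reproduce the look of the garment to some extent, so I reran under that condition: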

python train_fashion_to_image.py --train_data_dir /shared/datasets/datasets/fashion/FashionTryOn/v2.0 --use_ema --resolution=224 --gradient_accumulation_steps=1 --num_train_epochs=1000 --learning_rate=1e-05 --max_grad_norm=1 --lr_scheduler="constant" --lr_warmup_steps=0 --train_batch_size=48 --output_dir="output/2023-01-04_res224_batch48_lr1em5_train_converter" --train_converter

python train_fashion_to_image.py --train_data_dir /shared/datasets/datasets/fashion/FashionTryOn/v2.0 /shared/datasets/datasets/fashion/lookbook/v1.0 /shared/datasets/datasets/fashion/MPV/v1.0 /shared/datasets/datasets/fashion/VITON-VR/v1.0 --use_ema --resolution=224 --gradient_accumulation_steps=1 --num_train_epochs=1000 --learning_rate=1e-05 --max_grad_norm=1 --lr_scheduler="constant" --lr_warmup_steps=0 --train_batch_size=48 --output_dir="output/2023-01-04_res224_batch48_lr1em5_data_4_train_converter" --train_converter

Yuichi-Sasaki commented 1 year ago

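Attaching the result of the first (upper) run above at 1 epoch. It now recognizes that the input is "women's clothing", but this appears to be the limit of what can be extracted from the VisionEncoder alone.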

Which module should be unfrozen next?

unet
vision_encoder

I launched both at the same time:

python train_fashion_to_image.py --train_data_dir /shared/datasets/datasets/fashion/FashionTryOn/v2.0 --use_ema --resolution=224 --gradient_accumulation_steps=1 --num_train_epochs=1000 --learning_rate=1e-05 --max_grad_norm=1 --lr_scheduler="constant" --lr_warmup_steps=0 --train_batch_size=48 --output_dir="output/2023-01-04_res224_batch48_lr1em5_data_1_train_converter_unet" --train_converter --train_unet

python train_fashion_to_image.py --train_data_dir /shared/datasets/datasets/fashion/FashionTryOn/v2.0 --use_ema --resolution=224 --gradient_accumulation_steps=1 --num_train_epochs=1000 --learning_rate=1e-05 --max_grad_norm=1 --lr_scheduler="constant" --lr_warmup_steps=0 --train_batch_size=48 --output_dir="output/2023-01-04_res224_batch48_lr1em5_data_1_train_converter_visionencoder" --train_converter --train_vision_encoder

Yuichi-Sasaki commented 1 year ago

https://openreview.net/pdf?id=0J6afk9DqrR On DGX3, I have implemented the attention-only fine-tuning from the paper above and am running it.
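As a rough illustration of what attention-only fine-tuning involves, the sketch below splits named parameters into an attention group (to be trained) and everything else (to be frozen). It is not the implementation running on DGX3: the `select_attention_params` helper is hypothetical, and the `.attn1.` / `.attn2.` name markers are an assumption about how the attention layers are named in the unet.

```python
def select_attention_params(named_params, markers=(".attn1.", ".attn2.")):
    """Split (name, param) pairs into attention and non-attention groups.

    `markers` are substrings assumed to identify attention layers by name.
    """
    trainable, frozen = [], []
    for name, param in named_params:
        if any(marker in name for marker in markers):
            trainable.append((name, param))
        else:
            frozen.append((name, param))
    return trainable, frozen


# Stand-in parameter names; None takes the place of real tensors.
example = [
    ("down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight", None),
    ("down_blocks.0.resnets.0.conv1.weight", None),
    ("mid_block.attentions.0.transformer_blocks.0.attn2.to_k.weight", None),
]
attention, rest = select_attention_params(example)
```

With a real model, `named_params` would be `unet.named_parameters()`; the `rest` group would get `requires_grad_(False)` and only the `attention` group would be passed to the optimizer.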

Yuichi-Sasaki / fasion_to_image

Train on FashionTryOn #4