bcmi / DCI-VTON-Virtual-Try-On

[ACM Multimedia 2023] Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow.
https://arxiv.org/abs/2308.06101
MIT License

Inconsistency between Inference Results and Paper Examples, and Some Failed Cases on VITON-HD #21

Closed wenhao728 closed 10 months ago

wenhao728 commented 11 months ago

Thank you for your contribution once again. We have been working on a VTON (Virtual Try-On) project and ran your inference code on the VITON-HD unpaired test set. However, we noticed some inconsistencies between the inference results and the examples shown in the paper.

In one of the examples, the generated neckline looks a little strange compared to the corresponding image in the paper, where it looks fine. Below are the details of the example:

[attached: our inference result and the corresponding example from the paper]

Additionally, I followed the guidelines mentioned in the README.md file. The warped clothes were downloaded from the provided link: https://github.com/bcmi/DCI-VTON-Virtual-Try-On/blob/20f75c60498ce1aeb5433b716f3b9cd5b4c71542/README.md?plain=1#L47

The checkpoint used was from the following link: https://github.com/bcmi/DCI-VTON-Virtual-Try-On/blob/20f75c60498ce1aeb5433b716f3b9cd5b4c71542/README.md?plain=1#L72

The only changes I made were to the test.sh file:

  1. I modified the checkpoint/data/output directories.
  2. I decreased n_samples from 8 to 1, as I only needed one sample per input due to time constraints. (It took us approximately 4 hours and 40 minutes to complete inference on the unpaired test set using an A100 server.)

I also found some failed cases: when changing from short sleeves to long sleeves, an incorrect sleeve shape is generated.

I added some lines in test.py to save the inputs of the pipeline, and they look okay. [attached: screenshots of the saved pipeline inputs]
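For reference, the lines I added to test.py were roughly like the sketch below. The batch keys used here are placeholders based on how such dataloaders typically name their outputs, not necessarily the exact keys in this repo:

import os
from torchvision.utils import save_image

def dump_pipeline_inputs(batch, out_dir="debug_inputs"):
    # Save each image-like tensor in the batch so the conditioning inputs
    # (person image, warped cloth, inpainting mask, ...) can be inspected visually.
    os.makedirs(out_dir, exist_ok=True)
    for key in ("image", "inpaint_image", "inpaint_mask", "warp_feat"):  # placeholder keys
        if key not in batch:
            continue
        tensor = batch[key].float().cpu()
        # Inputs are assumed to be normalized to [-1, 1]; rescale to [0, 1] for saving.
        if tensor.min() < 0:
            tensor = (tensor + 1.0) / 2.0
        save_image(tensor, os.path.join(out_dir, f"{key}.png"))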

Could you please help me verify whether the committed code and the uploaded checkpoints are correct? Any assistance would be greatly appreciated.

Limbor commented 10 months ago
  1. Regarding the inconsistent results, I have checked this on my device and there seems to be no problem; it may be caused by other environmental factors. [attached: result for 00654_00]
  2. These failed cases are due to the inpainting mask selection. In subsequent experiments, we updated the inpainting mask to a form similar to HR-VITON's, which alleviates this type of error well. [attached comparison: person, clothes, new version, old version]
wenhao728 commented 10 months ago

Thanks for your efforts and reply.

  1. Regarding the inconsistency, I will investigate further and try using another device.
  2. As for the failed cases:

When changing from short sleeves to long sleeves, an incorrect sleeve shape is generated.

  • Image: 00071_00.jpg
  • Clothing item: 02151_00.jpg [output image attached]

The inpainting mask could be one reason for this issue. We have also experimented with a more aggressive masking strategy, which has improved many problematic cases. However, there are still some failures, like the one mentioned above (image=00071_00.jpg and clothes=02151_00.jpg). One possibility we suspect is overfitting.
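To illustrate what we mean by a "more aggressive" masking strategy, here is a minimal sketch along those lines: dilating the cloth-agnostic inpainting mask so that it also covers the original sleeve region. The kernel size and the exact approach are illustrative only, not the code we actually used:

import cv2
import numpy as np

def dilate_inpaint_mask(mask: np.ndarray, kernel_size: int = 25) -> np.ndarray:
    # mask: H x W uint8 array, 255 inside the region to be inpainted.
    # An elliptical kernel grows the masked region so that short-sleeve arms
    # are also regenerated when the target garment has long sleeves.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    return cv2.dilate(mask, kernel, iterations=1)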

The unpaired images are from the test set, so in theory such issues should not occur. However, the VITON-HD dataset shares many models between the train and test splits; based on our observations, each model generally appears in more than three images with different garments. It is possible that the training set includes many images where the model in 00071_00.jpg, for example, is wearing short sleeves. Consequently, during the 40-epoch training of the UNet, which has over 800 million trainable parameters, the model overfits to this particular distribution, leading to the issue mentioned above.

Limbor commented 10 months ago

Overfitting may indeed be the cause of this error. However, in our previous experiments, the model's performance on the test set in the paired setting kept improving as the number of training epochs increased, so we ended up training for 40 epochs. Although we have not verified it carefully, there should be no overlapping data between the training and test sets, i.e., no samples in which both the person and clothing images are identical.
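A quick way to check for exact duplicates (assuming the standard VITON-HD layout with train_pairs.txt and test_pairs.txt; adjust the paths to your local copy) is to compare the filenames listed in the two pair files. This rules out identical person/cloth images appearing in both splits, although it cannot detect the same model photographed under different image IDs:

def load_names(pair_file):
    # Each line of a VITON-HD pair file is "<person image> <cloth image>".
    persons, cloths = set(), set()
    with open(pair_file) as f:
        for line in f:
            if line.strip():
                person, cloth = line.split()[:2]
                persons.add(person)
                cloths.add(cloth)
    return persons, cloths

train_p, train_c = load_names("VITON-HD/train_pairs.txt")
test_p, test_c = load_names("VITON-HD/test_pairs.txt")
print("shared person images:", len(train_p & test_p))
print("shared cloth images:", len(train_c & test_c))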

24thTinyGiant commented 9 months ago

@wenhao728 Can you tell me how you were able to run the model? I'm running it on Google Colab, but I'm getting some errors from pytorch-lightning. Please help.

wenhao728 commented 9 months ago

@wenhao728 Can you tell me how you were able to run the model? I'm running it on Google Colab, but I'm getting some errors from pytorch-lightning. Please help.

Hi @24thTinyGiant. I managed to run the code successfully on our laboratory server, and everything appears to be functioning well. Could you please share the error messages you're seeing?

Furthermore, if you would like to open a new issue outlining the problems you've faced, the other contributors and I would be more than happy to assist you and anyone else dealing with similar difficulties. :)

Here is my environment info, collected using a built-in method in PyTorch:

from torch.utils.collect_env import get_pretty_env_info

print(get_pretty_env_info())

Output:

PyTorch version: 1.11.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.5 (default, Sep  4 2020, 07:30:14)  [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-5.10.112-005.ali5000.alios7.x86_64-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 470.199.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] pytorch-lightning==1.4.2
[pip3] torch==1.11.0
[pip3] torch-fidelity==0.3.0
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.12.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               h2bc3f7f_2  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0            py38h7f8727e_0  
[conda] mkl_fft                   1.3.1            py38hd3c417c_0  
[conda] mkl_random                1.2.2            py38h51133e4_0  
[conda] numpy                     1.24.4                   pypi_0    pypi
[conda] pytorch                   1.11.0          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-lightning         1.4.2                    pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch-fidelity            0.3.0                    pypi_0    pypi
[conda] torchmetrics              0.6.0                    pypi_0    pypi
[conda] torchvision               0.12.0               py38_cu113    pytorch
24thTinyGiant commented 9 months ago

This is the error I'm getting in Google Colab every time I test the code:

/content/DCI-VTON-Virtual-Try-On
2023-12-04 06:57:16.348203: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-04 06:57:16.348257: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-04 06:57:16.348305: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-04 06:57:18.551409: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 23
Loading model from /content/DCI-VTON-Virtual-Try-On/viton512.ckpt
Global Step: 58240
LatentTryOnDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.54 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 64, 64) = 16384 dimensions.
making attention of type 'vanilla' with 512 in_channels
^C

After this, the code stops. Can you provide a solution?

wenhao728 commented 9 months ago

Thanks for the information. It looks like a compatibility issue between TensorFlow and CUDA. If you want to run the DCI-VTON method, you can create a new virtual environment following the instructions in the README rather than using Colab's default environment. That works well for me.

You don't need to worry about the TensorFlow issue, as this repo depends on PyTorch.

If you still want to resolve the TensorFlow warnings themselves, you may need to downgrade your package versions as suggested in https://github.com/tensorflow/tensorflow/issues/62075#issuecomment-1808652131

24thTinyGiant commented 9 months ago

Okay, thanks for your reply.

sachintha443 commented 2 months ago

@24thTinyGiant @wenhao728 I also got the same error in Colab. Did you fix it?

24thTinyGiant commented 2 months ago

The issue is due to the limited GPU RAM provided by Colab. You need to use a GPU with more RAM for inference.