Zheng-Chong / CatVTON

CatVTON is a simple and efficient virtual try-on diffusion model with 1) a lightweight network (899.06M parameters in total), 2) parameter-efficient training (49.57M trainable parameters), and 3) simplified inference (< 8 GB VRAM at 1024×768 resolution).

The results are different on Gradio vs Local inference #54

Closed AI-P-K closed 1 month ago

AI-P-K commented 1 month ago

Hello,

I have installed CatVTON locally: I downloaded the repository and created a new Anaconda environment. I created a custom dataset with the following structure:

  1. agnostic-mask -> SCHP
  2. cloth -> original clothing image resized to 1024x768
  3. cloth-mask -> SCHP
  4. image -> original person image resized to 1024x768
  5. image_parse-v3 -> CIHP_PGN
  6. openpose_img -> OpenPose
  7. openpose_json -> OpenPose

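As a sanity check before inference, the folder list above can be verified with a small script. This is only a sketch: the subdirectory names are taken from the list above, and the exact layout expected by CatVTON's dataloader may differ between repository versions.

```python
from pathlib import Path

# Subdirectories listed above (assumed VITON-HD-style layout; verify
# against the repository's own dataloader before relying on this).
EXPECTED = [
    "agnostic-mask",
    "cloth",
    "cloth-mask",
    "image",
    "image_parse-v3",
    "openpose_img",
    "openpose_json",
]

def missing_subdirs(root: str) -> list[str]:
    """Return the expected subdirectories that are absent under root."""
    root_path = Path(root)
    return [d for d in EXPECTED if not (root_path / d).is_dir()]
```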
Original input images: dude2 (original person image), 8 (original clothing image)

Processed data: dude2_mask (agnostic-mask), 8 (cloth-mask), dude2 (image_parse-v3), dude2_rendered (openpose_img), dude2_keypoints.json (openpose_json)

Results: dude2 -> local result (bad), dude2-dressed -> online result (good)

What steps am I missing? Is this a model issue? Can you provide some extra information about which models we should use?

Zheng-Chong commented 1 month ago

It seems your mask is not aligned with the expected ones. Try producing the agnostic masks by modifying the provided script to process your custom dataset.

AI-P-K commented 1 month ago

Hello,

I have tried to use preprocess_agnostic_mask.py as you suggested, but now I am a bit confused: this script expects a directory structure that does not match the one I provide below.

I thought all the data I need was already there, so let me add more details about my use case:

The command I use for inference is: `python inference.py --dataset "vitonhd" --data_root_path "test_data/" --output_dir "output/"`

The structure of my test_data is as follows:

  1. agnostic-mask -> output of SCHP
  2. cloth
  3. cloth-mask -> output of SCHP
  4. image
  5. image_parse-v3 -> output of CIHP_PGN
  6. openpose_img -> output of OpenPose
  7. openpose_json -> output of OpenPose

May I kindly ask how you figured out that my mask is not aligned with the expected ones?

P.S. Thank you for your fast reply.

Zheng-Chong commented 1 month ago
  1. Your mask is too close to the human body and does not match the shape of the target clothing, leading to shape issues.
  2. Your mask eliminates hand information, making it difficult for the model to infer the posture, leading to missing arms.
  3. Local inference uses the 512×384 model, while the Gradio app uses the 1024×768 model, leading to worse visual quality.
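Points 1 and 2 above come down to the agnostic mask being too tight around the body. Purely as an illustration of what "loosening" a mask means (the repository's own preprocess_agnostic_mask.py is the proper tool for real data), a minimal binary dilation in plain NumPy looks like this:

```python
import numpy as np

def dilate_mask(mask: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Binary-dilate a 2D 0/1 mask with a 3x3 square structuring element.

    Each iteration grows the masked region by one pixel in every
    direction, so the agnostic mask covers more than the tight body
    outline. Illustrative sketch only, not CatVTON's actual preprocessing.
    """
    out = mask.astype(bool)
    for _ in range(iterations):
        padded = np.pad(out, 1)
        neighborhood = np.zeros_like(out)
        # A pixel becomes 1 if any pixel in its 3x3 neighbourhood is 1.
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                neighborhood |= padded[1 + dy : padded.shape[0] - 1 + dy,
                                       1 + dx : padded.shape[1] - 1 + dx]
        out = neighborhood
    return out.astype(np.uint8)
```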
AI-P-K commented 1 month ago

Hello, on points 1 and 2 you were correct... I have used preprocess_agnostic_mask.py and now it dresses the model better. There are still some quality issues, like a distorted face and, sometimes, distorted clothing.


Do you suggest changing the local inference to 1024×768?

AI-P-K commented 1 month ago

So yes, I apologize for the silly question. You were right about all of the above; I appreciate your help and fast response. Have a fantastic day!

AI-P-K commented 1 month ago

P.S. Changing the inference resolution to 1024×768 locally solved the anomaly issues.
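The fix above amounts to feeding the model inputs at the resolution its checkpoint was trained for (1024×768 for the Gradio demo, 512×384 for the default local weights, per the maintainer's comment). As an illustrative sketch of that resizing step (`resize_nearest` is a hypothetical helper, not part of CatVTON; real pipelines would use a proper interpolation method), a nearest-neighbour resize in plain NumPy:

```python
import numpy as np

def resize_nearest(img: np.ndarray, height: int, width: int) -> np.ndarray:
    """Nearest-neighbour resize of an HxW or HxWxC image array.

    Maps each output pixel back to the closest source pixel, e.g. to
    bring person/cloth images to the checkpoint's training resolution.
    """
    rows = (np.arange(height) * img.shape[0] / height).astype(int)
    cols = (np.arange(width) * img.shape[1] / width).astype(int)
    return img[rows][:, cols]
```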

Zheng-Chong commented 1 month ago

This depends on what your purpose for local inference is. If it's for generating a large number of high-quality results in parallel, 1024 would be better. If it's for calculating metrics, whether to use 512 or 1024 depends on the baseline you want to compare.

Zheng-Chong commented 1 month ago

I'm glad your issues have been resolved 😄!