Zhangjinso / PISE


Problems about the generated target parsing results of your pre-trained model #21

Closed happyday521 closed 3 years ago

happyday521 commented 3 years ago

Hi, I have a problem with the generated target parsing results of your pre-trained model for human pose transfer. Using your pre-trained checkpoint, I visualized the generated target parsing results (i.e., self.parsav in class Painet(BaseModel)). As shown in the figure, however, there seem to be some problems. [Figure 1]

1. It seems that the ParsingNet can only effectively generate parsing maps for a few regions (e.g., '3': upper clothes and '5': lower clothes (pants, shorts)), but cannot handle other regions (e.g., skin, face, hair, etc.).
2. The generated target parsing result appears offset to the left relative to the GT; it should be located in the middle of the image. In other words, the generated target parsing result is not spatially aligned with the input target pose (i.e., self.input_BP2). In fact, using your pre-trained checkpoint, the generated target image is also offset to the left relative to the GT, as shown in the figure.
[Figure 2]

I'm not sure whether this is a problem with your model; please check it. I would be very grateful if you could provide your visualization results!

Thanks!

Zhangjinso commented 3 years ago

Hi, I also noticed this visual result when we did this project. I guess it may be due to using the logits as the 'parsing result' for the image generator in human pose transfer, which may lead to a lower loss with only a few regions. This is also why we argue about 'clothing' in the paper (like 'decouple shape and style of clothing'). It can be fixed by training the texture transfer model for image editing; however, you may get slightly worse results for human pose transfer. For the second problem, it is true that the checkpoint was trained using a 256×256 pose input due to an oversight. Though the parsing generator gets the expected results, this leads to another alignment problem between inputs and outputs. The parsing generator can be finetuned to get more stable results. I am working on addressing these drawbacks and trying to improve this work. Thank you for your interest~
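
For illustration, here is a minimal sketch of the two ways the parsing output can be fed to the image generator, as discussed above. The function and variable names are hypothetical, not the repo's; only the logits-vs-hard-parsing distinction comes from this thread.

```python
import torch
import torch.nn.functional as F

def parsing_for_generator(par_logits, hard=False):
    """Sketch: par_logits is the ParsingNet output, shape (B, C, H, W)."""
    if hard:
        # Image-editing / texture-transfer style: collapse the logits to a
        # clean per-pixel label map, then one-hot encode it.
        labels = par_logits.argmax(dim=1)  # (B, H, W)
        return F.one_hot(labels, par_logits.size(1)).permute(0, 3, 1, 2).float()
    # Pose-transfer style: pass the raw (soft) logits straight through,
    # which can lower the loss while visually favoring a few dominant
    # regions (e.g., upper/lower clothes).
    return par_logits
```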

happyday521 commented 3 years ago

Thanks for your reply! I still have some questions. 1. For problem 1, according to your explanation, the reason you train the texture transfer model (by commenting out lines 177 and 178 and uncommenting lines 162-176) is to improve the quality of the generated parsing result for better image editing. Am I right?

In contrast, when training the human pose transfer model, using the logits as the 'parsing result' (by commenting out lines 162-176 and uncommenting lines 177 and 178) aims to get a lower loss, although the parsing does not look good.

2. For problem 2, what does "256×256 pose input" mean? Should your model have been trained with a 256×176 pose input? Besides, following your README and using your current checkpoint, I can't reproduce the test results you provided. Can you provide the pre-trained checkpoint corresponding to those test results?

Thanks very much!

Zhangjinso commented 3 years ago

You can re-test the model after changing `old_size` in data/fashion_data.py from (256, 256) to (256, 176); then you will get the results we provided. This setting is from GFLA, and I trained and tested the model with it. It does not lose pose information, but if you train the model with this setting, there will be an alignment problem between the pose and the parsing. So I changed it to (256, 256), which makes the pose and parsing align. This does not matter if you re-train the model.

Anyway, change this parameter and get the results.
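
Concretely, the change is a one-line edit. This is a sketch: only the `old_size` name and the two values come from the thread; the surrounding code in data/fashion_data.py is paraphrased.

```python
# data/fashion_data.py
# Setting used to match the released checkpoint (GFLA convention):
old_size = (256, 176)
# Setting that keeps pose and parsing aligned if you re-train:
# old_size = (256, 256)
```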

happyday521 commented 3 years ago

Got it! Thanks very much.

Mathilda88 commented 3 years ago

Hi @eternitjl

Could you please walk me through how to visualize the parsing maps? I would highly appreciate it if you could write a simple code snippet.

Thank you in advance,

Zhangjinso commented 3 years ago

The function tensor2im can do that. You can visualize the parsing map by changing 'need_dec' from 'False' to 'True'. Note that the image_tensor you pass in should be a single-channel label map obtained after 'argmax'.
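
A minimal sketch of that workflow, assuming `parsav` holds parsing logits of shape (B, C, H, W) and that tensor2im takes a `need_dec` flag as described; the exact signature in util/util.py may differ.

```python
import torch
from PIL import Image
from util.util import tensor2im  # PISE utility mentioned above

par_logits = model.parsav                 # (B, C, H, W) parsing logits
labels = torch.argmax(par_logits, dim=1)  # (B, H, W) per-pixel label map
# need_dec=True decodes the label map into a color parsing image.
parsing_vis = tensor2im(labels[0], need_dec=True)
# Assuming a uint8 numpy array is returned:
Image.fromarray(parsing_vis).save('parsing_vis.png')
```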

Mathilda88 commented 3 years ago

Thank you for your response

jiaxiangshang commented 3 years ago

> You can re-test the model after changing `old_size` in data/fashion_data.py from (256, 256) to (256, 176); then you will get the results we provided. This setting is from GFLA, and I trained and tested the model with it. It does not lose pose information, but if you train the model with this setting, there will be an alignment problem between the pose and the parsing. So I changed it to (256, 256), which makes the pose and parsing align. This does not matter if you re-train the model.
>
> Anyway, change this parameter and get the results.

Hi, Jinsong @Zhangjinso, really thanks for your nice work and clean code. For the "old_size" problem, I want to make it clearer, since I am going to train the model to reproduce your results.

  1. You use (256, 256) as "old_size" in training because of the alignment from pose to parsing; the related function is "cords_to_map", which only affects the pose map size (see the sketch after this list).
  2. For (256, 176), you tell us to test with this setting, but if we train with (256, 256), how can we test with (256, 176)?
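
For reference, here is a self-contained sketch of a GFLA-style cords_to_map, showing where old_size enters; the implementation details are assumed, not copied from the repo.

```python
import numpy as np

MISSING_VALUE = -1  # assumed marker for absent keypoints

def cords_to_map(cords, img_size, old_size, sigma=6):
    """Render (y, x) keypoints annotated at `old_size` resolution as
    Gaussian heatmaps on an `img_size` canvas."""
    result = np.zeros(img_size + (cords.shape[0],), dtype='float32')
    yy, xx = np.meshgrid(np.arange(img_size[0]), np.arange(img_size[1]),
                         indexing='ij')
    for i, (y, x) in enumerate(cords.astype(float)):
        if y == MISSING_VALUE or x == MISSING_VALUE:
            continue
        # This rescaling is the step old_size controls: it maps annotated
        # coordinates onto the heatmap canvas, so a mismatched old_size
        # shifts/stretches the rendered pose relative to the parsing maps.
        y = y / old_size[0] * img_size[0]
        x = x / old_size[1] * img_size[1]
        result[..., i] = np.exp(-((yy - y) ** 2 + (xx - x) ** 2)
                                / (2 * sigma ** 2))
    return result
```
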
Zhangjinso commented 3 years ago

The pose should align with the parsing. I trained this model with (256, 176), which causes the misalignment between poses and parsings. Though the CNN model can deal with this regular pattern, it is better to make them aligned if you re-train it. Therefore, if you would like to test with my pre-trained model, you need to keep the setting consistent with mine (256, 176). If you would like to test with your own checkpoint, just keep your training setting.

jiaxiangshang commented 3 years ago

OK, it's really clear now. Thank you so much.