NVlabs / DG-Net

:couple: Joint Discriminative and Generative Learning for Person Re-identification. CVPR'19 (Oral) :couple:
https://www.zdzheng.xyz/publication/Joint-di2019

why detach the gradient of appearance codes? #80

Closed · SDret closed this issue 10 months ago

SDret commented 1 year ago

Thanks in advance for this wonderful work! When I read the paper, Eq. 4-6 for the cross-identity losses seem clear. However, when I examine the code, I find that the gradient of the appearance code is actually detached (line #142 in reIDmodel.py). This means the only loss imposed on the appearance encoder is the re-ID loss, i.e., the appearance codes are features learned exclusively for the re-ID task. This raises a problem with Eq. 4-6: they require the re-ID feature and predicted probability to be the same for x_a and x_ab, where x_ab takes its identity from x_b (see Eq. 9, which implies that the ID of x_ab depends on b rather than a) and its appearance from x_a. In other words, Eq. 4-6 require a plain re-ID model to output the same prediction for two images of different persons.
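
For concreteness, the detach pattern I mean looks roughly like the following minimal PyTorch sketch (hypothetical module and variable names, not the actual reIDmodel.py): the appearance code feeds the re-ID classifier with gradients intact, while the copy handed to the generation side is detached, so Eq. 4-6 cannot update the appearance encoder.

```python
import torch
import torch.nn as nn

class ToyAppearanceBranch(nn.Module):
    """Minimal sketch of the detach pattern; not the repo's actual code."""
    def __init__(self, in_dim=2048, feat_dim=512, num_ids=751):
        super().__init__()
        self.app_encoder = nn.Linear(in_dim, feat_dim)  # appearance encoder
        self.classifier = nn.Linear(feat_dim, num_ids)  # re-ID classifier

    def forward(self, backbone_feat):
        f = self.app_encoder(backbone_feat)
        logits = self.classifier(f)  # re-ID loss backprops into app_encoder
        f_gen = f.detach()           # copy sent to the decoder: gradients of
                                     # Eq. 4-6 never reach app_encoder
        return logits, f_gen
```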

So, intuitively, when running the given code, the ID in the generated x_ab would inevitably shift away from x_b, since x_ab is required to contain the ID information of both x_a and x_b. In my experiments I observe a similar phenomenon: a man is transferred into a woman, or vice versa. So the feature detaching seems confusing.

If the gradient of the appearance code were not detached, Eq. 4-6 would seem reasonable, since the encoder could then be tasked with learning an identity-irrelevant appearance code. So I wonder why the appearance code's gradient is detached from the decoder and discriminators.

Looking forward to your response :D

layumi commented 1 year ago

Hi @SDret

Thank you for your attention to our paper.

Yes. In practice,

  1. We do not want the data generation to affect the appearance part, so we do not allow the generation loss to optimize the ABNet via f in the code (line #142 in reIDmodel.py); see the sketch below.

  2. We only use the re-ID loss (classification loss) to update the appearance model parameters.

  3. If we allowed the appearance and structure to be learned simultaneously, it could also lead to over-fitting, since it would give too much "freedom" for the reconstruction. So fixing the appearance is good for training and good for initialization.

Thank you.
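
A quick toy check of points 1-2 (assumed names, not our actual code): a loss computed through the detached appearance code updates the decoder but leaves the appearance encoder untouched, while the re-ID loss still reaches the encoder.

```python
import torch
import torch.nn as nn

# Toy check (assumed names): the generation loss reaches the decoder but not
# the appearance encoder, because the code is detached in between.
enc, dec = nn.Linear(4, 3), nn.Linear(3, 4)
x = torch.randn(2, 4)

f = enc(x)                                       # appearance code
gen_loss = (dec(f.detach()) - x).pow(2).mean()   # stands in for Eq. 4-6 terms
id_loss = f.pow(2).mean()                        # stands in for the re-ID loss

gen_loss.backward()
print(enc.weight.grad)                # None: the detach blocked this path
id_loss.backward()
print(enc.weight.grad is not None)    # True: only the re-ID loss updates enc
```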

SDret commented 1 year ago

@layumi Thanks for the quick response!

I understand the purposes you listed for the gradient detaching. My point is that, although it is necessary and works well for performance, it carries some potential risk for the downstream use of the generated data as augmentation for re-ID (i.e., the fine-grained learning introduced in the paper).

To be specific, the paper presents a figure in which a man in red clothing is generated. With Eq. 4-6, this generated image is expected to be identified as the 'girl' who supplies only the appearance code of the red clothing. Since the appearance encoder is trained exclusively by the re-ID loss, satisfying Eq. 4-6 means that ID information of the girl beyond appearance (gender, body size, shape, etc.) would be embedded into the generated image to maximally lower her re-ID loss. This in turn inevitably makes the man in the generated image less similar to the man in the original image, i.e., the pedestrian in the generated image would be a mix of the girl and the man.

Since the structure encoder is much more low-level, I would guess that the ID of the image providing the structure code dominates the generated image, which would explain why the generated person looks closer to the man. However, in some cases in my experiments, I find that the ID in the generated image actually shifts away from the one providing the structure code, e.g., from a man to a woman. So I am worried that the generated image can be quite ambiguous in terms of the identity of the person within, which could cause unexpected harm if we use it for the data augmentation mentioned in fine-grained learning.

Sorry for the lengthy message; thanks again for the response, it's truly a wonderful work!

layumi commented 1 year ago

Thank you @SDret

Yes. There has to be some trade-off, since we mix the data without strong ground truth (i.e., how the generated image should look).

One thing you may also want to consider is the decoder. We use an AdaIN-based decoder (earlier than StyleGAN2), which makes relatively few modifications to the low-level part (body size, shape).
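
For reference, here is a generic AdaIN sketch (not our decoder code; in DG-Net the scale and shift come from an MLP over the appearance code, as in MUNIT-style decoders). Since AdaIN exchanges only channel-wise statistics, the low-level spatial layout is largely preserved.

```python
import torch

def adain(content, style, eps=1e-5):
    """Adaptive Instance Normalization (Huang & Belongie, 2017).

    Normalizes each channel of `content`, then rescales/shifts it with the
    channel-wise statistics of `style`. Only per-channel mean and std are
    exchanged, so the spatial layout (body size, shape) is preserved.
    content, style: tensors of shape (N, C, H, W).
    """
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean
```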

SDret commented 1 year ago

Yes, since only the style codes are passed into the generator, the mixing of ID information should be negligible. Thank you for your answer and for bringing us such an inspiring work!