FutabaSakuraXD / Farewell-to-Mutual-Information-Variational-Distiilation-for-Cross-Modal-Person-Re-identification


A question about the KL divergence computation in VSD #5

Closed caixincx closed 1 year ago

caixincx commented 2 years ago

Hello author, in VSD you compute the KL divergence between p(y|v) and p(y|z); the code is as follows:

vsd_loss = kl_div(input=self.softmax(i_observation[0].detach() / self.args.temperature),
                  target=self.softmax(i_representation[0] / self.args.temperature))

However, as I understand it, the KL divergence between p(y|v) and p(y|z) should be computed as follows:

vsd_loss = kl_div(input=torch.nn.LogSoftmax(dim=1)(i_representation[0] / self.args.temperature),
                  target=self.softmax(i_observation[0].detach() / self.args.temperature))

Why did you choose the first implementation in your code? Are the two implementations equivalent?
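As a quick sanity check, the following minimal sketch compares the two calls on random logits; obs, rep, and T are made-up stand-ins for i_observation[0], i_representation[0], and self.args.temperature, and reduction='batchmean' is added only so the resulting values are well defined:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 1.0                                       # stands in for self.args.temperature
obs = torch.randn(4, 10)                      # stands in for i_observation[0]
rep = torch.randn(4, 10, requires_grad=True)  # stands in for i_representation[0]

# Form 1 (as in the repository): softmax probabilities passed as `input`.
loss1 = F.kl_div(input=F.softmax(obs.detach() / T, dim=1),
                 target=F.softmax(rep / T, dim=1),
                 reduction='batchmean')

# Form 2 (proposed): log-probabilities as `input`, probabilities as `target`.
loss2 = F.kl_div(input=F.log_softmax(rep / T, dim=1),
                 target=F.softmax(obs.detach() / T, dim=1),
                 reduction='batchmean')

print(loss1.item(), loss2.item())  # the two values differ, so the calls are not equivalent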

Qinying-Liu commented 1 year ago

I find this strange as well. The first form is not only formally incorrect, it also appears to provide no gradient at all.

caixincx commented 1 year ago

Yes. In my experiments I found that the loss from this term is very small and has almost no effect.

FutabaSakuraXD commented 1 year ago

Check the following references before using the KL divergence in PyTorch:

https://zhuanlan.zhihu.com/p/575809052?utm_id=0

https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html#torch.nn.KLDivLoss

Or try your own formulation if you want :)


Qinying-Liu commented 1 year ago

Hi FutabaSakuraXD, thank you for your reply. For the standard form of torch.nn.KLDivLoss or F.kl_div, I think we first have to apply a log operation to the input. Moreover, the gradient of the input should not be detached.
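For reference, here is a minimal sketch of the usage described in the KLDivLoss documentation (the variable names and temperature are placeholders, not the repository's code): input is expected as log-probabilities, target as probabilities, and gradients then reach the branch that is not detached.

import torch
import torch.nn.functional as F

T = 4.0                                                # illustrative temperature
rep_logits = torch.randn(8, 100, requires_grad=True)   # branch that should be trained
obs_logits = torch.randn(8, 100)                       # branch treated as the target

# F.kl_div expects `input` as log-probabilities and `target` as probabilities.
loss = F.kl_div(input=F.log_softmax(rep_logits / T, dim=1),
                target=F.softmax(obs_logits.detach() / T, dim=1),
                reduction='batchmean')
loss.backward()
print(rep_logits.grad is not None)  # True: gradients reach the non-detached branch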

FutabaSakuraXD commented 1 year ago

Gradients of the observation are detached to prevent degeneration.