Markin-Wang / XProNet

[ECCV2022] The official implementation of Cross-modal Prototype Driven Network for Radiology Report Generation
Apache License 2.0

Are the cross-modal feature and the cross-modal representation vector the same? #21

Open CHENG-danyang opened 5 days ago

CHENG-danyang commented 5 days ago

In your paper you write: "we concatenate the visual and textual representations to form the cross-modal features $$r\in \mathbb{R}^{1\times D}$$", but the formula below it reads $$o_u=Concat(o_u^{i(f)},o_u^t)$$. Are they the same vector? Also, in the formula $$PM(k,i)=\frac{1}{N_{k,i}^s}\sum_{j=0}^N r_j^{k,i}$$, what does $$N_{k,i}^s$$ mean? I could not find these details in the source code. My understanding is that you first extract the visual and textual representations and concatenate them to form the cross-modal feature $$r_u=Concat(o_u^{i(f)},o_u^t)$$, then group the features into $$N_l$$ sets $$\{R_k;\ 0 \le k \le N_l\}$$ according to the sample labels, and then apply K-Means to each $$R_k$$, splitting it into $$N^p$$ clusters. Finally, the average of the vectors within each cluster is taken as the prototype vector $$PM(k,i)$$. Is this understanding correct?

Markin-Wang commented 3 days ago

Hi, thank you for your interest in our work. Both o and r denote cross-modal features. We use two symbols because o_u is associated with a specific sample u, while r indexes the cross-modal features after clustering.

Sorry, $N^s_{k,i}$ is a typo; it should be $N^d_{k,i}$.
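
Substituting the correction into the formula quoted in the question gives the expression below. Reading $N^d_{k,i}$ as the number of cross-modal features assigned to cluster $i$ of category $k$ is an assumption, based on the averaging step described in the question; it is not stated explicitly in this thread. The summation bounds are kept exactly as quoted.

$$PM(k,i)=\frac{1}{N_{k,i}^d}\sum_{j=0}^{N} r_j^{k,i}$$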

You are right: the prototype initialization procedure is as you summarized.
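
The procedure confirmed above can be sketched roughly as follows. This is a minimal NumPy illustration, not code from the XProNet repository. The function names `kmeans` and `init_prototypes`, the assumption of a single integer label per sample, the hand-written Lloyd's K-Means, and the zero vector used for an empty cluster are all hypothetical choices made for this example.

```python
import numpy as np


def kmeans(points, n_clusters, n_iters=20, seed=0):
    """Plain Lloyd's K-Means; returns a cluster index for every row of `points`."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct samples chosen at random
    # (requires at least n_clusters rows in `points`).
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    assign = np.zeros(len(points), dtype=int)
    for _ in range(n_iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        # Move each center to the mean of its members; an empty cluster keeps its center.
        for i in range(n_clusters):
            members = points[assign == i]
            if len(members) > 0:
                centers[i] = members.mean(axis=0)
    return assign


def init_prototypes(visual, textual, labels, n_labels, n_protos):
    """Return initial prototypes with shape (n_labels, n_protos, D)."""
    # r_u = Concat(o_u^{i(f)}, o_u^t): one cross-modal feature per sample.
    r = np.concatenate([visual, textual], axis=1)
    pm = np.zeros((n_labels, n_protos, r.shape[1]))
    for k in range(n_labels):
        r_k = r[labels == k]  # R_k: features of the samples carrying label k
        assign = kmeans(r_k, n_protos)
        for i in range(n_protos):
            members = r_k[assign == i]
            # PM(k, i): mean of the features in cluster i of category k.
            # An empty cluster is left as a zero vector (assumption for this sketch).
            if len(members) > 0:
                pm[k, i] = members.mean(axis=0)
    return pm


# Toy usage with random stand-ins for the visual and textual representations.
rng = np.random.default_rng(0)
visual = rng.normal(size=(40, 4))
textual = rng.normal(size=(40, 3))
labels = np.arange(40) % 2
pm = init_prototypes(visual, textual, labels, n_labels=2, n_protos=3)
print(pm.shape)  # (2, 3, 7)
```

Each category needs at least `n_protos` samples for this initialization to run, because the starting centers are drawn without replacement.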

I hope this helps you resolve your question.

Best Regards, Jun

CHENG-danyang commented 1 day ago

Your reply helped me a lot, and your work is great.