Closed zzzucf closed 2 years ago
Great question! In these cases, since you do alignment of the features, you would first learn an alignment matrix that maps from e.g. 2048 to 512 dimensions, apply this alignment to the classifier vectors, and then compute the cosine similarity in the 512-dimensional space.
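A minimal sketch of this procedure (not the repo's actual code; the synthetic features, the least-squares fit, and the names `M`, `w`, `u` are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired features for the same images from two networks:
# f_big is 2048-dim (e.g. ResNet-50), f_small is 512-dim (e.g. ResNet-18).
n = 1000
f_small = rng.standard_normal((n, 512))
true_map = rng.standard_normal((512, 2048))        # synthetic ground truth
f_big = f_small @ true_map + 0.01 * rng.standard_normal((n, 2048))

# Learn an alignment matrix M (2048 -> 512) by least squares:
# minimize ||f_big @ M - f_small||^2 over M.
M, *_ = np.linalg.lstsq(f_big, f_small, rcond=None)  # M has shape (2048, 512)

# Apply the alignment to a 2048-dim classifier vector, then compute the
# cosine similarity with a 512-dim carrier direction in the smaller space.
w = rng.standard_normal(2048)   # classifier vector of the larger network
u = rng.standard_normal(512)    # carrier (watermark) direction
w_aligned = w @ M               # now 512-dimensional

cos = w_aligned @ u / (np.linalg.norm(w_aligned) * np.linalg.norm(u))
print(cos)
```

The key point is only the shapes: once the classifier vector is pushed through `M`, both vectors live in the same 512-dimensional space and the cosine is well defined.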
I thought about what you proposed, and what concerns me most is that the transformation M would not have a unique solution, since it is no longer dxd but dxd' with d >> d'. In that case there exist multiple transformations, which might represent completely different directions in which to transform the classifier. So how can you explain that the radioactive mark would still work when there exist multiple transformed classifier watermarks?
Given that the alignment goes in the "good" direction (i.e. reducing from the larger dimension d to the smaller d'), I don't expect this to be a problem.
Hi, in Section 5.4, Architecture transfer, of the original paper, Table 3 presents results for different architectures. However, ResNet-50, DenseNet-121, and VGG-16 have different feature sizes: 2048, 1024, and 4096, respectively. How did you compute the cosine similarity when the feature sizes differ? And why do we need a p-value instead of the angle from the cosine similarity? Wouldn't the angle be more direct?
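On the p-value question, one way to see why the raw angle is hard to interpret on its own: the cosine between a fixed direction and a random vector concentrates around 0 at a rate that depends on the dimension, so the same angle can be highly significant in one feature space and unremarkable in another. A small Monte Carlo sketch (illustrative only, not the paper's test statistic):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cosines(d, trials=5000):
    """Cosines between a fixed direction and random unit vectors in R^d."""
    u = np.zeros(d)
    u[0] = 1.0                                   # fixed reference direction
    v = rng.standard_normal((trials, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # normalize rows
    return v @ u

c512 = random_cosines(512)
c2048 = random_cosines(2048)

# Random cosines center on 0, and their spread shrinks roughly like
# 1/sqrt(d), so a given cosine value is "rarer" in higher dimension.
print(c512.std(), c2048.std())
```

This is why a p-value (the probability of observing at least that cosine under the null hypothesis of a random direction) is a more comparable measure across architectures than the angle itself.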