dddb11 / MVSS-Net

Unofficial PyTorch implementation of MVSS-Net (ICCV 2021), including training code.

Attention head implementation is different from the paper #13

Open · Pointy-Hat opened 9 months ago

Pointy-Hat commented 9 months ago

I have gone through the code for the attention head, and it seems to me that it is wildly different from what is described in the paper. It starts with 3x3x1024 convolutions that take up over 50% of the model's parameters. The whole thing is bizarre, and it even includes 1x1x1024 convolutions at the end of both sub-heads. Also, the residual connection from the branch outputs is missing.

An illustration that shows the difference:

[Screenshot 2024-01-08 at 16 32 59: illustration of the difference]

Maybe this is the reason for the non-reproducible results? I don't think it is, but I would be curious to find out.
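
For concreteness, here is a minimal PyTorch sketch of the kind of 3x3 entry convolutions described above. The channel widths are my assumptions for illustration, not the repo's exact values, but they show why layers of this shape dominate the parameter count:

```python
from torch import nn

# Assumed widths for illustration: a 1024-channel backbone feature map,
# reduced to 256 channels before each attention sub-head.
in_ch, inter_ch = 1024, 256

entry_pos = nn.Sequential(  # feeds the position-attention sub-head
    nn.Conv2d(in_ch, inter_ch, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(inter_ch),
    nn.ReLU(inplace=True),
)
entry_chan = nn.Sequential(  # feeds the channel-attention sub-head
    nn.Conv2d(in_ch, inter_ch, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(inter_ch),
    nn.ReLU(inplace=True),
)

# A single 3x3 entry conv already holds in_ch * inter_ch * 9 weights:
n = sum(p.numel() for p in entry_pos.parameters())
print(f"{n:,} parameters per entry block")  # ~2.36M with these widths
```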

dddb11 commented 9 months ago

> I have gone through the code for the attention head, and it seems to me that it is wildly different from what is described in the paper. [...] Maybe this is the reason for the non-reproducible results? I don't think it is, but I would be curious to find out.

The model code comes from the official codebase (https://github.com/dong03/MVSS-Net), so I had not noticed this before. But I checked the code, and the first layer and the residual connection in the DAHead are indeed different from what is described in the paper.
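
For anyone comparing the two, a minimal sketch of the difference as I read it (the function and argument names are illustrative, not the repo's actual identifiers):

```python
# `feat` is a branch's feature map; `attn` is an attention sub-head.

def head_as_in_paper(feat, attn):
    # The paper keeps a residual connection from the branch output:
    return attn(feat) + feat

def head_as_implemented(feat, entry_conv, attn, out_conv):
    # The official code instead wraps the attention between a 3x3 entry
    # conv and a trailing 1x1 conv, with no residual add of `feat`:
    return out_conv(attn(entry_conv(feat)))
```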

Pointy-Hat commented 9 months ago

I noticed that in the original repo too, but since the original developers don't respond, I chose to share this information with you instead.

dddb11 commented 9 months ago

> I noticed that in the original repo too, but since the original developers don't respond, I chose to share this information with you instead.

I had not noticed that the DAHead occupies such a large portion of the model's parameters. And it is interesting that a vanilla FCN with a DAHead is able to achieve good performance.
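
A quick way to check the proportion is to compare parameter counts directly; a minimal sketch, where `MVSSNet` and the `dahead` attribute are placeholders for however the model and head are actually named in this codebase:

```python
from torch import nn

def param_share(model: nn.Module, head: nn.Module) -> float:
    """Fraction of `model`'s parameters that live in `head`."""
    head_n = sum(p.numel() for p in head.parameters())
    total_n = sum(p.numel() for p in model.parameters())
    return head_n / total_n

# Hypothetical usage (placeholder names):
# net = MVSSNet()
# print(f"DAHead parameter share: {param_share(net, net.dahead):.1%}")
```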
