First of all, congrats on the impressive work! The image reconstruction sanity check is highly inspiring.
I have a question about why PackNet uses a 3D convolution.
As I understand it, PackNet wants to blend the 2x2 spatial content that is now scattered across the channel dimension, and it uses the third dimension of the 3D convolution to mix across those channels. Would a grouped convolution make more sense for this purpose?
Another comment: the paper mentions that "2D conv are not designed to directly leverage the tiled structure of this feature space, instead, we propose to first learn to expand this structured representation via a 3d conv layer." I could not find where the ablation study demonstrates this. I only see that the results improve with the 3D convolution, but could that simply be due to the increased number of parameters in the model?
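To make the parameter question concrete, here is a rough count sketched in plain Python. It assumes the packing block is space-to-depth, then a 3D conv with 1 input channel and `d` output channels, then a reshape, then a 2D conv. The sizes (`c=64`, `c_out=64`, `d=8`, kernel size 3) and all function names are my own illustrative assumptions, not taken from the paper. The grouped variant is only one hypothetical way to realize the group-conv idea, with one group per original channel so that each group sees its four 2x2 sub-pixel channels.

```python
def conv2d_params(c_in, c_out, k, groups=1, bias=True):
    """Learnable parameters of a k x k 2D convolution."""
    assert c_in % groups == 0 and c_out % groups == 0
    weights = c_out * (c_in // groups) * k * k
    return weights + (c_out if bias else 0)


def conv3d_params(c_in, c_out, k, bias=True):
    """Learnable parameters of a k x k x k 3D convolution."""
    weights = c_out * c_in * k ** 3
    return weights + (c_out if bias else 0)


def packing_params_3d(c, c_out, d, k=3):
    """Assumed block: space-to-depth -> Conv3d(1 -> d) -> reshape -> Conv2d(4*c*d -> c_out)."""
    return conv3d_params(1, d, 3) + conv2d_params(4 * c * d, c_out, k)


def packing_params_2d(c, c_out, k=3):
    """Baseline: space-to-depth -> Conv2d(4*c -> c_out), no 3D conv."""
    return conv2d_params(4 * c, c_out, k)


def packing_params_grouped(c, c_out, d, k=3):
    """Hypothetical variant: grouped Conv2d(4*c -> 4*c*d, groups=c) -> Conv2d(4*c*d -> c_out)."""
    expand = conv2d_params(4 * c, 4 * c * d, k, groups=c)
    return expand + conv2d_params(4 * c * d, c_out, k)


if __name__ == "__main__":
    c, c_out, d = 64, 64, 8  # illustrative sizes only
    print("3D packing:     ", packing_params_3d(c, c_out, d))
    print("2D only:        ", packing_params_2d(c, c_out))
    print("grouped variant:", packing_params_grouped(c, c_out, d))
    print("3D conv alone:  ", conv3d_params(1, d, 3))
```

Under these assumptions the 3D conv itself adds only 224 parameters, while the 2D conv that follows it grows from 147,520 to 1,179,712 parameters because its input has `d` times more channels. If that is right, a parameter-matched 2D baseline in the ablation would help separate the two effects.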
Thank you very much for your insights!