DirtyHarryLYL / Transferable-Interactiveness-Network

Code for Transferable Interactiveness Knowledge for Human-Object Interaction Detection. (CVPR'19, TPAMI'21)
MIT License

2 questions about CNN architecture #4

Closed csyanbin closed 5 years ago

csyanbin commented 5 years ago

Really wonderful work!

I have 2 questions about the architecture details. (1) Sec. 5.2 says, "Relatively, the spatial stream is composed of two convolutional layers with max pooling, and two 1024 sized FCs", so two 1024-sized FCs are used in the spatial stream of C. In Figure 3, however, it seems that four are used, as in the H and O streams.

Which one is correct?

(2) In the iCAN paper, the residual block has 2048 channels. As I understand it, you used 1024 channels in your paper instead, followed by four 1024-sized FCs. Am I right?

DirtyHarryLYL commented 5 years ago
  1. If you are working on HICO-DET, two 1024 FCs may be better. PS: when using more data to train P, a bigger capacity is sometimes better. We have tried both two and four FCs. Performance may differ across datasets (e.g. HICO-DET, V-COCO, Open Images), especially when P and C are trained jointly.
  2. res-50 block -- 2048-channel tensor -- GAP -- 1024 FCs -- ...
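
To make the capacity point concrete, here is a rough sketch in plain Python that traces the shapes through one stream head in the order given above (res-block output, GAP, then 1024-wide FCs) and counts FC parameters for the two-FC and four-FC variants. It is not taken from the repository: the function name `trace_stream_head`, the 7x7 spatial size, and the assumption that every FC has a bias term are all hypothetical, and whatever follows the FCs (the `...` above) is left out.

```python
# Shape and parameter trace for one stream head, following the order
# res-50 block -> 2048-channel tensor -> GAP -> 1024 FCs.
# Assumptions (not from the thread): 7x7 spatial size, FC layers with biases.
# Layers after the FCs are elided in the thread and are not modelled here.

def trace_stream_head(height, width, channels=2048, fc_width=1024, num_fcs=2):
    """Return (steps, fc_params) for a GAP + FC stack.

    steps is a list of (name, output_shape) tuples; fc_params counts the
    weights and biases of the FC layers only.
    """
    steps = [("res-block output", (height, width, channels))]
    # Global average pooling collapses the two spatial dimensions.
    steps.append(("GAP", (channels,)))
    fc_params = 0
    in_features = channels
    for i in range(num_fcs):
        fc_params += in_features * fc_width + fc_width  # weights + biases
        steps.append((f"fc{i + 1}", (fc_width,)))
        in_features = fc_width
    return steps, fc_params


if __name__ == "__main__":
    for n in (2, 4):
        steps, params = trace_stream_head(7, 7, num_fcs=n)
        chain = " -> ".join(name for name, _ in steps)
        print(f"{n} FCs: {chain} | {params} FC parameters")
```

Under these assumptions the two-FC head has 3,147,776 FC parameters and the four-FC head has 5,246,976, which gives a sense of how much capacity the two extra layers add.
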
csyanbin commented 5 years ago

Thanks for the reply.

In P module,

                  H-stream -- res-block -- 2fcs
                  O-stream -- res-block -- 2fcs
                  SP-stream -- conv -- 2fcs

In C module,

                  H-stream -- res-block -- 4fcs
                  O-stream -- res-block -- 4fcs
                  S-stream -- conv -- 2fcs

Is this structure used in the paper?
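
For easier comparison, the same layout can be written down as a small Python dictionary. This is only a restatement of the listing above, not code from the repository: the names `LAYOUT` and `fc_depth_differences` are made up for illustration, and the stream labels are kept exactly as written (SP in P, S in C) without assuming they refer to the same stream.

```python
# Layout as written in the listing above:
# module -> stream -> (feature extractor, number of FC layers).
LAYOUT = {
    "P": {"H": ("res-block", 2), "O": ("res-block", 2), "SP": ("conv", 2)},
    "C": {"H": ("res-block", 4), "O": ("res-block", 4), "S": ("conv", 2)},
}


def fc_depth_differences(layout, first="P", second="C"):
    """Compare FC depth for the streams that both modules have.

    Returns {stream: (fcs_in_first, fcs_in_second)} for every shared stream
    whose number of FC layers differs between the two modules.
    """
    differences = {}
    for stream, (_, fcs_first) in layout[first].items():
        if stream not in layout[second]:
            continue  # e.g. "SP" exists only in P, "S" only in C
        fcs_second = layout[second][stream][1]
        if fcs_first != fcs_second:
            differences[stream] = (fcs_first, fcs_second)
    return differences


if __name__ == "__main__":
    for stream, (p_fcs, c_fcs) in fc_depth_differences(LAYOUT).items():
        print(f"{stream}-stream: {p_fcs} FCs in P, {c_fcs} FCs in C")
```

Read this way, the only difference between the two modules in the shared streams is the FC depth of the H and O streams.
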

DirtyHarryLYL commented 5 years ago

Yeah. Notably, if you use a larger C and train P and C jointly, their convergence may not be synchronous. Usually, C needs more time to reach better performance.