cascat0 opened 6 days ago
I have a question.
In the paper, you mentioned that "Multiple surgical workflow analysis models like OperA [5], SAHC [7], and Trans-SVNet [11] incorporated Transformer layers to TCNs in order to efficiently combine the spatial and temporal features. Nonetheless, their dependence on TCN modeling leads to a loss of finer-grained information, and using temporal-agnostic backbones limits frame embeddings to capture only spatial information."
Why does TCN modeling lose fine-grained information? I'm a little confused about this.
Great work! Looking forward to the open-source code.
November 30th is only 10 days away, hahaha.