Hi, @CRISZJ Thanks for your interest in our work. The released pre-trained model uses the original self-attention layer, i.e., a selective ratio of 1.0, to get the best performance. You can replace the attention layer in model.py with the selective attention layer, whose implementation we give at L288 in model.py (https://github.com/XLechter/SDT/blob/b8fe7ed7e4eb0cb54271baee510f1d9d833dbfe0/models/model.py#L288), but it will reduce the performance depending on the selective ratio you choose.
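For anyone who wants to experiment before digging into the repo, here is a minimal sketch of the general idea behind a ratio-based selective attention layer (this is not the exact implementation at L288 of model.py; the class name, module structure, and masking strategy below are assumptions): each query keeps only its top-k attention scores, where k is set by the selective ratio, and a ratio of 1.0 reduces to ordinary self-attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveAttention(nn.Module):
    """Sketch of a selective self-attention layer.

    For each query, only the top-k attention scores are kept
    (k = selective_ratio * number_of_keys); the remaining scores
    are masked out before the softmax. With selective_ratio = 1.0
    this is plain self-attention.
    """
    def __init__(self, dim, selective_ratio=0.5):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.selective_ratio = selective_ratio
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (batch, num_points, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale  # (B, N, N)

        n = scores.shape[-1]
        top_k = max(1, int(self.selective_ratio * n))
        if top_k < n:
            # Keep only the top-k scores per query; push the rest to -inf
            # so they vanish after the softmax.
            kth_score = scores.topk(top_k, dim=-1).values[..., -1:]
            scores = scores.masked_fill(scores < kth_score, float('-inf'))

        attn = F.softmax(scores, dim=-1)
        return self.out_proj(torch.matmul(attn, v))

# Example usage on a random point-feature tensor:
# layer = SelectiveAttention(dim=256, selective_ratio=0.5)
# out = layer(torch.randn(2, 2048, 256))  # (2, 2048, 256)
```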
Okay, thanks for your reply. I have another question after reading your paper. In Section 4.2 you write, 'Adding a position encoding layer can significantly boost the performance for finding long-range relations'. Could you please tell me the performance gap between PE and no PE? :-)
@CRISZJ It seems I didn't give an ablation study on the PE in the paper. It's about 0.2 CD if I remember correctly, which may not be that 'significant' 🤣
OK, thanks for your reply again. Got it.
Hello, thanks for open-sourcing such excellent work. When I read your code, I found that it does not seem to use the Selective Attention Mechanism; instead, it uses Cross-Attention. Am I misunderstanding something?