We provide two implementations of pyramidal attention: the naive version and the TVM version. The naive version does not reduce the time and space complexity. Because the TVM version may require users to compile TVM themselves, we set use_tvm=False by default to make it easier to reproduce our results.
If you want to use the TVM implementation without compiling TVM yourself, set use_tvm=True and make sure that (1) the operating system is Ubuntu and (2) the CUDA version is 11.1. Otherwise, you can compile TVM 0.8.0 by following the official guide: https://tvm.apache.org/docs/.
If compiling TVM is too much trouble, you can instead use a prebuilt TVM Docker image from https://tvm.apache.org/docs/install/docker.html#docker-source. Then delete the files under 'pyraformer/lib' and run the code again.
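To make the flag's effect concrete, here is a minimal illustrative sketch (not the actual code in this repository; the class and argument names are simplified) of how a use_tvm flag typically selects between the naive path and a compiled TVM kernel:

```python
import torch
import torch.nn as nn

class AttentionDispatchSketch(nn.Module):
    """Illustrative only: how a use_tvm flag switches between the two paths."""

    def __init__(self, d_model, n_head, use_tvm=False):
        super().__init__()
        self.use_tvm = use_tvm
        # Naive path: plain multi-head attention. The pyramidal structure is
        # imposed later via an attention mask, so complexity stays O(L^2).
        self.naive_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)

    def forward(self, x, attn_mask=None):
        if self.use_tvm:
            # TVM path: would call the compiled pyramidal-attention kernel
            # (omitted here because it requires the compiled TVM library).
            raise NotImplementedError("compile TVM 0.8.0 to enable this path")
        # attn_mask follows PyTorch's convention: True marks positions that
        # may NOT be attended to.
        out, _ = self.naive_attn(x, x, x, attn_mask=attn_mask)
        return out

x = torch.randn(2, 16, 32)                             # (batch, length, d_model)
layer = AttentionDispatchSketch(d_model=32, n_head=4)  # use_tvm=False by default
print(layer(x).shape)                                  # torch.Size([2, 16, 32])
```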
Thanks for answering. I just wonder whether the naive version still performs pyramidal attention.
If use_tvm=False, the MultiHeadAttention in SubLayers.py is used as the self-attention module, but it seems that MultiHeadAttention is just vanilla attention.
The naive implementation realizes pyramidal attention by applying an attention mask to the attention score matrix. The 'MultiHeadAttention' module is indeed vanilla attention; the difference lies in the 'Encoder' module. Please refer to lines 19-22 and 51-54 in pyraformer/Pyraformer_LR.py.
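To show what "applying an attention mask" means, here is a minimal illustrative sketch (not the repository's code; the neighbour rule and shapes are simplified): vanilla attention becomes pyramidal attention once every pair of positions not connected in the pyramidal graph is set to -inf before the softmax.

```python
import torch

def masked_attention(q, k, v, mask):
    """q, k, v: (batch, length, d_k); mask: (length, length) bool,
    True where attention is allowed."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))  # forbid masked pairs
    return torch.softmax(scores, dim=-1) @ v

def toy_intra_scale_mask(length, window=3):
    """Toy mask: each position attends only to positions within `window`
    steps at the same scale (inter-scale links are omitted for brevity)."""
    idx = torch.arange(length)
    return (idx[None, :] - idx[:, None]).abs() <= window

q = k = v = torch.randn(2, 16, 8)                    # (batch, length, d_k)
out = masked_attention(q, k, v, toy_intra_scale_mask(16))
print(out.shape)                                     # torch.Size([2, 16, 8])
```

Note that the full L x L score matrix is still computed before masking, which is why the naive version does not reduce time or space complexity.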
In long_range_main.py the use_tvm argument defaults to False, and the sample scripts do not set it. But if this argument is False, it seems that pyramidal attention is not used anywhere in the model, even though it is the main contribution of the paper.
So should this argument be set to True when I want to use pyramidal attention to reduce the computation cost?