Remove most element-wise iterations over CUDA tensors, which caused a lot of redundant host-device I/O.
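Below is a minimal sketch of the kind of change this refers to, not the project's actual code; the tensor names, shapes, and the masking step are illustrative assumptions.

```python
import torch

# Hypothetical example: masking out finished beams in a scores tensor.
# Reading or writing a CUDA tensor one element at a time forces a tiny
# device sync / transfer per access, so the loop is dominated by I/O.
scores = torch.randn(64, 50000, device="cuda")        # assumed (beams, vocab)
finished = torch.zeros(64, dtype=torch.bool, device="cuda")
finished[::2] = True

# Slow: element-wise iteration over CUDA tensors
# for i in range(scores.size(0)):
#     if finished[i]:               # each read syncs with the GPU
#         scores[i, :] = -1e9

# Fast: one vectorized masked write, a single kernel launch
scores.masked_fill_(finished.unsqueeze(1), -1e9)
```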
The forbid-ngram module itself is hard to optimize further because of its highly constrained implementation; instead, I cast the GPU tensors to Python lists so that retrieving values by index is much faster.
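A rough sketch of that work-around is below, assuming a standard no-repeat-trigram blocking step; the variable names, shapes, and n-gram size are assumptions, not the project's actual implementation.

```python
import torch

# Hypothetical sketch: copy the generated token ids to host memory once
# per decoding step, then do all per-index lookups on plain Python lists
# instead of indexing a CUDA tensor element by element.
generated = torch.randint(0, 50000, (64, 20), device="cuda")  # assumed (beams, cur_len)

tokens = generated.tolist()          # one device-to-host copy per step
banned = []
for hyp in tokens:                   # cheap Python-level indexing from here on
    ngrams = {}
    for i in range(len(hyp) - 2):
        prefix = tuple(hyp[i:i + 2])
        ngrams.setdefault(prefix, set()).add(hyp[i + 2])
    # tokens banned for this beam at the next step
    banned.append(ngrams.get(tuple(hyp[-2:]), set()))
```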
I checked the decoding output of the optimized code; it is identical to that of the previous version.
With the default inference settings, decoding used to take 24 hours; the optimized implementation takes only 2 hours.
TODO:
Re-implement the forbid-ngram module, which could yield an additional 5-10x speedup. It is unlikely that I will work on this further.