Remove most element-wise iterations over CUDA tensors, which caused a lot of redundant host-device I/O.
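Below is a minimal sketch of the kind of change this refers to, not the project's actual code; the tensor names, shapes, and the masking step are illustrative assumptions.

```python
import torch

# Hypothetical example: masking out finished beams in a scores tensor.
# Reading or writing a CUDA tensor one element at a time forces a tiny
# device sync / transfer per access, so the loop is dominated by I/O.
scores = torch.randn(64, 50000, device="cuda")        # assumed (beams, vocab)
finished = torch.zeros(64, dtype=torch.bool, device="cuda")
finished[::2] = True

# Slow: element-wise iteration over CUDA tensors
# for i in range(scores.size(0)):
#     if finished[i]:               # each read syncs with the GPU
#         scores[i, :] = -1e9

# Fast: one vectorized masked write, a single kernel launch
scores.masked_fill_(finished.unsqueeze(1), -1e9)
```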
The forbid-ngram module itself is hard to optimize further because of its highly constrained implementation; instead, I cast the GPU tensors to Python lists so that retrieving values by index is much faster.
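A rough sketch of that work-around is below, assuming a standard no-repeat-trigram blocking step; the variable names, shapes, and n-gram size are assumptions, not the project's actual implementation.

```python
import torch

# Hypothetical sketch: copy the generated token ids to host memory once
# per decoding step, then do all per-index lookups on plain Python lists
# instead of indexing a CUDA tensor element by element.
generated = torch.randint(0, 50000, (64, 20), device="cuda")  # assumed (beams, cur_len)

tokens = generated.tolist()          # one device-to-host copy per step
banned = []
for hyp in tokens:                   # cheap Python-level indexing from here on
    ngrams = {}
    for i in range(len(hyp) - 2):
        prefix = tuple(hyp[i:i + 2])
        ngrams.setdefault(prefix, set()).add(hyp[i + 2])
    # tokens banned for this beam at the next step
    banned.append(ngrams.get(tuple(hyp[-2:]), set()))
```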
I checked the decoding output of the optimized code; it is identical to that of the previous version.
With the default inference settings, decoding used to take 24 hours; the optimized implementation takes only 2 hours.
TODO:
Re-implement the forbid-ngram module, which could yield an additional 5-10x speedup. It is unlikely that I will work on this further.