NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

18.11 dev transformer encoder padding #258

Closed vsuthichai closed 5 years ago

vsuthichai commented 5 years ago

Allowing the configuration to disable the removal and re-adding of padding within the Transformer encoder's feed-forward network provides a decent decrease in time per step under the following configuration:

- 1x Volta V100 16 GB GPU
- TF 1.10
- mixed precision training
- TransformerDataLayer (I have not done thorough testing with ParallelTextDataLayer under this new configuration)
- 3584-token batch size
- pad-to-eight enabled
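For context, here is a hypothetical config excerpt showing where such a flag could live; only `remove_padding` is the option this PR adds, and the surrounding keys are just illustrative:

```python
base_params = {
    "batch_size_per_gpu": 3584,   # tokens per step, matching the setup above
    "dtype": "mixed",             # mixed precision training
    "encoder_params": {
        "encoder_layers": 6,
        "hidden_size": 512,
        # New option from this PR: skip the remove/re-add padding ops
        # in the encoder's feed-forward network.
        "remove_padding": False,
    },
}
```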

The average time per step has dropped from 253 ms to 218 ms. I believe Tensor2Tensor also allows this option. Fairseq does not remove and re-add padding at all; there is no option. I speculate that removing and re-adding padding is a TPU optimization, where batch sizes can be as large as 56k tokens; in those cases there can be a lot of padding tokens. Two dense layers are sandwiched between the removal and re-adding of tokens, so skipping the padded rows would yield some efficiency gain on a TPU. However, on a Volta V100 16 GB, using TransformerDataLayer with almost 90% of tokens being non-padded, I see no gains; in fact the average time per step is even slower, presumably because the gather/scatter overhead outweighs the roughly 10% of dense-layer work it saves. This optimization should help in distributed multi-node Horovod settings as well.
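To make the sandwich concrete, here is a minimal TF 1.x sketch of the remove -> dense -> dense -> re-add pattern; the function name, the `1e-9` threshold, and the exact tensor plumbing are illustrative, not the code from `ffn_layer.py` (that is linked below):

```python
import tensorflow as tf

def ffn_without_pad_tokens(x, padding, hidden_size, filter_size):
    """Sketch: drop padded positions, run the FFN, scatter results back.

    x:       [batch, length, hidden_size] activations
    padding: [batch, length] floats, 1.0 at padded positions, 0.0 elsewhere
    """
    batch_size = tf.shape(x)[0]
    length = tf.shape(x)[1]

    # Remove padding: flatten and keep only the non-padded rows.
    pad_mask = tf.reshape(padding, [-1])
    nonpad_ids = tf.cast(tf.where(pad_mask < 1e-9), tf.int32)  # [num_nonpad, 1]
    x = tf.reshape(x, [-1, hidden_size])
    x = tf.gather_nd(x, nonpad_ids)        # [num_nonpad, hidden_size]
    x = tf.expand_dims(x, axis=0)          # dense layers expect a batch dim

    # The two dense layers sandwiched between removal and re-adding.
    x = tf.layers.dense(x, filter_size, activation=tf.nn.relu)
    x = tf.layers.dense(x, hidden_size)

    # Re-add padding: scatter the non-padded rows back into a zero tensor.
    x = tf.squeeze(x, axis=0)
    x = tf.scatter_nd(nonpad_ids, x, shape=[batch_size * length, hidden_size])
    return tf.reshape(x, [batch_size, length, hidden_size])
```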

With remove_padding = False, the code below is never executed, which avoids building out the part of the graph that removes and re-adds padding.

https://github.com/NVIDIA/OpenSeq2Seq/blob/792b8020f51bd206cae14db3d45ce0fbff81f7c9/open_seq2seq/parts/transformer/ffn_layer.py#L44-L57

https://github.com/NVIDIA/OpenSeq2Seq/blob/792b8020f51bd206cae14db3d45ce0fbff81f7c9/open_seq2seq/parts/transformer/ffn_layer.py#L64-L72
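One plausible way to picture the gating (the helper below is hypothetical, not the actual change): since the layer only builds the remove/re-add ops when it is handed a padding tensor, the config flag can simply decide whether padding is passed in at all.

```python
def apply_ffn(ffn_layer, x, inputs_padding, params):
    # Hypothetical wiring: with "remove_padding" set to False, the FFN is
    # given padding=None, so the gather/scatter subgraph is never built.
    if params.get("remove_padding", True):
        return ffn_layer(x, padding=inputs_padding)
    return ffn_layer(x, padding=None)
```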

I have left the "remove_padding" option set to "True" by default, since the default data layer is ParallelTextDataLayer and it removes padding by default. I'm wondering, though: why isn't the TransformerDataLayer used? I've found it to be better for throughput (tokens/second).
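For reference, swapping the data layer is a one-line config change; I believe the import path below matches the source tree, but treat the excerpt as a sketch:

```python
from open_seq2seq.data.text2text.text2text import TransformerDataLayer

base_params = {
    # Default configs use ParallelTextDataLayer; in my tests this layer
    # gave better throughput (tokens/second).
    "data_layer": TransformerDataLayer,
}
```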