Open · yzy5630 opened this issue 2 years ago:

Thanks for the nice work! When I use transformer_wmt_en_de_big_t2t in fairseq for a translation task, the total number of parameters reported in the log is 391401472, whereas for LightSeq's ls_transformer_wmt_en_de_big_t2t it is 319721472. The dict size is 70000 and embeddings are not shared. So what is the difference between the two models? Is the structure of transformer_wmt_en_de_big_t2t and ls_transformer_wmt_en_de_big_t2t really the same?

We haven't dived too deeply into this. When implementing LightSeq, we tried to use as few parameters and as few shared intermediate variables as possible, and observed a noticeable reduction in GPU memory. I think this tool can help us find out the reason; my guess is that fairseq's implementation is not optimal in terms of parameter count.

@yzy5630 To my understanding, LightSeq does not support learnable positional embeddings in the current version. That may cause the difference.
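For reference, here is a quick back-of-the-envelope check on the two numbers quoted above. It is only arithmetic on the reported logs; the architecture constants (6 encoder and 6 decoder layers, embed dim 1024, FFN dim 4096, biased projections, pre-norm LayerNorms) are the usual fairseq defaults for the big_t2t variant and are assumed here rather than confirmed in this thread.

```python
# Rough parameter count for a transformer_wmt_en_de_big_t2t-style model with
# dict size 70000 and untied embeddings, using assumed fairseq defaults:
# 6+6 layers, embed dim 1024, FFN dim 4096, biases on all projections,
# pre-norm (one extra final LayerNorm per stack).

V, D, FFN, LAYERS = 70_000, 1024, 4096, 6

def layer_norm():
    return 2 * D                            # gain + bias

def attention():
    return 4 * (D * D + D)                  # q, k, v, out projections with bias

def ffn():
    return (D * FFN + FFN) + (FFN * D + D)  # fc1 + fc2 with biases

enc_layer = attention() + ffn() + 2 * layer_norm()
dec_layer = 2 * attention() + ffn() + 3 * layer_norm()  # self-attn + cross-attn

embeddings = 3 * V * D  # encoder embed, decoder embed, output projection (untied)
total = LAYERS * (enc_layer + dec_layer) + embeddings + 2 * layer_norm()

print(total)                      # 391401472 -> reproduces the fairseq log exactly
print(391_401_472 - 319_721_472)  # 71680000  -> the gap between the two logs
print(V * D)                      # 71680000  -> exactly one vocab x embed_dim matrix
# For comparison, learned positional embeddings (about 1026 positions per side)
# would only add roughly 2 * 1026 * 1024, i.e. about 2.1M parameters.
```

Under these assumptions the fairseq total is reproduced exactly, and the 71680000-parameter gap equals exactly one vocab-by-embed-dim matrix, which is far larger than the roughly 2M parameters that learned positional embeddings would contribute. So the missing parameters look more like one of the three untied embedding/output-projection tables being counted or shared differently, though that is only a guess from the numbers, not something verified against the LightSeq code.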