Open · yzy5630 opened this issue 2 years ago:

Thanks for the nice work! When I use transformer_wmt_en_de_big_t2t in fairseq for a translation task, the total number of parameters reported in the log is 391401472, whereas for LightSeq's ls_transformer_wmt_en_de_big_t2t it is 319721472. The dict size is 70000 and embeddings are not shared. So what is the difference between the two models? Is the structure of transformer_wmt_en_de_big_t2t and ls_transformer_wmt_en_de_big_t2t really the same?

We haven't dived too deeply into this. When implementing LightSeq, we tried to use as few parameters and as few shared intermediate variables as possible, and observed a noticeable reduction in GPU memory. I think this tool can help us find out the reason; my guess is that fairseq's implementation is not optimal in terms of parameter count.

@yzy5630 To my understanding, LightSeq does not support learnable positional embeddings in the current version. That may cause the difference.
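For reference, here is a quick back-of-the-envelope check on the two numbers quoted above. It is only arithmetic on the reported logs; the architecture constants (6 encoder and 6 decoder layers, embed dim 1024, FFN dim 4096, biased projections, pre-norm LayerNorms) are the usual fairseq defaults for the big_t2t variant and are assumed here rather than confirmed in this thread.

```python
# Rough parameter count for a transformer_wmt_en_de_big_t2t-style model with
# dict size 70000 and untied embeddings, using assumed fairseq defaults:
# 6+6 layers, embed dim 1024, FFN dim 4096, biases on all projections,
# pre-norm (one extra final LayerNorm per stack).

V, D, FFN, LAYERS = 70_000, 1024, 4096, 6

def layer_norm():
    return 2 * D                            # gain + bias

def attention():
    return 4 * (D * D + D)                  # q, k, v, out projections with bias

def ffn():
    return (D * FFN + FFN) + (FFN * D + D)  # fc1 + fc2 with biases

enc_layer = attention() + ffn() + 2 * layer_norm()
dec_layer = 2 * attention() + ffn() + 3 * layer_norm()  # self-attn + cross-attn

embeddings = 3 * V * D  # encoder embed, decoder embed, output projection (untied)
total = LAYERS * (enc_layer + dec_layer) + embeddings + 2 * layer_norm()

print(total)                      # 391401472 -> reproduces the fairseq log exactly
print(391_401_472 - 319_721_472)  # 71680000  -> the gap between the two logs
print(V * D)                      # 71680000  -> exactly one vocab x embed_dim matrix
# For comparison, learned positional embeddings (about 1026 positions per side)
# would only add roughly 2 * 1026 * 1024, i.e. about 2.1M parameters.
```

Under these assumptions the fairseq total is reproduced exactly, and the 71680000-parameter gap equals exactly one vocab-by-embed-dim matrix, which is far larger than the roughly 2M parameters that learned positional embeddings would contribute. So the missing parameters look more like one of the three untied embedding/output-projection tables being counted or shared differently, though that is only a guess from the numbers, not something verified against the LightSeq code.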