chenyangh / DSLP

Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation
MIT License

No glat_sd arch #12

Open bbo0924 opened 2 years ago

bbo0924 commented 2 years ago

Hi Chenyang, thanks for your great code! I'm trying to reproduce the GLAT+DSLP model. I checked the training scripts you provided, but there is no "--arch glat_sd" model registered in the code. Should it be "nat_sd_glat"? BTW, what do "ss" and "sd" mean? Does "sd" mean "supervised deeply"? How about "ss"? Thanks for your answer!

chenyangh commented 2 years ago

Hello @bbo0924 .

Yes, you are right. It should be nat_sd_glat. Sorry for the mistake, I will fix it. Thanks.
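
For anyone hitting the same error, a quick way to double-check which architecture names are actually registered is to query fairseq's `ARCH_MODEL_REGISTRY`. This is just a sketch, assuming the fork registers its models inside the fairseq package; if they live in a `--user-dir` plugin instead, import that module first:

```python
# Sketch: list the registered --arch names before launching training.
# ARCH_MODEL_REGISTRY is fairseq's mapping from --arch names to model
# classes; importing fairseq.models populates it.
from fairseq.models import ARCH_MODEL_REGISTRY

for name in sorted(ARCH_MODEL_REGISTRY):
    if "glat" in name or "sd" in name:
        print(name)  # expect nat_sd_glat here, not glat_sd
```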

chenyangh commented 2 years ago

The names ss and sd come from development, and I should have changed them after writing the paper. ss means scheduled sampling, where I mix the ground-truth tokens with predicted tokens. The s is my notation for layer-wise prediction, though I don't really remember why I used s. d means deep supervision.
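
In case it clarifies the ss idea, here is a minimal sketch of that mixing step, assuming token-id tensors; the function and variable names are illustrative, not the repo's actual API:

```python
import torch

def mix_gt_and_pred(pred_tokens, gt_tokens, mix_ratio=0.5):
    """Scheduled-sampling-style mixing: replace a random subset of the
    predicted tokens with ground-truth tokens before they are fed to
    the next decoder layer."""
    # Boolean mask: True where the ground-truth token is kept.
    keep_gt = torch.rand(gt_tokens.shape, device=gt_tokens.device) < mix_ratio
    return torch.where(keep_gt, gt_tokens, pred_tokens)

# Toy usage with batch size 1 and four positions.
pred = torch.tensor([[5, 7, 9, 2]])
gold = torch.tensor([[5, 8, 9, 3]])
print(mix_gt_and_pred(pred, gold, mix_ratio=0.5))
```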