Open bbo0924 opened 2 years ago
Hello @bbo0924 .
Yes, you are right. It should be nat_sd_glat
. Sorry for the mistake, I will fix it. Thanks.
The meaning of ss
and sd
was used for development, which I should have changed after writing the paper.
So ss
means schedule sampling, where I mix the ground truth tokens with predicted tokens. The s
is a notation for layer-wise prediction, but I don't really remember why I used s
. d
means deep supervision.
Hi Chengyang, thanks for your great code! I'm trying to reproduce the GLAT+DSLP model, I checked your given training scripts, but I found there is no "--arch glat_sd" registered model in the code, is it should be "nat_sd_glat"? BTW, what's the meaning of "ss" and "sd"? Does "sd" mean supervised deeply? how about "ss" Thank for your answer!!