Open dearchill opened 3 years ago
Testing this, I found that BLEU drops dramatically compared with the subword-bpe VOLT vocab. Is there a better way to use the sentencepiece VOLT vocab? Did the vocabulary restriction work? Looking forward to your reply.
Hi, after I used the sentencepiece method to generate the vocabulary, I found that in the vocabulary-generation step the real VOLT vocab is not actually used; instead, only the optimal size is used to train the spm model. I'm curious about this. The VOLT vocab and the spm vocab are quite different. Maybe you did this because spm_encoder needs an spm model but the VOLT method doesn't produce one? Does this hurt the BLEU?
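For what it's worth, one way to apply a learned vocabulary directly at encoding time, without retraining an spm model from just the optimal size, is to restrict segmentation to that vocabulary (sentencepiece's `spm_encode` has `--vocabulary` / `--vocabulary_threshold` flags for this kind of restriction). Below is a minimal, hypothetical sketch of the idea using greedy longest-match segmentation against a fixed vocab; the function and variable names are illustrative, not part of VOLT or sentencepiece:

```python
# Hypothetical sketch: segment a word by greedy longest match against a
# fixed subword vocabulary (e.g. one produced by VOLT), rather than
# training a fresh spm model that may diverge from that vocabulary.

def segment(word, vocab, unk="<unk>"):
    """Greedy longest-match segmentation restricted to `vocab`."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest substring starting at position i that is in vocab.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No piece in the vocabulary covers this character.
            pieces.append(unk)
            i += 1
    return pieces

if __name__ == "__main__":
    volt_vocab = {"trans", "lation", "la", "tion"}
    print(segment("translation", volt_vocab))  # ['trans', 'lation']
```

This is only a sketch of "vocabulary restriction" as a concept; real spm encoding uses a unigram/BPE model rather than pure longest match, which may itself be one source of the vocab mismatch discussed here.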
Hi, we also observe a dramatic performance drop with sentencepiece compared to subword-nmt vocabularies on machine translation tasks. I guess this problem may be attributed to the usage gap between sentencepiece and subword-nmt.
All our results are obtained with subword-nmt. Although we observe this gap, we still decided to support VOLT on sentencepiece for usability. We hope that tasks where sentencepiece-style vocabularies perform better can benefit from VOLT too.
We would like to figure out which factors contribute to the different results between sentencepiece and subword-nmt. If we have any new findings, we will post them in this thread as soon as possible.
Thanks for your question!
Best, Jingjing
Thanks for your reply, and looking forward to your findings!