Open dearchill opened 3 years ago
Testing this, I found that BLEU drops dramatically compared with the subword-bpe VOLT vocab. Is there a better way to use the sentencepiece VOLT vocab? Did the vocabulary restriction work? Looking forward to your reply.
Hi, after I used the sentencepiece method to generate the vocabulary, I found that in the vocabulary-generation step the real VOLT vocab is not actually used; instead, only the optimal size is used to train the spm model. I'm curious about this. The VOLT vocab and the spm vocab are quite different. Maybe you did this because spm_encoder needs an spm model but the VOLT method doesn't produce one? Does this hurt the BLEU?
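For what it's worth, one way to apply a learned vocabulary directly at encoding time, without retraining an spm model from just the optimal size, is to restrict segmentation to that vocabulary (sentencepiece's `spm_encode` has `--vocabulary` / `--vocabulary_threshold` flags for this kind of restriction). Below is a minimal, hypothetical sketch of the idea using greedy longest-match segmentation against a fixed vocab; the function and variable names are illustrative, not part of VOLT or sentencepiece:

```python
# Hypothetical sketch: segment a word by greedy longest match against a
# fixed subword vocabulary (e.g. one produced by VOLT), rather than
# training a fresh spm model that may diverge from that vocabulary.

def segment(word, vocab, unk="<unk>"):
    """Greedy longest-match segmentation restricted to `vocab`."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest substring starting at position i that is in vocab.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No piece in the vocabulary covers this character.
            pieces.append(unk)
            i += 1
    return pieces

if __name__ == "__main__":
    volt_vocab = {"trans", "lation", "la", "tion"}
    print(segment("translation", volt_vocab))  # ['trans', 'lation']
```

This is only a sketch of "vocabulary restriction" as a concept; real spm encoding uses a unigram/BPE model rather than pure longest match, which may itself be one source of the vocab mismatch discussed here.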
Hi, we also observe a dramatic performance drop with sentencepiece compared to subword-nmt vocabularies on machine translation tasks. I guess this problem may be attributed to the usage gap between sentencepiece and subword-nmt.
All our results are obtained with subword-nmt. Although we observe this gap, we still decided to support VOLT on sentencepiece for usability. We hope that tasks where sentencepiece-style vocabularies perform better can benefit from VOLT too.
We would like to figure out which factors contribute to the different results between sentencepiece and subword-nmt. If we have any new findings, we will post them in this thread as soon as possible.
Thanks for your question!
Best, Jingjing
Thanks for your reply, and looking forward to your findings!