Generating vmap for en->many model

santha96 commented 2 years ago

Hi, vmap is useful for reducing inference time significantly. I was able to generate a vmap for a many-to-one model and it works fine. How does vmap work for one-to-many models?

guillaumekln commented 2 years ago

Hi,

It will work similarly, but the list of candidates for a given source sentence will include tokens/words from multiple languages.

santha96 commented 2 years ago

Hi, I generated the vmap using the command below:

    python build-vmap.py -pt phrase-table -ms 3 -mf 2 -km 20 -tv target_vocabulary -zg zg_list > vmap

Enabling the vmap for one-to-many directions in CTranslate2 leads to a BLEU score drop of 2-3 points per language. Also, when I looked inside the generated vmap, the source tokens that directly follow the supervision language tag capture more meanings in the corresponding language, thanks to the presence of the tag. Other source tokens, which are farther from the language tag, either capture meanings from only a few languages or seem to have insufficient coverage because of the many languages. Will increasing the keep meaning (-km) parameter help, or is there a better way to do it? Could you please suggest one?

guillaumekln commented 2 years ago

Indeed, the current approach may not work well for one-to-many data. I can't think of a parameter that can fully resolve your issue. It looks like a solution would be to have one vmap per target language? The inference code could then select the appropriate vmap based on the language token.
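To make the idea concrete, here is a minimal sketch of selecting a vmap by language token. It is not CTranslate2 code and uses no CTranslate2 API. It assumes each vmap line has the form "source n-gram<TAB>candidate tokens separated by spaces", that the language token looks like `<2de>`, and that n-grams up to length 3 are looked up (matching `-ms 3`). The function names `parse_vmap`, `select_vmap`, and `candidate_vocabulary` are hypothetical.

```python
from typing import Dict, Iterable, List, Set

Vmap = Dict[str, Set[str]]


def parse_vmap(lines: Iterable[str]) -> Vmap:
    """Parse vmap lines, assumed to be 'source n-gram<TAB>cand1 cand2 ...'."""
    table: Vmap = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        source, _, candidates = line.partition("\t")
        table.setdefault(source, set()).update(candidates.split())
    return table


def select_vmap(tokens: List[str], vmaps: Dict[str, Vmap]) -> Vmap:
    """Return the vmap whose key matches the first language token found."""
    for token in tokens:
        if token in vmaps:
            return vmaps[token]
    raise ValueError("no known language token in the source sentence")


def candidate_vocabulary(tokens: List[str], vmap: Vmap, max_ngram: int = 3) -> Set[str]:
    """Collect target candidates for every source n-gram up to max_ngram."""
    candidates: Set[str] = set()
    for n in range(1, max_ngram + 1):
        for start in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[start:start + n])
            candidates |= vmap.get(ngram, set())
    return candidates


# Toy data: one vmap per target language, keyed by a hypothetical language token.
vmaps = {
    "<2fr>": parse_vmap(["hello\tbonjour salut", "world\tmonde"]),
    "<2de>": parse_vmap(["hello\thallo", "world\twelt"]),
}

source = ["<2de>", "hello", "world"]
restricted = candidate_vocabulary(source, select_vmap(source, vmaps))
print(sorted(restricted))  # only German candidates: ['hallo', 'welt']
```

With one table per language, the candidate list for a sentence no longer mixes tokens from the other target languages, which is the coverage problem described above. Each per-language vmap would have to be built from the phrase table of that language pair only.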

santha96 commented 2 years ago

Thanks, @guillaumekln. Do we have such support in CTranslate2?

guillaumekln commented 2 years ago

No, this logic is not implemented. It is only an idea.

OpenNMT / papers, issue #5