DeNederlandscheBank / nqm

A Transformer-based Machine for answering questions on insurance companies
MIT License

Subwords/BPE models have unk in the inputs #49

Closed jm-glowienke closed 3 years ago

jm-glowienke commented 3 years ago

The BPE-split data contains unk not only for whole words (e.g. "operating") but also for subwords. Moreover, the dicts for 4614 and 30387 in model_input are not the correct ones. An earlier setup worked better for both known and unknown words.

TODO:

jm-glowienke commented 3 years ago
This is likely due to training with the larger IWSLT dictionary, which produces encodings that are not known at test time; the actual dict should be built to fit the train set.
jm-glowienke commented 3 years ago

Fixed by removing the IWSLT dictionary, which is useless since embeddings will only be known for words in the train_val set.
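
The failure mode described above can be sketched as follows. This is a minimal toy illustration, not the project's actual tokenization code: the token strings, the `@@` subword marker, and the helper names are hypothetical. It shows how a dictionary built from a mismatched corpus (standing in for the IWSLT dict) maps subwords to unk, while a dictionary fitted to the train set covers them.

```python
def build_vocab(corpus):
    """Map each subword seen in the corpus to an integer id; id 0 is <unk>."""
    vocab = {"<unk>": 0}
    for sentence in corpus:
        for tok in sentence.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Replace subwords missing from the dict with <unk>."""
    return [tok if tok in vocab else "<unk>" for tok in sentence.split()]

train = ["oper@@ ating profit rose"]                 # hypothetical BPE-split train data
mismatched = build_vocab(["unrelated iwslt text"])   # stands in for the IWSLT dict
fitted = build_vocab(train)                          # dict fitted to the train set

print(encode("oper@@ ating profit", mismatched))  # every subword becomes <unk>
print(encode("oper@@ ating profit", fitted))      # all subwords known
```

With the mismatched dict, even ordinary subwords are lost to unk at test time, which matches the symptom reported above; fitting the dict to the train/val set resolves it.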