DeNederlandscheBank / nqm

A Transformer-based Machine for answering questions on insurance companies
MIT License

Subwords/BPE models have unk in the inputs #49

Closed jm-glowienke closed 3 years ago

jm-glowienke commented 3 years ago

The BPE-split data contains unk not only for whole words (e.g. "operating") but also for subwords. Moreover, the dicts for 4614 and 30387 in model_input are not the correct ones. An earlier setup worked better for both known and unknown words.

TODO:

jm-glowienke commented 3 years ago
This is likely due to training with the larger IWSLT dictionary, which produces encodings that are not known at test time; the actual dict should be built to fit the train set.
jm-glowienke commented 3 years ago

Fixed by removing the IWSLT dictionary, which is useless since embeddings will only be known for words in the train_val set.
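
The failure mode described above can be sketched as follows. This is a minimal toy illustration, not the project's actual tokenization code: the token strings, the `@@` subword marker, and the helper names are hypothetical. It shows how a dictionary built from a mismatched corpus (standing in for the IWSLT dict) maps subwords to unk, while a dictionary fitted to the train set covers them.

```python
def build_vocab(corpus):
    """Map each subword seen in the corpus to an integer id; id 0 is <unk>."""
    vocab = {"<unk>": 0}
    for sentence in corpus:
        for tok in sentence.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Replace subwords missing from the dict with <unk>."""
    return [tok if tok in vocab else "<unk>" for tok in sentence.split()]

train = ["oper@@ ating profit rose"]                 # hypothetical BPE-split train data
mismatched = build_vocab(["unrelated iwslt text"])   # stands in for the IWSLT dict
fitted = build_vocab(train)                          # dict fitted to the train set

print(encode("oper@@ ating profit", mismatched))  # every subword becomes <unk>
print(encode("oper@@ ating profit", fitted))      # all subwords known
```

With the mismatched dict, even ordinary subwords are lost to unk at test time, which matches the symptom reported above; fitting the dict to the train/val set resolves it.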