bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

UNK words in the prediction output #45

Closed · aimanmutasem closed this 4 years ago

aimanmutasem commented 4 years ago

Dear all,

I have used BPEmb encoding to avoid `<unk>` words, but there are still some `<unk>` tokens in the results.

```python
from torchtext.data import Field  # torchtext.legacy.data in newer torchtext versions
from bpemb import BPEmb

bpemb_en = BPEmb(lang="en", vs=50000)

SRC = Field(tokenize=bpemb_en.encode, init_token='<sos>', eos_token='<eos>',
            lower=True, batch_first=True, fix_length=100)

TRG = Field(tokenize=bpemb_en.encode, init_token='<sos>', eos_token='<eos>',
            lower=True, batch_first=True, fix_length=100)
```
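
For reference, `bpemb_en.encode` is the tokenizer here; on the reference sentence it produces the subword tokens shown below:

```python
>>> bpemb_en.encode("apart from that there is no recommendation as to what to wear .")
['▁apart', '▁from', '▁that', '▁there', '▁is', '▁no', '▁recommendation',
 '▁as', '▁to', '▁what', '▁to', '▁wear', '▁.']
```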

trg = ['▁apart', '▁from', '▁that', '▁there', '▁is', '▁no', '▁recommendation', '▁as', '▁to', '▁what', '▁to', '▁wear', '▁.']

predicted trg = ['▁apart', '▁from', '▁that', '▁there', '▁is', '▁no', '<unk>', '▁as', '▁to', '▁wear', '▁to', '▁wear', '▁.', '<eos>']

Am I applying the BPEmb encoding correctly to prevent `<unk>` words?

bheinzerling commented 4 years ago

It's difficult to tell what's going on here. I'm not sure what Field is and what the difference between trg and predicted trg is. Can you provide a minimal working example showing your input, how exactly you call bpemb_en.encode and what output you get?

aimanmutasem commented 4 years ago

Dear @bheinzerling, thank you for your support.

I have to load the vectors of a pre-trained model, like this:

```python
from torchtext.vocab import Vectors

SRC.build_vocab(train_data,
                vectors=Vectors('wiki.en.vec', url=url),  # vectors must be a keyword argument
                unk_init=torch.Tensor.normal_,
                min_freq=2)
```

Do you know how I can load a pre-trained model in the same way when using BPEmb encoding?
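
I'm imagining something like this minimal sketch, assuming `bpemb_en.words` and `bpemb_en.vectors` expose BPEmb's subword list and embedding matrix (it wraps a gensim KeyedVectors), and using torchtext's `Vocab.set_vectors`:

```python
import torch

# build the vocab over the BPE-tokenized training data first
SRC.build_vocab(train_data, min_freq=2)

# copy the pre-trained BPEmb subword vectors into the torchtext vocab;
# `words`/`vectors` are assumed attributes of the BPEmb object, and any
# vocab entry not found in BPEmb is initialized by unk_init instead
stoi = {w: i for i, w in enumerate(bpemb_en.words)}
vectors = torch.tensor(bpemb_en.vectors)
SRC.vocab.set_vectors(stoi, vectors, dim=bpemb_en.dim,
                      unk_init=torch.Tensor.normal_)
```

The embeddings would then end up in `SRC.vocab.vectors`, ready to be copied into the model's `nn.Embedding` layer.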