Hello Urszula,
I encountered the same issue while using custom BytePairEmbeddings and found some insights into it; see below.
In https://github.com/flairNLP/flair/blob/master/flair/embeddings/token.py, at l. 1745, self.embedder.embed(word.lower()) returns an empty list for some tokens, which then raises the IndexError.
The likely reason for this is the normalization rule of the underlying sentencepiece model used for subword tokenization: nmt_nfkc instead of nfkc. The nmt_nfkc scheme deletes some "whitespace/invalid" characters while tokenizing.
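To see that this deletion is specific to the nmt_ variant, here is a small background sketch using Python's standard unicodedata module, which implements plain NFKC (this is illustration only, not flair or sentencepiece code):
>>> import unicodedata
>>> unicodedata.normalize("NFKC", "\n")  # plain NFKC keeps control characters such as "\n"
'\n'
>>> unicodedata.normalize("NFKC", "a\u00a0b")  # though it does map e.g. no-break space to a plain space
'a b'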
More details can be found at https://github.com/google/sentencepiece/blob/master/doc/normalization.md. These two different schemes therefore give different results for some tokens:
# This embeds a BPEmb model whose underlying sentencepiece tokenization uses nmt_nfkc normalization
>>> bpe_custom = BytePairEmbeddings(model_file_path='sentencepiece.model', embedding_file_path='embeddings.bin')
>>> bpe_custom.embedder.spm.encode("�", out_type=str)
[]
>>> bpe_custom.embedder.embed("�")
array([], shape=(0, 50), dtype=float32)
>>> bpe_custom.embedder.spm.encode("\n", out_type=str)
[]
>>> bpe_custom.embedder.embed("\n")
array([], shape=(0, 50), dtype=float32)
VS
# This embeds a BPEmb model whose underlying sentencepiece tokenization uses nfkc normalization
>>> bpe_fr = BytePairEmbeddings('fr')
>>> bpe_fr.embedder.spm.encode("�", out_type=str)
['▁', '�']
>>> bpe_fr.embedder.embed("�")
array([[ 0.863682, 0.623915, -0.255492, 1.228884, -0.246349, -0.235584,
0.924933, 1.468551, -1.046001, -0.313229, 0.924974, -0.26374 ,
-0.215517, 0.310154, -0.281002, 0.127435, 0.297852, -1.035336,
0.656995, 0.740548, 0.324117, 0.571423, -0.735685, 0.262373,
0.174549, -0.070397, -0.137978, 0.774121, -0.859513, 0.846455,
-0.30908 , -0.048569, 0.431066, 0.530602, 0.025365, 0.018068,
-0.215856, 0.038948, -0.724266, 0.74875 , 0.269831, -0.273661,
0.426436, 0.597654, 0.568705, -0.111608, -0.125169, 0.067656,
0.385495, 0.18757 ],
[ 0.979594, 0.57784 , -0.222435, 1.486768, -0.380972, -0.35193 ,
0.901553, 2.116044, -1.18345 , -0.272132, 0.808096, -0.297339,
-0.288387, 0.523385, -0.516331, 0.409378, -0.363651, -0.650074,
0.860095, 0.524136, 0.130684, 0.801779, -0.371839, 0.486923,
-0.213825, 0.155632, 0.054518, 1.182699, -0.681333, 0.921612,
-0.430549, -0.413449, 0.555705, 0.517503, 0.166901, 0.01226 ,
-0.426171, 0.016401, -1.095436, 0.761773, 0.123491, -0.225711,
0.342072, 0.871307, 0.517205, -0.289836, -0.101698, -0.039496,
0.589295, 0.276277]], dtype=float32)
>>> bpe_fr.embedder.spm.encode("\n", out_type=str)
['▁', '\n']
>>> bpe_fr.embedder.embed("\n")
array([[ 0.863682, 0.623915, -0.255492, 1.228884, -0.246349, -0.235584,
0.924933, 1.468551, -1.046001, -0.313229, 0.924974, -0.26374 ,
-0.215517, 0.310154, -0.281002, 0.127435, 0.297852, -1.035336,
0.656995, 0.740548, 0.324117, 0.571423, -0.735685, 0.262373,
0.174549, -0.070397, -0.137978, 0.774121, -0.859513, 0.846455,
-0.30908 , -0.048569, 0.431066, 0.530602, 0.025365, 0.018068,
-0.215856, 0.038948, -0.724266, 0.74875 , 0.269831, -0.273661,
0.426436, 0.597654, 0.568705, -0.111608, -0.125169, 0.067656,
0.385495, 0.18757 ],
[ 0.979594, 0.57784 , -0.222435, 1.486768, -0.380972, -0.35193 ,
0.901553, 2.116044, -1.18345 , -0.272132, 0.808096, -0.297339,
-0.288387, 0.523385, -0.516331, 0.409378, -0.363651, -0.650074,
0.860095, 0.524136, 0.130684, 0.801779, -0.371839, 0.486923,
-0.213825, 0.155632, 0.054518, 1.182699, -0.681333, 0.921612,
-0.430549, -0.413449, 0.555705, 0.517503, 0.166901, 0.01226 ,
-0.426171, 0.016401, -1.095436, 0.761773, 0.123491, -0.225711,
0.342072, 0.871307, 0.517205, -0.289836, -0.101698, -0.039496,
0.589295, 0.276277]], dtype=float32)
As you can see, bpe_custom.embedder.embed can return an empty embedding array.
I haven't tested the behavior with other characters and tokens.
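If you want to check which other characters are affected, here is a quick hedged sketch (reusing the bpe_custom instance from above; the candidate list is only illustrative):
>>> for ch in ["\n", "\t", "\r", "\u200b", "�"]:
...     if bpe_custom.embedder.spm.encode(ch, out_type=str) == []:
...         print(repr(ch), "encodes to [] and would trigger the IndexError")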
To set the embeddings to zero for these tokens, you can replace:
if word.strip() == "":
with
if word.strip() == "" or self.embedder.encode(word) == []:
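For context, here is a hedged sketch of how that guard fits into the surrounding logic of flair/embeddings/token.py (structure approximated from the flair source, not a verbatim patch):
if word.strip() == "" or self.embedder.encode(word) == []:
    # empty or un-encodable token: use a zero vector so that the
    # embeddings[0] / embeddings[-1] lookups below cannot raise IndexError
    token.set_embedding(self.name, torch.zeros(self.embedding_length, dtype=torch.float))
else:
    embeddings = self.embedder.embed(word.lower())
    embedding = np.concatenate((embeddings[0], embeddings[len(embeddings) - 1]))
    token.set_embedding(self.name, torch.tensor(embedding, dtype=torch.float))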
Thank you @elliotbart, I will check it out!
Describe the bug
While running the sequence tagger with stacked embeddings (BytePairEmbeddings and Flair embeddings), an error occurs (an IndexError; see the discussion above).
To Reproduce
Run the sequence tagger trainer with stacked Flair embeddings and custom BytePairEmbeddings, as sketched below.
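A minimal hedged reproduction (the corpus, the model paths, and the French Flair models are placeholders, not the exact script from this issue):
from flair.embeddings import BytePairEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# corpus is assumed to be a flair Corpus loaded elsewhere
stacked = StackedEmbeddings([
    BytePairEmbeddings(model_file_path='sentencepiece.model', embedding_file_path='embeddings.bin'),
    FlairEmbeddings('fr-forward'),
    FlairEmbeddings('fr-backward'),
])
tagger = SequenceTagger(hidden_size=256,
                        embeddings=stacked,
                        tag_dictionary=corpus.make_tag_dictionary(tag_type='ner'),
                        tag_type='ner')
ModelTrainer(tagger, corpus).train('resources/taggers/example', max_epochs=1)
# -> IndexError once a token whose pieces are all deleted by nmt_nfkc is embedded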
Expected behavior
Training the sequence tagger completes without error.
Additional context
The error is raised at the embedding step:
# embed words in sentence
embedding.embed(sentence)