Hello Urszula,
I encountered the same issue while using custom BytePairEmbeddings and found some insights into it; see below.
In https://github.com/flairNLP/flair/blob/master/flair/embeddings/token.py, at l. 1745, self.embedder.embed(word.lower()) returns an empty list for some tokens, which then raises the IndexError.
The likely reason for this is the normalization rule of the underlying sentencepiece model used for subword tokenization: nmt_nfkc instead of nfkc. The nmt_nfkc scheme deletes some "whitespace/invalid" characters while tokenizing.
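To see that this deletion is specific to the nmt_ variant, here is a small background sketch using Python's standard unicodedata module, which implements plain NFKC (this is illustration only, not flair or sentencepiece code):
>>> import unicodedata
>>> unicodedata.normalize("NFKC", "\n")  # plain NFKC keeps control characters such as "\n"
'\n'
>>> unicodedata.normalize("NFKC", "a\u00a0b")  # though it does map e.g. no-break space to a plain space
'a b'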
More details can be found at https://github.com/google/sentencepiece/blob/master/doc/normalization.md. These two different schemes therefore give different results for some tokens:
# This embeds a BPEmb model whose underlying sentencepiece tokenization uses nmt_nfkc normalization
>>> bpe_custom = BytePairEmbeddings(model_file_path='sentencepiece.model', embedding_file_path='embeddings.bin')
>>> bpe_custom.embedder.spm.encode("�", out_type=str)
[]
>>> bpe_custom.embedder.embed("�")
array([], shape=(0, 50), dtype=float32)
>>> bpe_custom.embedder.spm.encode("\n", out_type=str)
[]
>>> bpe_custom.embedder.embed("\n")
array([], shape=(0, 50), dtype=float32)
VS
# This embeds a BPEmb model whose underlying sentencepiece tokenization uses nfkc normalization
>>> bpe_fr = BytePairEmbeddings('fr')
>>> bpe_fr.embedder.spm.encode("�", out_type=str)
['▁', '�']
>>> bpe_fr.embedder.embed("�")
array([[ 0.863682, 0.623915, -0.255492, 1.228884, -0.246349, -0.235584,
0.924933, 1.468551, -1.046001, -0.313229, 0.924974, -0.26374 ,
-0.215517, 0.310154, -0.281002, 0.127435, 0.297852, -1.035336,
0.656995, 0.740548, 0.324117, 0.571423, -0.735685, 0.262373,
0.174549, -0.070397, -0.137978, 0.774121, -0.859513, 0.846455,
-0.30908 , -0.048569, 0.431066, 0.530602, 0.025365, 0.018068,
-0.215856, 0.038948, -0.724266, 0.74875 , 0.269831, -0.273661,
0.426436, 0.597654, 0.568705, -0.111608, -0.125169, 0.067656,
0.385495, 0.18757 ],
[ 0.979594, 0.57784 , -0.222435, 1.486768, -0.380972, -0.35193 ,
0.901553, 2.116044, -1.18345 , -0.272132, 0.808096, -0.297339,
-0.288387, 0.523385, -0.516331, 0.409378, -0.363651, -0.650074,
0.860095, 0.524136, 0.130684, 0.801779, -0.371839, 0.486923,
-0.213825, 0.155632, 0.054518, 1.182699, -0.681333, 0.921612,
-0.430549, -0.413449, 0.555705, 0.517503, 0.166901, 0.01226 ,
-0.426171, 0.016401, -1.095436, 0.761773, 0.123491, -0.225711,
0.342072, 0.871307, 0.517205, -0.289836, -0.101698, -0.039496,
0.589295, 0.276277]], dtype=float32)
>>> bpe_fr.embedder.spm.encode("\n", out_type=str)
['▁', '\n']
>>> bpe_fr.embedder.embed("\n")
array([[ 0.863682, 0.623915, -0.255492, 1.228884, -0.246349, -0.235584,
0.924933, 1.468551, -1.046001, -0.313229, 0.924974, -0.26374 ,
-0.215517, 0.310154, -0.281002, 0.127435, 0.297852, -1.035336,
0.656995, 0.740548, 0.324117, 0.571423, -0.735685, 0.262373,
0.174549, -0.070397, -0.137978, 0.774121, -0.859513, 0.846455,
-0.30908 , -0.048569, 0.431066, 0.530602, 0.025365, 0.018068,
-0.215856, 0.038948, -0.724266, 0.74875 , 0.269831, -0.273661,
0.426436, 0.597654, 0.568705, -0.111608, -0.125169, 0.067656,
0.385495, 0.18757 ],
[ 0.979594, 0.57784 , -0.222435, 1.486768, -0.380972, -0.35193 ,
0.901553, 2.116044, -1.18345 , -0.272132, 0.808096, -0.297339,
-0.288387, 0.523385, -0.516331, 0.409378, -0.363651, -0.650074,
0.860095, 0.524136, 0.130684, 0.801779, -0.371839, 0.486923,
-0.213825, 0.155632, 0.054518, 1.182699, -0.681333, 0.921612,
-0.430549, -0.413449, 0.555705, 0.517503, 0.166901, 0.01226 ,
-0.426171, 0.016401, -1.095436, 0.761773, 0.123491, -0.225711,
0.342072, 0.871307, 0.517205, -0.289836, -0.101698, -0.039496,
0.589295, 0.276277]], dtype=float32)
As you can see, bpe_custom.embedder.embed can return an empty embedding array.
I haven't tested the behavior with other characters and tokens.
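If you want to check which other characters are affected, here is a quick hedged sketch (reusing the bpe_custom instance from above; the candidate list is only illustrative):
>>> for ch in ["\n", "\t", "\r", "\u200b", "�"]:
...     if bpe_custom.embedder.spm.encode(ch, out_type=str) == []:
...         print(repr(ch), "encodes to [] and would trigger the IndexError")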
To set the embeddings to zero for these tokens, you can replace:
if word.strip() == "":
with
if word.strip() == "" or self.embedder.encode(word) == []:
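For context, here is a hedged sketch of how that guard fits into the surrounding logic of flair/embeddings/token.py (structure approximated from the flair source, not a verbatim patch):
if word.strip() == "" or self.embedder.encode(word) == []:
    # empty or un-encodable token: use a zero vector so that the
    # embeddings[0] / embeddings[-1] lookups below cannot raise IndexError
    token.set_embedding(self.name, torch.zeros(self.embedding_length, dtype=torch.float))
else:
    embeddings = self.embedder.embed(word.lower())
    embedding = np.concatenate((embeddings[0], embeddings[len(embeddings) - 1]))
    token.set_embedding(self.name, torch.tensor(embedding, dtype=torch.float))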
Thank you @elliotbart, I will check it out!
Describe the bug
While running the sequence tagger with stacked embeddings (BytePairEmbeddings and Flair embeddings), an error occurs (an IndexError; see the discussion above).
To Reproduce
Run the sequence tagger trainer with stacked Flair embeddings and custom BytePairEmbeddings, as sketched below.
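A minimal hedged reproduction (the corpus, the model paths, and the French Flair models are placeholders, not the exact script from this issue):
from flair.embeddings import BytePairEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# corpus is assumed to be a flair Corpus loaded elsewhere
stacked = StackedEmbeddings([
    BytePairEmbeddings(model_file_path='sentencepiece.model', embedding_file_path='embeddings.bin'),
    FlairEmbeddings('fr-forward'),
    FlairEmbeddings('fr-backward'),
])
tagger = SequenceTagger(hidden_size=256,
                        embeddings=stacked,
                        tag_dictionary=corpus.make_tag_dictionary(tag_type='ner'),
                        tag_type='ner')
ModelTrainer(tagger, corpus).train('resources/taggers/example', max_epochs=1)
# -> IndexError once a token whose pieces are all deleted by nmt_nfkc is embedded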
Expected behavior
Training the sequence tagger completes without error.
Additional context
The error is raised at the embedding step:
# embed words in sentence
embedding.embed(sentence)