flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

BPEmbeddings with sequence tagger #1898

Closed UrszulaCzerwinska closed 3 years ago

UrszulaCzerwinska commented 3 years ago

Describe the bug While running the sequence tagger with stacked embeddings (BytePairEmbeddings and FlairEmbeddings), the following error occurs:

>>>CUDA_VISIBLE_DEVICES=1 python train_seq_tagger.py

PyTorch version 1.3.1 available.
TensorFlow version 2.0.0 available.
2020-10-07 11:58:39,499 Reading data from data
2020-10-07 11:58:39,499 Train: data/train.txt
2020-10-07 11:58:39,499 Dev: data/dev.txt
2020-10-07 11:58:39,500 Test: data/test.txt
2020-10-07 12:00:20,735 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,736 Model: "SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.5, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 1024)
        (decoder): Linear(in_features=1024, out_features=275, bias=True)
      )
    )
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.5, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 1024)
        (decoder): Linear(in_features=1024, out_features=275, bias=True)
      )
    )
    (list_embedding_2): BytePairEmbeddings(model=2-bpe-custom-100000-200)
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=2448, out_features=2448, bias=True)
  (rnn): LSTM(2448, 256, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=512, out_features=47, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None
)"
2020-10-07 12:00:20,736 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,736 Corpus: "Corpus: 183145 train + 24944 dev + 21721 test sentences"
2020-10-07 12:00:20,736 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,737 Parameters:
2020-10-07 12:00:20,737  - learning_rate: "0.1"
2020-10-07 12:00:20,737  - mini_batch_size: "32"
2020-10-07 12:00:20,737  - patience: "2"
2020-10-07 12:00:20,737  - anneal_factor: "0.5"
2020-10-07 12:00:20,737  - max_epochs: "100"
2020-10-07 12:00:20,737  - shuffle: "True"
2020-10-07 12:00:20,737  - train_with_dev: "False"
2020-10-07 12:00:20,737  - batch_growth_annealing: "False"
2020-10-07 12:00:20,737 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,737 Model training base path: "outputs/models/bpemb_flair"
2020-10-07 12:00:20,737 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,738 Device: cuda:0
2020-10-07 12:00:20,738 ----------------------------------------------------------------------------------------------------
2020-10-07 12:00:20,738 Embeddings storage mode: gpu
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
2020-10-07 12:00:20,744 ----------------------------------------------------------------------------------------------------
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
2020-10-07 12:05:33,604 epoch 1 - iter 572/5724 - loss 3.94088415 - samples/sec: 58.52 - lr: 0.100000
Traceback (most recent call last):
  File "train_seq_tagger.py", line 104, in <module>
    trainer.train(**params["train"])
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/trainers/trainer.py", line 371, in train
    loss = self.model.forward_loss(batch_step)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py", line 603, in forward_loss
    features = self.forward(data_points)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py", line 608, in forward
    self.embeddings.embed(sentences)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/embeddings/token.py", line 71, in embed
    embedding.embed(sentences)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/embeddings/base.py", line 60, in embed
    self._add_embeddings_internal(sentences)
  File "/home/ccass/anaconda3/lib/python3.7/site-packages/flair/embeddings/token.py", line 1580, in _add_embeddings_internal
    (embeddings[0], embeddings[len(embeddings) - 1])
IndexError: index 0 is out of bounds for axis 0 with size 0
ERROR: CUDA_VISIBLE_DEVICES=1 python train_seq_tagger.py, exited with 1

To Reproduce Run the sequence tagger trainer with stacked FlairEmbeddings and custom BytePairEmbeddings.
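For reference, a minimal sketch of the kind of training script that triggers this (not the actual train_seq_tagger.py; paths, column format and tag type are placeholders):

# Minimal sketch; placeholder paths, column format and tag type.
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import BytePairEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

columns = {0: "text", 1: "ner"}  # placeholder column format
corpus: Corpus = ColumnCorpus(
    "data", columns,
    train_file="train.txt", dev_file="dev.txt", test_file="test.txt",
)
tag_type = "ner"  # placeholder
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

embeddings = StackedEmbeddings([
    FlairEmbeddings("path/to/forward.pt"),   # custom forward LM (placeholder path)
    FlairEmbeddings("path/to/backward.pt"),  # custom backward LM (placeholder path)
    # custom BPEmb model (sentencepiece trained with nmt_nfkc normalization)
    BytePairEmbeddings(
        model_file_path="sentencepiece.model",
        embedding_file_path="embeddings.bin",
    ),
])

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type=tag_type,
    use_crf=True,
)

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "outputs/models/bpemb_flair",
    learning_rate=0.1,
    mini_batch_size=32,
    max_epochs=100,
    patience=2,
    anneal_factor=0.5,
    embeddings_storage_mode="gpu",
    use_amp=True,  # the log above shows apex O1 initialization
)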

Expected behavior The sequence tagger trains without errors.

Environment (please complete the following information):

Additional context

# embed words in sentence
embedding.embed(sentence)


* The code for the sequence tagger was previously tested with other embeddings (fastText) and worked correctly.
* Applying the same code with BytePairEmbeddings(language='fr') does not throw any error.

What could be wrong with the custom-trained BytePairEmbeddings?
elliotbart commented 3 years ago

Hello Urszula,

I encountered the same issue while using custom BytePairEmbeddings and found some insights about it; see below.

Bug

https://github.com/flairNLP/flair/blob/master/flair/embeddings/token.py

l. 1745: for some tokens, self.embedder.embed(word.lower()) returns an empty array, which then raises the IndexError.
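Roughly, the surrounding code does the following (paraphrased from the traceback and the fix location below, so variable names and line numbers may differ slightly from the exact source):

# Paraphrased shape of BytePairEmbeddings._add_embeddings_internal
word = token.text
if word.strip() == "":
    # empty strings already fall back to a zero vector
    token.set_embedding(self.name, torch.zeros(self.embedding_length, dtype=torch.float))
else:
    embeddings = self.embedder.embed(word.lower())
    # fails here when `embeddings` has shape (0, dim), i.e. sentencepiece
    # normalization removed the whole token and no subwords were produced
    embedding = np.concatenate((embeddings[0], embeddings[len(embeddings) - 1]))
    token.set_embedding(self.name, torch.tensor(embedding, dtype=torch.float))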

Additional context

The likely reason is the normalization rule of the underlying sentencepiece model used for subword tokenization: the custom model was trained with nmt_nfkc normalization, while the pretrained model uses plain nfkc.
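To verify which normalization rule a given sentencepiece model was trained with, you can inspect its serialized model proto. This is only a sketch, assuming your sentencepiece installation ships the sentencepiece_model_pb2 bindings (otherwise they can be generated from sentencepiece_model.proto):

# Inspect the normalizer spec of a sentencepiece model (sketch)
from sentencepiece import sentencepiece_model_pb2 as sp_model

proto = sp_model.ModelProto()
with open("sentencepiece.model", "rb") as f:  # placeholder path
    proto.ParseFromString(f.read())
print(proto.normalizer_spec.name)  # e.g. "nmt_nfkc" or "nfkc"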

These two different schemes therefore give different results for some tokens:

# This embeds a BPEmb model whose underlying sentencepiece tokenization uses nmt_nfkc normalization
>>> bpe_custom = BytePairEmbeddings(model_file_path='sentencepiece.model', embedding_file_path='embeddings.bin')
>>> bpe_custom.embedder.spm.encode("�", out_type=str)
[]
>>> bpe_custom.embedder.embed("�")
array([], shape=(0, 50), dtype=float32)
>>> bpe_custom.embedder.spm.encode("\n", out_type=str)
[]
>>> bpe_custom.embedder.embed("\n")
array([], shape=(0, 50), dtype=float32)

VS

# This embeds a BPEmb model whose underlying sentencepiece tokenization uses nfkc normalization
>>> bpe_fr = BytePairEmbeddings('fr')
>>> bpe_fr.embedder.spm.encode("�", out_type=str)
['▁', '�']
>>> bpe_fr.embedder.embed("�")
array([[ 0.863682,  0.623915, -0.255492,  1.228884, -0.246349, -0.235584,
         0.924933,  1.468551, -1.046001, -0.313229,  0.924974, -0.26374 ,
        -0.215517,  0.310154, -0.281002,  0.127435,  0.297852, -1.035336,
         0.656995,  0.740548,  0.324117,  0.571423, -0.735685,  0.262373,
         0.174549, -0.070397, -0.137978,  0.774121, -0.859513,  0.846455,
        -0.30908 , -0.048569,  0.431066,  0.530602,  0.025365,  0.018068,
        -0.215856,  0.038948, -0.724266,  0.74875 ,  0.269831, -0.273661,
         0.426436,  0.597654,  0.568705, -0.111608, -0.125169,  0.067656,
         0.385495,  0.18757 ],
       [ 0.979594,  0.57784 , -0.222435,  1.486768, -0.380972, -0.35193 ,
         0.901553,  2.116044, -1.18345 , -0.272132,  0.808096, -0.297339,
        -0.288387,  0.523385, -0.516331,  0.409378, -0.363651, -0.650074,
         0.860095,  0.524136,  0.130684,  0.801779, -0.371839,  0.486923,
        -0.213825,  0.155632,  0.054518,  1.182699, -0.681333,  0.921612,
        -0.430549, -0.413449,  0.555705,  0.517503,  0.166901,  0.01226 ,
        -0.426171,  0.016401, -1.095436,  0.761773,  0.123491, -0.225711,
         0.342072,  0.871307,  0.517205, -0.289836, -0.101698, -0.039496,
         0.589295,  0.276277]], dtype=float32)
>>> bpe_fr.embedder.spm.encode("\n", out_type=str)
['▁', '\n']
>>> bpe_fr.embedder.embed("\n")
array([[ 0.863682,  0.623915, -0.255492,  1.228884, -0.246349, -0.235584,
         0.924933,  1.468551, -1.046001, -0.313229,  0.924974, -0.26374 ,
        -0.215517,  0.310154, -0.281002,  0.127435,  0.297852, -1.035336,
         0.656995,  0.740548,  0.324117,  0.571423, -0.735685,  0.262373,
         0.174549, -0.070397, -0.137978,  0.774121, -0.859513,  0.846455,
        -0.30908 , -0.048569,  0.431066,  0.530602,  0.025365,  0.018068,
        -0.215856,  0.038948, -0.724266,  0.74875 ,  0.269831, -0.273661,
         0.426436,  0.597654,  0.568705, -0.111608, -0.125169,  0.067656,
         0.385495,  0.18757 ],
       [ 0.979594,  0.57784 , -0.222435,  1.486768, -0.380972, -0.35193 ,
         0.901553,  2.116044, -1.18345 , -0.272132,  0.808096, -0.297339,
        -0.288387,  0.523385, -0.516331,  0.409378, -0.363651, -0.650074,
         0.860095,  0.524136,  0.130684,  0.801779, -0.371839,  0.486923,
        -0.213825,  0.155632,  0.054518,  1.182699, -0.681333,  0.921612,
        -0.430549, -0.413449,  0.555705,  0.517503,  0.166901,  0.01226 ,
        -0.426171,  0.016401, -1.095436,  0.761773,  0.123491, -0.225711,
         0.342072,  0.871307,  0.517205, -0.289836, -0.101698, -0.039496,
         0.589295,  0.276277]], dtype=float32)

As you can see, bpe_custom.embedder.embed can return an empty embedding array for the custom model.

I haven't tested the behavior with other characters and tokens.
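If you want to find out in advance which tokens in a corpus would trigger this, a quick scan along these lines should work (a sketch using the corpus and bpe_custom objects from above; find_unembeddable_tokens is a hypothetical helper, not part of flair):

from itertools import chain

def find_unembeddable_tokens(bpe_embeddings, corpus):
    """Collect token texts whose lowercased form encodes to zero subword
    pieces, i.e. the tokens that would crash _add_embeddings_internal."""
    bad = set()
    for sentence in chain(corpus.train, corpus.dev, corpus.test):
        for token in sentence:
            word = token.text
            if word.strip() != "" and len(bpe_embeddings.embedder.encode(word.lower())) == 0:
                bad.add(word)
    return bad

# e.g.:
# bad_tokens = find_unembeddable_tokens(bpe_custom, corpus)
# print(len(bad_tokens), sorted(bad_tokens)[:20])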

Temporary fix:

At l. 1738, to set the embeddings to zero for these tokens, you can replace

if word.strip() == "":

with

if word.strip() == "" or self.embedder.encode(word) == []:
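If you prefer not to edit the installed flair package, a similar effect can be obtained by wrapping the embedder at runtime. This is only a sketch under the assumption noted in the comments (flair concatenates the first and last subword vectors, so the per-subword dimension is taken as embedding_length // 2):

import numpy as np

def zero_fallback_for_empty_encodings(bpe_embeddings):
    """Monkey-patch the BPEmb embedder so that tokens which encode to no
    subword pieces get a single zero vector instead of an empty array."""
    original_embed = bpe_embeddings.embedder.embed
    # assumption: flair concatenates first and last subword vectors,
    # so the per-subword dimension is embedding_length // 2
    dim = bpe_embeddings.embedding_length // 2

    def safe_embed(text):
        vectors = original_embed(text)
        if len(vectors) == 0:
            return np.zeros((1, dim), dtype=np.float32)
        return vectors

    bpe_embeddings.embedder.embed = safe_embed

# usage: zero_fallback_for_empty_encodings(bpe_custom) before training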

Environment

UrszulaCzerwinska commented 3 years ago

Thank you @elliotbart, I will check it out!

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.