flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

[Question]: Using TransformerWordEmbeddings with SequenceTagger #3161

Closed larsbun closed 1 year ago

larsbun commented 1 year ago

Question

With a setup such as this:

from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings(
    model='distilbert-base-uncased',
    layers='-1',
    subtoken_pooling='first',
    fine_tune=True,
    use_context=False)
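
(As a quick sanity check, not in my original script, this configuration yields one 768-dimensional vector per token when embedding a single sentence:)

from flair.data import Sentence

# embed one sentence and inspect the per-token vector size
sentence = Sentence('The grass is green .')
embeddings.embed(sentence)
print(sentence.tokens[0].embedding.shape)  # torch.Size([768]) for distilbert-base-uncased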

and a tagger like this:

from flair.models import SequenceTagger

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        rnn_type="GRU",
                        rnn_layers=10,
                        tag_dictionary=label_dict,
                        tag_type=label_type,
                        use_crf=False)

trained like this:

trainer.train(experiment_root,
              main_evaluation_metric=("micro avg", "f1-score"),
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              #num_workers=16,
              use_tensorboard=True,
              #train_with_dev=True,
              tensorboard_log_dir=experiment_root + '/log')
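
For context, these snippets rely on a corpus, a label dictionary and a ModelTrainer that I haven't shown; roughly, the missing glue looks like this (a sketch with a made-up data folder, column layout and label type, defined before the tagger in the real script):

from flair.datasets import ColumnCorpus
from flair.trainers import ModelTrainer

# hypothetical data folder and column layout; the real corpus setup is not shown above
corpus = ColumnCorpus('data/', {0: 'text', 1: 'label'})
label_type = 'label'
label_dict = corpus.make_label_dictionary(label_type=label_type)

trainer = ModelTrainer(tagger, corpus)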

My script fails like this:

2023-03-23 19:32:24,779 Device: cuda:0
2023-03-23 19:32:24,779 ----------------------------------------------------------------------------------------------------
2023-03-23 19:32:24,779 Embeddings storage mode: cpu
2023-03-23 19:32:24,779 ----------------------------------------------------------------------------------------------------
2023-03-23 19:32:31,152 epoch 1 - iter 21/211 - loss 0.60071482 - time (sec): 6.37 - samples/sec: 1858.43 - lr: 0.100000
2023-03-23 19:32:36,542 epoch 1 - iter 42/211 - loss 0.56385709 - time (sec): 11.76 - samples/sec: 1991.51 - lr: 0.100000
2023-03-23 19:32:42,305 epoch 1 - iter 63/211 - loss 0.54369444 - time (sec): 17.53 - samples/sec: 1991.74 - lr: 0.100000
2023-03-23 19:32:47,552 epoch 1 - iter 84/211 - loss 0.53814725 - time (sec): 22.77 - samples/sec: 2014.78 - lr: 0.100000
2023-03-23 19:32:52,718 epoch 1 - iter 105/211 - loss 0.53346440 - time (sec): 27.94 - samples/sec: 2042.37 - lr: 0.100000
2023-03-23 19:32:58,202 epoch 1 - iter 126/211 - loss 0.52911241 - time (sec): 33.42 - samples/sec: 2059.77 - lr: 0.100000
2023-03-23 19:33:03,823 epoch 1 - iter 147/211 - loss 0.52809545 - time (sec): 39.04 - samples/sec: 2054.40 - lr: 0.100000
Traceback (most recent call last):
  File "/cluster/work/users/larsbun/FlairMalvik/seqseq-multiclass-saga-roberta.py", line 109, in <module>
    trainer.train(experiment_root,
  File "/cluster/home/larsbun/envs/flairP3.10/lib/python3.10/site-packages/flair/trainers/trainer.py", line 541, in train
    loss, datapoint_count = self.model.forward_loss(batch_step)
  File "/cluster/home/larsbun/envs/flairP3.10/lib/python3.10/site-packages/flair/models/sequence_tagger_model.py", line 274, in forward_loss
    sentence_tensor, lengths = self._prepare_tensors(sentences)
  File "/cluster/home/larsbun/envs/flairP3.10/lib/python3.10/site-packages/flair/models/sequence_tagger_model.py", line 287, in _prepare_tensors
    self.embeddings.embed(sentences)
  File "/cluster/home/larsbun/envs/flairP3.10/lib/python3.10/site-packages/flair/embeddings/base.py", line 49, in embed
    self._add_embeddings_internal(data_points)
  File "/cluster/home/larsbun/envs/flairP3.10/lib/python3.10/site-packages/flair/embeddings/transformer.py", line 656, in _add_embeddings_internal
    embeddings = self._forward_tensors(tensors)
  File "/cluster/home/larsbun/envs/flairP3.10/lib/python3.10/site-packages/flair/embeddings/transformer.py", line 1313, in _forward_tensors
    return self.forward(**tensors)
  File "/cluster/home/larsbun/envs/flairP3.10/lib/python3.10/site-packages/flair/embeddings/transformer.py", line 1275, in forward
    all_token_embeddings = fill_masked_elements(
  File "/cluster/home/larsbun/envs/flairP3.10/lib/python3.10/site-packages/torch/jit/_trace.py", line 1136, in wrapper
    return fn(*args, **kwargs)
  File "/cluster/home/larsbun/envs/flairP3.10/lib/python3.10/site-packages/flair/embeddings/transformer.py", line 117, in fill_masked_elements
    all_token_embeddings[i, : lengths[i], :] = insert_missing_embeddings(  # type: ignore
RuntimeError: The expanded size of the tensor (1) must match the existing size (0) at non-singleton dimension 0.  Target sizes: [1, 768].  Tensor sizes: [0, 768]

i.e., partway through the first epoch. I can see that there is a mismatch between the shape of the embeddings that are produced and the shape that is expected, but it is not clear to me how to fix it. Is there something I am misunderstanding conceptually about the embeddings? If I use FlairEmbeddings instead, the same setup runs without problems.

helpmefindaname commented 1 year ago

Hi @larsbun, I suspect there is a specific sentence in your dataset that leads to the problem. You can use the following code to find all such sentences:

corpus = ...      # your corpus
embeddings = ...  # your TransformerWordEmbeddings
invalid_sentences = []
for sentence in corpus.get_all_sentences():
    try:
        embeddings.embed(sentence)
    except Exception:
        invalid_sentences.append(sentence)
print("There are", len(invalid_sentences), "invalid sentences")
if invalid_sentences:
    print(invalid_sentences[0])

Could you please run this code twice to check whether the result is consistent, and if it is, share an example sentence that fails?
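
For reference: this particular RuntimeError typically shows up when a token yields no wordpieces at all, so there is nothing to pool for it. A minimal illustration with the underlying Hugging Face tokenizer (an added sketch, assuming the offending token is an invisible character such as a zero-width space):

from transformers import AutoTokenizer

# an invisible character produces zero wordpieces, which leaves an
# empty [0, 768] slice where a [1, 768] token embedding is expected
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(tokenizer.tokenize("\u200b"))  # -> []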

larsbun commented 1 year ago

Hi,

thanks for the pointer. Indeed, the data was faulty, with higher-order UTF-8 characters causing the failure:

There are 1 invalid sentences
Sentence[1]: "" → [""/c]

This didn't occur to me as the reason, since FlairEmbeddings handled the same data without problems. Anyway, now it's out there and searchable if others run into the same issue.
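
In case it helps others: one way to screen such sentences out before training is to drop every sentence whose token column contains no visible characters. A rough sketch, assuming a tab-separated CoNLL-style file and hypothetical file names:

import unicodedata

def has_visible_text(token: str) -> bool:
    # remove format/control characters (e.g. zero-width spaces) and check what is left
    cleaned = ''.join(ch for ch in token if unicodedata.category(ch) not in ('Cf', 'Cc'))
    return cleaned.strip() != ''

def clean_column_file(in_path: str, out_path: str) -> None:
    # read a column-format file sentence by sentence (blank line = sentence boundary)
    sentences, current = [], []
    with open(in_path, encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                if current:
                    sentences.append(current)
                current = []
            else:
                current.append(line)
    if current:
        sentences.append(current)
    # keep only sentences whose token column is visible on every line
    kept = [s for s in sentences if all(has_visible_text(l.split('\t')[0]) for l in s)]
    with open(out_path, 'w', encoding='utf-8') as f:
        for s in kept:
            f.writelines(s)
            f.write('\n')

clean_column_file('train.txt', 'train_clean.txt')  # hypothetical file names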

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.