flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Error while using BERT transformer embeddings #1793

Closed himanshudce closed 3 years ago

himanshudce commented 4 years ago

Describe the bug
I trained a RoBERTa language model from scratch (using Hugging Face). The trained language model works properly; I checked it on the masked-language task. I then used this model for embeddings in a flair model:

embedding_types: List[TokenEmbeddings] = [

    TransformerWordEmbeddings(model='BERT/sumerianBERTo/', layers='-1,-2,-3,-4', fine_tune=True, batch_size=8, use_scalar_mix=False)

    # contextual string embeddings, forward
    #FlairEmbeddings('FLAIR/resources/taggers/language_model/best-lm.pt'),

    # contextual string embeddings, backward
    #FlairEmbeddings('FLAIR/resources/taggers/language_model/best-lm.pt'),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

and trained the model, but I am getting the warning -

"Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation."

and then the error -

" Traceback (most recent call last): File "FLAIR/flair_POS_trainer.py", line 64, in trainer.train('FLAIR/resources/taggers/flairposbert', File "/home/himanshu/.local/lib/python3.8/site-packages/flair/trainers/trainer.py", line 349, in train loss = self.model.forward_loss(batch_step) File "/home/himanshu/.local/lib/python3.8/site-packages/flair/models/sequence_tagger_model.py", line 599, in forward_loss features = self.forward(data_points) File "/home/himanshu/.local/lib/python3.8/site-packages/flair/models/sequence_tagger_model.py", line 604, in forward self.embeddings.embed(sentences) File "/home/himanshu/.local/lib/python3.8/site-packages/flair/embeddings/token.py", line 71, in embed embedding.embed(sentences) File "/home/himanshu/.local/lib/python3.8/site-packages/flair/embeddings/base.py", line 61, in embed self._add_embeddings_internal(sentences) File "/home/himanshu/.local/lib/python3.8/site-packages/flair/embeddings/token.py", line 897, in _add_embeddings_internal self._add_embeddings_to_sentences(batch) File "/home/himanshu/.local/lib/python3.8/site-packages/flair/embeddings/token.py", line 1009, in _add_embeddings_to_sentences hidden_states = self.model(input_ids, attention_mask=mask)[-1] File "/home/himanshu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, kwargs) File "/home/himanshu/.local/lib/python3.8/site-packages/transformers/modeling_bert.py", line 752, in forward embedding_output = self.embeddings( File "/home/himanshu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, *kwargs) File "/home/himanshu/.local/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 67, in forward return super().forward( File "/home/himanshu/.local/lib/python3.8/site-packages/transformers/modeling_bert.py", line 179, in forward position_embeddings = self.position_embeddings(position_ids) File "/home/himanshu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(input, kwargs) File "/home/himanshu/.local/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 112, in forward return F.embedding( File "/home/himanshu/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 1724, in embedding return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) IndexError: index out of range in self "

To Reproduce

Used the code below to train the POS model:

from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings, CharacterEmbeddings, TransformerWordEmbeddings
from torch.optim.adam import Adam
from typing import List

# define columns
columns = {0: 'text', 1: 'pos'}
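# (ColumnCorpus expects one token per line, with its tag in the second
#  whitespace-separated column and a blank line between sentences, e.g.
#      lugal   N
#      sag     V
#  -- the tokens/tags shown here are illustrative placeholders only)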

# this is the folder in which train, test and dev files reside
data_folder = 'FLAIR/POS_corpus'

# init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt')

print(len(corpus.train))                              
print(corpus.train[0].to_tagged_string('pos'))

# tag to predict
tag_type = 'pos'
# make tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

print(tag_dictionary)

# initialize embeddings
embedding_types: List[TokenEmbeddings] = [

    # Word2vec embeddings
    #WordEmbeddings('FLAIR/word2vec50'),
    #CharacterEmbeddings(),

    TransformerWordEmbeddings(model='BERT/sumerianBERTo/', layers='-1,-2,-3,-4', fine_tune=True, batch_size=8, use_scalar_mix=False)

    # contextual string embeddings, forward
    #FlairEmbeddings('FLAIR/resources/taggers/language_model/best-lm.pt'),

    # contextual string embeddings, backward
    #FlairEmbeddings('FLAIR/resources/taggers/language_model/best-lm.pt'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

# 5. initialize sequence tagger
from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

# 6. initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus, optimizer=Adam)

# 7. start training
trainer.train('FLAIR/resources/taggers/flairposbert',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=30)

Expected behavior
A trained POS model.
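
Once training succeeds, a minimal usage sketch for the resulting tagger (best-model.pt is the filename flair writes to the output folder by default; the example sentence is a placeholder, not actual corpus text):

from flair.data import Sentence
from flair.models import SequenceTagger

# load the model written by trainer.train() above
tagger = SequenceTagger.load('FLAIR/resources/taggers/flairposbert/best-model.pt')

# tag a sentence with the trained POS model
sentence = Sentence('lugal sag')
tagger.predict(sentence)
print(sentence.to_tagged_string())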

Environment (please complete the following information):

alanakbik commented 4 years ago

I think this error is thrown when the tokenizer model does not specify a maximum length. Could you try again with the current master branch version to see if it works better?
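
A possible workaround sketch while testing (not verified against this flair version): upgrade to master, and if the tokenizer still reports no maximum length, cap it at the model's position-embedding size so over-long sentences are truncated instead of overrunning the embedding table. The .tokenizer attribute and the sentinel check below are assumptions about flair/transformers internals.

# pip install --upgrade git+https://github.com/flairNLP/flair.git

from flair.embeddings import TransformerWordEmbeddings

embedding = TransformerWordEmbeddings('BERT/sumerianBERTo/',
                                      layers='-1,-2,-3,-4',
                                      fine_tune=True,
                                      batch_size=8)

# a very large model_max_length is the sentinel for "no predefined maximum";
# capping it lets the tokenizer actually truncate long sentences
if embedding.tokenizer.model_max_length > 10000:
    embedding.tokenizer.model_max_length = 512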

himanshudce commented 4 years ago

Ok I will try, thanks

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.