flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/
Other
13.94k stars 2.1k forks source link

RuntimeError when training a NER model with flairembedding #376

Closed xuxingya closed 5 years ago

xuxingya commented 5 years ago

Describe the bug I am trying to train a NER model using my own corpus with a pretrained flair embedding. But an runtime error occurred. It looks like there is something wrong with the shape of embedding of tokens. sentence_tensor[s_id][:len(sentence)] = torch.cat([token.get_embedding().unsqueeze(0) for token in sentence], 0) To use BERT embedding, I tokenized my sentence with the BertTokenizer, which produced word and non-alphabetical characters. But the flair embedding was trained based on characters. I wonder if this caused this problem. Here is my code:

char_lm_embeddings = FlairEmbeddings('resources/tagngers/language_model/best-lm.pt',use_cache=True)
columns = {0: 'ner', 1: 'text'}
data_folder = 'flair_data/'
# retrieve corpus using column format, data folder and the names of the train, dev and test files
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns,
                                                              train_file='flair.tsv')
# 3. make the tag dictionary from the corpus
tag_type = 'ner'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)
tagger: SequenceTagger = SequenceTagger(hidden_size=128,
                                        embeddings=char_lm_embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=False)
from flair.trainers import ModelTrainer
from flair.training_utils import EvaluationMetric
from flair.optim import AdamW
trainer: ModelTrainer = ModelTrainer(tagger, corpus,
                                     optimizer=AdamW)
# 7. start training
trainer.train('resources/taggers/basic_info_ner',
              learning_rate=0.001,
              mini_batch_size=64,
              max_epochs=3,
              checkpoint=True,
              embeddings_in_memory=False)

image

Environment (please complete the following information):

stefan-it commented 5 years ago

You need to add a backward language model (for FlairEmbeddings) :)

alanakbik commented 5 years ago

Hm strange, the code looks good.

Could you print out some information? I.e. could you do:

print(corpus) 
print(corpus.obtain_statistics())
print(tag_dictionary.idx2item)
print(tagger)

and share each printout?

Also, are you using the current master branch or did you install through pip?

xuxingya commented 5 years ago

Hm strange, the code looks good.

Could you print out some information? I.e. could you do:

print(corpus) 
print(corpus.obtain_statistics())
print(tag_dictionary.idx2item)
print(tagger)

and share each printout?

Also, are you using the current master branch or did you install through pip?

Thank you. I am using the current master branch. Here is result of print out: image Nothing seems wrong with the model itself.

xuxingya commented 5 years ago

You need to add a backward language model (for FlairEmbeddings) :)

Thank you, what do you mean by a backward model? I think the bidirectional LSTM model is backward.

alanakbik commented 5 years ago

Thanks! The printouts look good as well. Maybe something went wrong during the training of the language model.

Could you do this:

model = LanguageModel.load_language_model('resources/tagngers/language_model/best-lm.pt')
text, prob = model.generate_text(temperature=0.4, number_of_characters=300)
print(text)

And paste the text that is generated?

xuxingya commented 5 years ago

Thanks! The printouts look good as well. Maybe something went wrong during the training of the language model.

Could you do this:

model = LanguageModel.load_language_model('resources/tagngers/language_model/best-lm.pt')
text, prob = model.generate_text(temperature=0.4, number_of_characters=300)
print(text)

And paste the text that is generated?

image surprisingly, when I run it today, the problem just disappeared. Anyway it works even I can't understand what has happened. But I soon meet a cuda out of memory problem with batch size 32 on a 12GB memory GPU after numbers of steps. I think it is bit of strange because I can run Pytorch or Tf BERT on the GPU well. It is another problem so I will try to figure it out elsewhere.

alanakbik commented 5 years ago

Good to hear about the previous error but strange that the memory is not enough - 12 GB should be enough.

What is the dictionary size of the LM and how many hidden states does the LM have? Also, what is the longest sentence in your dataset? Out of CUDA memory can happen if your longest sentence is very long.

xuxingya commented 5 years ago

Good to hear about the previous error but strange that the memory is not enough - 12 GB should be enough.

What is the dictionary size of the LM and how many hidden states does the LM have? Also, what is the longest sentence in your dataset? Out of CUDA memory can happen if your longest sentence is very long.

Yes, some of the sentences can be very long. So I need to truncate the long sentence. I see somebody else also issued this out of memory problem, I think the reason may be the same.

alanakbik commented 5 years ago

Ah ok. I think we need to update the implementation of FlairEmbeddings to better deal with long sequences. I'll open a ticket for this!