flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

How and where to add add_unk = True #2777

Closed: fmafelipe closed this issue 1 year ago

fmafelipe commented 2 years ago

Hello, I am training an NER model. Apparently I have a wrong tag in my training corpus, because I get this error during training (screenshot: errorner).

I have already checked the corpus several times and cannot find the bad tag, so I want to do what the error message suggests and set add_unk = True, but I don't know where in the code to add it. The code I am using is the following:

from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
import tensorflow as tf  # required by the tf.device context below

with tf.device('/device:GPU:0'):

  # 1. get the corpus
  columns = {0:'text',1:'ner'}
  data_folder = '/content/drive/MyDrive/corpus de prueba/entrenamiento1'

  corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt')

  print(len(corpus.train))

  # 2. what label do we want to predict?
  label_type = 'ner'

  # 3. make the label dictionary from the corpus
  label_dict = corpus.make_label_dictionary(label_type=label_type)
  print("my corpus dictionary contains the labels: ", label_dict)

  # 4. initialize fine-tuneable transformer embeddings WITH document context
  embeddings = TransformerWordEmbeddings(model='bert-base-multilingual-cased',
                                       layers="-1",
                                       subtoken_pooling="first",
                                       fine_tune=True,
                                       use_context=True,
                                       )

  # 5. initialize bare-bones sequence tagger (no CRF, no RNN, no reprojection)
  tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=label_dict,
                        tag_type='ner',
                        use_crf=False,
                        use_rnn=False,
                        reproject_embeddings=False,
                        )

  # 6. initialize trainer
  trainer = ModelTrainer(tagger, corpus)

  # 7. run fine-tuning
  trainer.fine_tune('resources/taggers/pruebatest',
                  learning_rate=5.0e-3,
                  mini_batch_size=2,
                  max_epochs=5,
                  #mini_batch_chunk_size=1,  # remove this parameter to speed up computation if you have a big GPU
                  )

Thanks

alanakbik commented 2 years ago

@fmafelipe what Flair version are you on?

fmafelipe commented 2 years ago

@alanakbik I am using Colab, which shows that version "flair-0.11.2" is installed. However, TransformerWordEmbeddings raised an error with it, so I had to pin the installation to transformers==4.18.0.

alanakbik commented 2 years ago

We are on Flair 0.11.3 now. Could you test with this version? If the error persists, could you supply a minimal working example with data to reproduce it?


AlphaBit95 commented 1 year ago

@alanakbik

I'm still facing this issue on version 0.11.3.

Logs:
2023-01-28 21:58:59,809 The string 'B-' is not in dictionary! Dictionary contains only: ['O', 'S-INNOVATION', 'B-INNOVATION', 'E-INNOVATION', 'I-INNOVATION', 'S-UTILIZATION', 'B-UTILIZATION', 'E-UTILIZATION', 'I-UTILIZATION', 'S-MATERIAL', 'B-MATERIAL', 'E-MATERIAL', 'I-MATERIAL', '<START>', '<STOP>']
2023-01-28 21:58:59,810 You can create a Dictionary that handles unknown items with an <unk>-key by setting add_unk = True in the construction.
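
The string 'B-' in the log suggests that at least one token in the corpus carries a tag prefix with no entity label after it. Below is a minimal sketch for locating such lines. It assumes a whitespace-separated column file with the token in column 0 and the tag in column 1 (matching columns = {0:'text',1:'ner'} in the original post) and tags that are either O or a B/I/E/S prefix followed by a label. The name find_bad_tags and the sample lines are hypothetical and not part of Flair.

```python
import re
from typing import Iterable, List, Tuple

# Assumed tag shape: "O", or a B/I/E/S prefix, a hyphen, and a non-empty label.
VALID_TAG = re.compile(r"^(O|[BIES]-\S+)$")


def find_bad_tags(lines: Iterable[str], tag_column: int = 1) -> List[Tuple[int, str]]:
    """Return (line_number, line) for every token line whose tag looks malformed."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank lines separate sentences in column files
        fields = line.split()
        # A missing tag column or a bare prefix such as "B-" is reported.
        if len(fields) <= tag_column or not VALID_TAG.match(fields[tag_column]):
            bad.append((number, line))
    return bad


# Hypothetical sample data standing in for train.txt.
sample = ["Graphene B-MATERIAL\n", "oxide B-\n", "\n", "is O\n", "cheap\n"]
for number, line in find_bad_tags(sample):
    print(f"line {number}: {line!r}")
```

To check a real file, the same function could be called on an open file handle, for example find_bad_tags(open('train.txt', encoding='utf-8')), and repeated for the dev and test files.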

I also tried adding the argument explicitly as follows:

tag_dictionary = corpus.make_label_dictionary("ner", add_unk=True)

But the argument does not seem to be recognised, as I get the following error:

tag_dictionary = corpus.make_label_dictionary("ner", add_unk=True)
TypeError: make_label_dictionary() got an unexpected keyword argument 'add_unk'
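
For readers unsure what the hint in the log is asking for, here is a rough illustration of a label dictionary that falls back to an unknown key. TinyLabelDictionary is a hypothetical stand-in written for this example; it is not Flair's Dictionary class and only mirrors the idea of reserving one index for items that were never added.

```python
from typing import Dict, List


class TinyLabelDictionary:
    """Hypothetical toy class; an illustration only, not Flair's implementation."""

    def __init__(self, add_unk: bool = True) -> None:
        self.item2idx: Dict[str, int] = {}
        self.idx2item: List[str] = []
        self.add_unk = add_unk
        if add_unk:
            self.add_item("<unk>")  # index 0 is reserved for unknown items

    def add_item(self, item: str) -> int:
        """Add an item if it is new and return its index."""
        if item not in self.item2idx:
            self.item2idx[item] = len(self.idx2item)
            self.idx2item.append(item)
        return self.item2idx[item]

    def get_idx_for_item(self, item: str) -> int:
        """Look up an item; unknown items map to index 0 only when add_unk is set."""
        if item in self.item2idx:
            return self.item2idx[item]
        if self.add_unk:
            return 0
        raise KeyError(f"The string '{item}' is not in dictionary!")


with_unk = TinyLabelDictionary(add_unk=True)
strict = TinyLabelDictionary(add_unk=False)
for d in (with_unk, strict):
    d.add_item("O")
    d.add_item("B-MATERIAL")

print(with_unk.get_idx_for_item("B-"))  # falls back to the <unk> index
try:
    strict.get_idx_for_item("B-")
except KeyError as err:
    print("strict lookup failed:", err)
```

Note that a fallback like this only hides a malformed tag such as 'B-' by mapping it to the unknown index; the underlying line in the corpus still needs to be corrected.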