flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

[Question]: Extending NER tags of Hunflair #3447

Open skywalker2202 opened 2 months ago

skywalker2202 commented 2 months ago

Question

I want to fine-tune the hunflair-gene model and extend the tags of the original model. The label dictionary of hunflair-gene contains the following items: ['<unk>', 'O', 'S-Gene', 'B-Gene', 'I-Gene', 'E-Gene', '<START>', '<STOP>'].

However, calling "previous_tag_dictionary.span_labels()" raises "AttributeError: 'Dictionary' object has no attribute 'span_labels'". The dictionary is loaded as follows:

    from flair.models import SequenceTagger

    previous_tagger = SequenceTagger.load("hunflair-gene")
    previous_tag_dictionary = previous_tagger.label_dictionary
    previous_tag_dictionary.get_items()

outputs ['<unk>', 'O', 'S-Gene', 'B-Gene', 'I-Gene', 'E-Gene', '<START>', '<STOP>'].

My annotated corpus contains two tags, LIG and REC. I converted it to a column corpus and created a new tag dictionary from it:

    from flair.datasets import ColumnCorpus

    columns = {0: 'text', 1: 'ner'}
    corpus = ColumnCorpus(config["data_folder"], columns, train_file='train.txt', dev_file='val.txt', test_file='test.txt')
    new_tag_dictionary = corpus.make_label_dictionary(label_type='ner', add_unk=False)
    new_tag_dictionary.get_items()

This outputs:

    2024-04-26 16:16:18,169 Dictionary created for label 'ner' with 2 values: LIG (seen 719 times), REC (seen 296 times)
    ['LIG', 'REC']

I want to fine-tune hunflair-gene on the new dataset. As I understand it, I need to create a new tag dictionary. When I try the following:

    for old_tag in previous_tag_dictionary.get_items():
        new_tag_dictionary.add_item(str(old_tag))

    print(f"Updated tag dictionary : {new_tag_dictionary}")

it outputs:

    Updated tag dictionary : Dictionary with 10 tags: LIG, REC, <unk>, O, S-Gene, B-Gene, I-Gene, E-Gene, <START>, <STOP>

However, when I do

    tagger_new = SequenceTagger(
        hidden_size=256,
        embeddings=previous_tagger.embeddings,
        tag_dictionary=new_tag_dictionary,
        tag_type='ner',
    )

it outputs

    2024-04-26 16:16:31,545 SequenceTagger predicts: Dictionary with 37 tags: O, S-LIG, B-LIG, E-LIG, I-LIG, S-REC, B-REC, E-REC, I-REC, S-O, B-O, E-O, I-O, S-S-Gene, B-S-Gene, E-S-Gene, I-S-Gene, S-B-Gene, B-B-Gene, E-B-Gene, I-B-Gene, S-I-Gene, B-I-Gene, E-I-Gene, I-I-Gene, S-E-Gene, B-E-Gene, E-E-Gene, I-E-Gene, S-<START>, B-<START>, E-<START>, I-<START>, S-<STOP>, B-<STOP>, E-<STOP>, I-<STOP>

That is too many tags. Any help would be appreciated.
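The count of 37 looks consistent with every entry of the merged dictionary (except `<unk>`) being treated as a span label and expanded with the four BIOES prefixes, on top of a single `O` tag: 1 + 9 × 4 = 37. The sketch below is plain Python that only mimics the pattern visible in the log; it does not call Flair, and the helper names (`expand_to_bioes`, `strip_bioes_prefix`, `to_span_labels`) are hypothetical. It assumes that a dictionary of bare span labels (Gene, LIG, REC) is what the tagger should receive, which may not match how Flair handles this internally.

```python
# Illustration only: mimics the tag expansion seen in the log, without using Flair.
BIOES_PREFIXES = ("S", "B", "E", "I")
SPECIAL_ITEMS = {"<unk>", "O", "<START>", "<STOP>"}


def expand_to_bioes(labels):
    """Return 'O' plus four prefixed tags per label, skipping '<unk>'."""
    tags = ["O"]
    for label in labels:
        if label == "<unk>":
            continue
        for prefix in BIOES_PREFIXES:
            tags.append(f"{prefix}-{label}")
    return tags


def strip_bioes_prefix(tag):
    """Turn 'S-Gene' into 'Gene'; leave tags without a BIOES prefix unchanged."""
    if len(tag) > 2 and tag[1] == "-" and tag[0] in "BIES":
        return tag[2:]
    return tag


def to_span_labels(items):
    """Drop special items and BIOES prefixes, keeping first-seen order."""
    labels = []
    for item in items:
        if item in SPECIAL_ITEMS:
            continue
        label = strip_bioes_prefix(item)
        if label not in labels:
            labels.append(label)
    return labels


# The merged dictionary from the question.
merged_items = ["LIG", "REC", "<unk>", "O", "S-Gene", "B-Gene",
                "I-Gene", "E-Gene", "<START>", "<STOP>"]

print(len(expand_to_bioes(merged_items)))  # 37, as in the log

span_labels = to_span_labels(merged_items)
print(span_labels)                         # ['LIG', 'REC', 'Gene']
print(len(expand_to_bioes(span_labels)))   # 13
```

Under that assumption, merging bare span labels instead of the already-prefixed items would give 13 tags rather than 37.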