myeghaneh commented 9 months ago

Question

I am now using a newer version, flair 0.12.2, for a script that I used before (I believe with 0.10) to train a SequenceTagger on a Persian corpus using the BIO scheme. What I do not understand is that the BIO-format labels are not recognized, as you can see here:

``
2023-10-10 15:24:26,756 Reading data from ..\Corpus
2023-10-10 15:24:26,757 Train: ..\Corpus\trainPAMT_V03_F01.txt
2023-10-10 15:24:26,758 Dev: None
2023-10-10 15:24:26,758 Test: ..\Corpus\testPAMT_V03_F01.txt
Corpus: 67 train + 7 dev + 38 test sentences
Dictionary with 3 tags: O, <START>, <STOP>
2023-10-10 15:24:36,075 SequenceTagger predicts: Dictionary with 3 tags: O, <START>, <STOP>
2023-10-10 15:24:36,379 ----------------------------------------------------------------------------------------------------
2023-10-10 15:24:36,383 Model: "SequenceTagger(
  (embeddings): TransformerWordEmbeddings(
    (model): RobertaModel(
      (embeddings): RobertaEmbeddings(
        (word_embeddings): Embedding(50266, 768)
        (position_embeddings): Embedding(514, 768, padding_idx=1)
        (token_type_embeddings): Embedding(1, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
``

I then observed that training brings no improvement, which makes sense given these labels:

Dictionary with 3 tags: O, <START>, <STOP>

Here is what my data looks like:

Since it was working before (it could recognize all the tags, and now it cannot), I am wondering what the problem is.

Here is my training code:

`` data_folder = '../Corpus/'

columns = {0: 'text', 1: 'pos', 2: 'BIO'}

n = 3
kf = KFold(n_splits=n, random_state=1, shuffle=True)
results = []
for train_index, val_index in kf.split(DATA):
    train_df = DATA.iloc[train_index]
    test_df = DATA.iloc[val_index]

with open('../Corpus/trainPAMT_V03_F01.txt', 'w', encoding = "utf-8-sig") as f:
    for l in train_df:
        for tpl in l:
            f.write('{} {} {}'.format(tpl[0],tpl[1], tpl[2]))
            f.write('\n')   
        f.write('\n')
with open('../Corpus/testPAMT_V03_F01.txt', 'w', encoding = "utf-8-sig") as f:
    for l in test_df:
        for tpl in l:
            f.write('{} {} {}'.format(tpl[0],tpl[1], tpl[2]))
            f.write('\n')
        f.write('\n')

corpus:Corpus = ColumnCorpus(data_folder, columns,train_file='trainPAMT_V03_F01.txt',test_file='testPAMT_V03_F01.txt')
print(corpus)

tag_type = 'BIO'

tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary)
#farsi_embedding = WordEmbeddings('fa-crawl')

#onehot_embeddings=OneHotEmbeddings(corpus, field="pos") 
embedding_types: List[TokenEmbeddings] = [
WordEmbeddings('fa-crawl'),
FlairEmbeddings('fa-forward'),
FlairEmbeddings('fa-backward'),
 ]

embeddings = TransformerWordEmbeddings('roberta-base')
#embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
tagger: SequenceTagger = SequenceTagger(hidden_size=2048,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,use_crf=True 
                                        )

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# 7. start training
trainer.train('../resource/PersianSeqTaggerFACrawl_lr8_batch8E75', embeddings_storage_mode='none',
              learning_rate=.8,
              mini_batch_size=8,
              max_epochs=5)

``

Thank you in advance. I have the feeling that the format of the corpus is now causing a problem.

IvanDePivan commented 9 months ago

Does your input data contain spaces or tabs as separators?
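One way to check this is to count how many fields each non-blank line yields under each candidate separator. The following is only a minimal sketch: the helper name `describe_columns` and the sample rows (including the tag `B-X`) are hypothetical and not taken from the actual corpus.

```python
from collections import Counter


def describe_columns(lines):
    """Count, per candidate delimiter, how many fields each non-blank line has."""
    report = {"tab": Counter(), "space": Counter()}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # blank lines only separate sentences
        report["tab"][len(line.split("\t"))] += 1
        report["space"][len(line.split(" "))] += 1
    return report


# Hypothetical sample in the "text pos BIO" layout, separated by single spaces.
sample = ["من PRON B-X\n", "رفتم VERB O\n", "\n", "خانه NOUN O\n"]
report = describe_columns(sample)
print(report)

# To inspect a real file instead, pass an open file handle, for example:
# with open("../Corpus/trainPAMT_V03_F01.txt", encoding="utf-8-sig") as f:
#     print(describe_columns(f))
```

If every line reports exactly three fields for one separator, that separator is consistent across the file; any other field count points to lines worth inspecting by hand.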

myeghaneh commented 9 months ago

I would say spaces. As you can see in the screenshot, there is a space.

However, there may be a problem there; I also have the feeling that this is related to the format of my data.

The input data was prepared as follows (spaCy has no pretrained model for Persian, so I used a package called Hazm):

``
nlp = Persian()
tagger = POSTagger(model='../resource/pos_tagger.model', universal_tag=True)

def mapTokenPos(l):
    tokenlist = []
    for x in l:
        a = word_tokenize(x)
        b = tagger.tag(word_tokenize(x))
        tokenlist.append(b)
    return tokenlist
``

IvanDePivan commented 9 months ago

I'm not sure how this relates to your data, but I had to replace the spaces with tabs to get it to work. That might help you as well.

myeghaneh commented 9 months ago

Hi Ivan,

Thank you for your response. I did not quite get your point: do you mean replacing \n with \t throughout the code that builds the corpus, and then defining \t as the delimiter? (I would guess this should apply only to the labels, i.e. the last column, correct?)

It was working before with \n. Is this related to the update?

I see there is some new material here: https://flairnlp.github.io/docs/tutorial-training/how-to-load-custom-dataset

IvanDePivan commented 9 months ago

Apparently, you can indeed specify a delimiter. I mean the separator between the columns, so that each line reads [WORD]\t[POS]\t[TAG]. The last release was a while ago, so I'm not sure what has changed.

You can find the ColumnCorpus here: https://github.com/flairNLP/flair/blob/42ea3f6854eba04387c38045f160c18bdaac07dc/flair/datasets/sequence_labeling.py#L373

For me, using tabs as the delimiter works; you might also get away with specifying a space as your delimiter.
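As a rough sketch of the tab-separated layout described above, the corpus file could be written like this. The helper `write_column_file`, the file name `train_tabs.txt`, and the sample sentences (including the tag `B-X`) are hypothetical. The commented-out flair call assumes that `ColumnCorpus` accepts a `column_delimiter` argument, which should be verified against the installed flair version.

```python
import os
import tempfile


def write_column_file(path, sentences, delimiter="\t"):
    """Write one token per line as text/pos/tag, with a blank line after each sentence."""
    # Plain utf-8 is an assumption of this sketch; the original script uses utf-8-sig.
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for token, pos, tag in sentence:
                f.write(delimiter.join((token, pos, tag)) + "\n")
            f.write("\n")


# Hypothetical sample data: two sentences of (token, pos, tag) tuples.
sentences = [
    [("من", "PRON", "O"), ("رفتم", "VERB", "B-X")],
    [("خانه", "NOUN", "O")],
]

path = os.path.join(tempfile.mkdtemp(), "train_tabs.txt")
write_column_file(path, sentences)

with open(path, encoding="utf-8") as f:
    written = f.read().splitlines()
print(written)

# Possible follow-up with flair (not run here, argument name is an assumption):
# corpus = ColumnCorpus(data_folder, columns,
#                       train_file="train_tabs.txt",
#                       column_delimiter="\t")
```

Joining the fields with an explicit delimiter keeps the column count stable even if a token itself contains a space.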

flairNLP / flair

[Question]: Issue with NER Custom Model: BIO-format Labels (Tages) Not Recognized #3329
