Open myeghaneh opened 9 months ago
Does your input data contain spaces or tabs as separators?
I would say spaces. As you can see in the screenshot, there is a space.
However, there might be some problem there, and I have the same feeling that it is related to the format of my data.
The input data was prepared as follows (spaCy has no pretrained model for Persian, so I did it with a package called Hazm):
``
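from hazm import POSTagger, word_tokenize
from spacy.lang.fa import Persian

nlp = Persian()
tagger = POSTagger(model='../resource/pos_tagger.model', universal_tag=True)

def mapTokenPos(l):
    # Tokenize each sentence and tag it; returns one list of (token, POS) pairs per sentence.
    tokenlist = []
    for x in l:
        b = tagger.tag(word_tokenize(x))
        tokenlist.append(b)
    return tokenlist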
``
I'm not sure how this relates to your data, but I had to replace the spaces with tabs to get it to work; that might help you as well.
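For reference, replacing the space separators with tabs could be sketched roughly like this. It is only a minimal example using the standard library: the function name `spaces_to_tabs` and the sample rows (including the `B-X` label) are made up for illustration, and it assumes that no token contains whitespace itself.

```python
import re

def spaces_to_tabs(text: str) -> str:
    """Rewrite whitespace-separated columns as tab-separated columns.

    Blank lines are kept, since they mark sentence boundaries
    in column-formatted corpora.
    """
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            out.append("")  # keep the sentence boundary
        else:
            # Collapse every run of whitespace into a single tab.
            out.append("\t".join(re.split(r"\s+", stripped)))
    return "\n".join(out) + "\n"

# Hypothetical three-column sample: token, POS tag, BIO label.
sample = "w1 NOUN B-X\nw2 VERB O\n\nw3 NOUN O\n"
converted = spaces_to_tabs(sample)
print(repr(converted))
```

To apply it to a real file, read the file with `encoding="utf-8"`, pass the contents through the function, and write the result to a new file so the original corpus stays untouched.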
Hi Ivan,
Thank you for your response. I did not quite get your point: do you mean I should replace \n with \t throughout my code when building the corpus, and then define \t as the delimiter? (I would guess it should apply only to the labels, i.e. the last column, correct?)
It was working before with \n. Is this related to the update?
I see there is some new material here: https://flairnlp.github.io/docs/tutorial-training/how-to-load-custom-dataset
Apparently, you can indeed specify a delimiter. I mean to separate the columns, so you have [WORD]\t[POS]\t[TAG]
The last release was a while ago, so I'm not sure what has changed.
You can find the ColumnCorpus here: https://github.com/flairNLP/flair/blob/42ea3f6854eba04387c38045f160c18bdaac07dc/flair/datasets/sequence_labeling.py#L373
For me, using tabs as the delimiter works; you might also get away with specifying a space as your delimiter.
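A rough sketch of what passing a delimiter might look like is below. It assumes that the installed flair version accepts a `column_delimiter` keyword on `ColumnCorpus` (worth checking against the source linked above). The helper name `load_corpus` and the two constants are made up for illustration; the column mapping is the one from the question.

```python
# Column mapping taken from the question: token, POS tag, BIO label.
COLUMNS = {0: "text", 1: "pos", 2: "BIO"}

# Assumption: columns are separated by a single tab.
# For whitespace-separated files, r"\s+" would be the pattern to try instead.
COLUMN_DELIMITER = "\t"

def load_corpus(data_folder, train_file, test_file):
    """Load a column-formatted corpus with an explicit column delimiter."""
    # Imported lazily so the module can be loaded without flair installed.
    from flair.datasets import ColumnCorpus

    return ColumnCorpus(
        data_folder,
        COLUMNS,
        train_file=train_file,
        test_file=test_file,
        # Assumed keyword; verify that your flair version supports it.
        column_delimiter=COLUMN_DELIMITER,
    )
```

With the files from the log, a call might look like `load_corpus('../Corpus/', 'trainPAMT_V03_F01.txt', 'testPAMT_V03_F01.txt')`. Printing `corpus.make_label_dictionary(label_type='BIO')` afterwards should show whether the labels are picked up.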
Question
After a while, I am now using a newer version of flair (0.12.2) for a script I used before (back then I think I used 0.10) to train a SequenceTagger on a Persian corpus using the BIO scheme. The part I do not understand is that the BIO-format labels are not recognized, as you can see here:
``
2023-10-10 15:24:26,756 Reading data from ..\Corpus
2023-10-10 15:24:26,757 Train: ..\Corpus\trainPAMT_V03_F01.txt
2023-10-10 15:24:26,758 Dev: None
2023-10-10 15:24:26,758 Test: ..\Corpus\testPAMT_V03_F01.txt
Corpus: 67 train + 7 dev + 38 test sentences
Dictionary with 3 tags: O, <START>, <STOP>
2023-10-10 15:24:36,075 SequenceTagger predicts: Dictionary with 3 tags: O, <START>, <STOP>
2023-10-10 15:24:36,379 ----------------------------------------------------------------------------------------------------
2023-10-10 15:24:36,383 Model: "SequenceTagger(
(embeddings): TransformerWordEmbeddings(
(model): RobertaModel(
(embeddings): RobertaEmbeddings(
(word_embeddings): Embedding(50266, 768)
(position_embeddings): Embedding(514, 768, padding_idx=1)
(token_type_embeddings): Embedding(1, 768)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
``
Then I observed that there is no improvement, which makes sense given the labels:
Dictionary with 3 tags: O, <START>, <STOP>
Here is what my data looks like. Since it was working before (it could recognize all the tags, but now it cannot), I am wondering what the problem is.
Here is my training code:
``
from sklearn.model_selection import KFold

data_folder = '../Corpus/'
columns = {0: 'text', 1: 'pos', 2: 'BIO'}  # token, POS tag, BIO label

# DATA is defined elsewhere (not shown); .iloc suggests a pandas DataFrame.
n = 3
kf = KFold(n_splits=n, random_state=1, shuffle=True)
results = []
for train_index, val_index in kf.split(DATA):
    train_df = DATA.iloc[train_index]
    test_df = DATA.iloc[val_index]
``
Thank you in advance. I have the feeling that the format of the corpus is now causing a problem.
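One way to check the corpus format independently of flair is to count the columns on every line and collect the labels found in the last column. The sketch below uses only the standard library. The function name `inspect_columns` and the sample rows are hypothetical, and it assumes a three-column layout like the one in the training code.

```python
from collections import Counter

def inspect_columns(text, delimiter=None, expected_columns=3):
    """Return (label_counts, bad_lines) for a column-formatted corpus.

    delimiter=None splits on any run of whitespace, like str.split().
    bad_lines holds (line_number, number_of_fields) for every non-blank
    line that does not have the expected number of columns.
    """
    labels = Counter()
    bad_lines = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank line = sentence boundary
        fields = line.split(delimiter)
        if len(fields) != expected_columns:
            bad_lines.append((lineno, len(fields)))
            continue
        labels[fields[-1]] += 1  # last column holds the label
    return labels, bad_lines

# Hypothetical sample: line 2 uses spaces, line 4 is missing its label.
sample = "w1\tNOUN\tB-X\nw2 VERB O\n\nw3\tNOUN\n"
labels, bad = inspect_columns(sample, delimiter="\t")
print(labels, bad)
```

If the label counts contain only `O`, or many lines show up in `bad_lines` for the delimiter the loader uses, the separator in the file probably does not match what the loader expects.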