flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

[Bug]: corpus.make_label_dictionary generates too many tags #3294

Open ijazul-haq opened 1 year ago

ijazul-haq commented 1 year ago

Describe the bug

I have only 38 tags in my POS corpus, but corpus.make_label_dictionary returns a dictionary of 400 tags.

To Reproduce

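from flair.data import Corpus
from flair.datasets import ColumnCorpus

# Column 0 holds the token, column 1 its POS tag.
columns = {0: 'text', 1: 'pos'}
corpus: Corpus = ColumnCorpus('dataset/flair/', columns, train_file='train.txt', test_file='test.txt', dev_file='dev.txt')

label_dict = corpus.make_label_dictionary(label_type='pos', add_unk=True)
print(label_dict)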

Expected behavior

I expect the length of label_dict to be 38 tags.

Logs and Stack traces

2023-08-07 22:04:49,436 Dictionary created for label 'pos' with 249 values: IN (seen 2771 times), JJ (seen 1938 times), NN.C.1.M (seen 1831 times), NN.C.2 (seen 1303 times), NN.C.1.F (seen 1211 times), CC (seen 1152 times), PT (seen 1140 times), RB (seen 914 times), NN.P (seen 846 times), DT (seen 666 times), VB.DX (seen 515 times), VB.PC (seen 432 times), NB (seen 365 times), VB.P (seen 342 times), VB.D (seen 332 times), PU (seen 314 times), PR.C (seen 248 times), VB.H (seen 227 times), VB.DC (seen 153 times), NG (seen 147 times)
Dictionary with 249 tags: <unk>, IN, JJ, NN.C.1.M, NN.C.2, NN.C.1.F, CC, PT, RB, NN.P, DT, VB.DX, VB.PC, NB, VB.P, VB.D, PU, PR.C, VB.H, VB.DC, NG, VB.G, VB.INF, BA, PR.P.iii, RP, VB.PX, PR.P.i, FX, VB.IMP, PR.P.ii, PR.P$, PR.W, VB.N, PR.DIS, FW, امله, ویروس, خان, شمېر, چارو, کبله, جام, مخې, څه, اباد, ورځ, ملتونو, ډګر, عربستان
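
The word forms at the end of the dictionary above (امله, ویروس, خان, ...) may indicate that some rows in the data files carry a token in the column that is read as 'pos', for example when a row has more fields than expected. A minimal sketch for checking this outside of flair follows. It assumes the files are whitespace-delimited with exactly two columns, which may not match how the corpus is actually parsed, and `audit_column_lines` is a hypothetical helper name, not part of flair.

```python
from collections import Counter


def audit_column_lines(lines, expected_columns=2, tag_column=1):
    """Count tag values and collect rows with an unexpected number of fields.

    Assumption: fields are separated by whitespace and blank lines only
    mark sentence boundaries. Adjust if your files use another layout.
    """
    tag_counts = Counter()
    odd_rows = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            # Blank line: sentence boundary, nothing to check.
            continue
        fields = line.split()
        if len(fields) != expected_columns:
            # Extra or missing fields shift what lands in the tag column.
            odd_rows.append((line_no, fields))
            continue
        tag_counts[fields[tag_column]] += 1
    return tag_counts, odd_rows


# Toy rows standing in for a real file; line 3 has three fields.
sample = ["tok1 NN", "", "tok2 tok3 JJ", "tok4 NN"]
counts, odd = audit_column_lines(sample)
print(counts.most_common())
print(odd)
```

To run it on the real data, pass an open file instead of the toy list, e.g. `open('dataset/flair/train.txt', encoding='utf-8')`. Rows reported in `odd` and tags that occur only once or twice in `counts` are the first places to look.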

Screenshots

No response

Additional Context

No response

Environment

flair = 0.12.2
torch = 2.0.1
Python =
transformers = 4.31.0

ijazul-haq commented 1 year ago

Python = 3.9.17

helpmefindaname commented 1 year ago

Hi @ijazul-haq, please note that because you are using a custom private dataset, we cannot tell what is going wrong. You can debug this issue by: