UKPLab / emnlp2017-bilstm-cnn-crf

BiLSTM-CNN-CRF architecture for sequence tagging
Apache License 2.0

Error while computing F1-score #53

Closed. Mahmedturk closed this issue 4 years ago

Mahmedturk commented 4 years ago

Hi,

I am doing NER on a dataset that is tagged in the BIO encoding scheme, and in the 7th epoch I get the error below while computing the F1-score. Does that mean there is something wrong with the dataset labels? If yes, how can I print the predictions and find out exactly where the problem is?

Documents/Gitstuff/emnlp2017-bilstm-cnn-crf/util/BIOF1Validation.py", line 185, in checkBIOEncoding
    assert(False) #Should never be reached
AssertionError

nreimers commented 4 years ago

Hi @Mahmedturk, the error indicates that a label is invalid, i.e., not a valid BIO label. The only valid BIO labels are 'O' (the letter, not the number) and labels that start with B- or I-.

If you have in your label set for example the label 'ABC', and this label is predicted, then this error is raised.

Do you maybe use some other tag set, like IOBES? Then you must set the right config value so that it is converted to BIO before the F1 score is computed.
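To illustrate what such a conversion does, here is a minimal sketch of mapping IOBES labels to BIO. It is not the repository's own conversion code; the function name `iobes_to_bio` and the example labels are hypothetical. It assumes that S- (single-token span) becomes B- and E- (end of span) becomes I-, while O, B- and I- labels stay unchanged.

```python
def iobes_to_bio(labels):
    """Convert a sequence of IOBES labels to BIO labels (illustrative sketch)."""
    converted = []
    for label in labels:
        if label.startswith("S-"):
            # A single-token span starts (and ends) a chunk: S-X -> B-X
            converted.append("B-" + label[2:])
        elif label.startswith("E-"):
            # The last token of a span is inside the chunk: E-X -> I-X
            converted.append("I-" + label[2:])
        else:
            # O, B-X and I-X are already valid BIO labels
            converted.append(label)
    return converted


print(iobes_to_bio(["S-PER", "O", "B-LOC", "I-LOC", "E-LOC"]))
# ['B-PER', 'O', 'B-LOC', 'I-LOC', 'I-LOC']
```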

Or maybe some noise is part of your label set? This can happen if the column structure of the CoNLL files is not strictly followed.

Mahmedturk commented 4 years ago

Which CoNLL format to be precise? CoNLL 2009?

nreimers commented 4 years ago

The CoNLL format that was used in 2000 for chunking or in 2003 for NER.

Mahmedturk commented 4 years ago

I have finished scanning my whole dataset. All the labels are either O, "B-(an integer)" or "I-(an integer)". There is no label other than those. Also, if there is noise in the labels, could you comment on why the program works for the first few epochs and then raises an AssertionError?

nreimers commented 4 years ago

Hi, the error should be in the gold labels.

You can inspect the gold labels by inspecting the mappings variable. I will use Train_Chunking.py as an example.

In line 59, add:

print(mappings['your_label_key']) # I.e., for Train_Chunking.py, set it to chunk_BIO

This should print the mappings for your labels.

I expect that one of the labels will not follow the BIO encoding scheme, i.e., it will be something other than O, B- or I-.
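Instead of reading the printed mapping by eye, the check can be automated. The sketch below assumes the mapping is a dict from label strings to integer ids; the helper name `find_invalid_bio_labels` and the example mapping are hypothetical and only serve as illustration.

```python
def find_invalid_bio_labels(label_to_idx):
    """Return all labels that are neither 'O' nor prefixed with 'B-' or 'I-'."""
    return sorted(
        label
        for label in label_to_idx
        if not (label == "O" or label.startswith(("B-", "I-")))
    )


# Hypothetical mapping with one stray entry that is not a BIO label
example_mapping = {"O": 0, "B-1": 1, "I-1": 2, "stray_token": 3}
print(find_invalid_bio_labels(example_mapping))
# ['stray_token']
```

Any label returned here is a candidate for the entry that triggers the assertion.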

Next you need to find out why this label is in your dataset.

The error occurs only once a faulty label is predicted, which is why training can run for a few epochs before the assertion fails.

Best, Nils Reimers

Mahmedturk commented 4 years ago

Hi @nreimers

This has helped me identify the issue. The problem was that one of the rows had two tokens in the token column, and the second token was read in as a label. After deleting that token, the code now works.
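A problem like this can be located by checking the column count of every row. The following is a minimal sketch, assuming a whitespace-separated file with one token per row and blank lines between sentences; the function name `find_malformed_rows`, the default of two columns, and the sample rows are hypothetical.

```python
def find_malformed_rows(lines, expected_columns=2):
    """Return (line_number, text) pairs whose column count differs from expected."""
    bad_rows = []
    for line_no, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            # Blank lines separate sentences and are not data rows
            continue
        if len(stripped.split()) != expected_columns:
            bad_rows.append((line_no, stripped))
    return bad_rows


# Hypothetical sample: row 3 has two tokens before the label
sample = ["John B-1", "Smith I-1", "New York B-2", "", "visited O"]
for line_no, text in find_malformed_rows(sample):
    print(line_no, text)
# 3 New York B-2
```

To check a real file, pass an open file object in place of `sample` and set `expected_columns` to the number of columns the dataset is supposed to have.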

nreimers commented 4 years ago

Glad that it works now. I had a similar issue before, where a line contained more than one token.