Parsed order for prediction of test set is randomized

Alibaba-NLP / ACE

[ACL-IJCNLP 2021] Automated Concatenation of Embeddings for Structured Prediction


Parsed order for prediction of test set is randomized #36

Closed. tanpengshi closed this issue 2 years ago

tanpengshi commented 2 years ago

I have tried to run:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/nergrit_indo.yaml --parse --target_dir $dir --keep_order

for an Indonesian NER dataset trained with ACE. However, in the output .conllu file, the sentence order is randomized each time I run the command and does not match the sentence order in my test set.

Kindly assist, thank you!

wangxinyu0922 commented 2 years ago

It's strange. This line will let the code keep the order of the input data if --keep_order is activated. Have you changed the code of train.py?

tanpengshi commented 2 years ago

I did change the code in flair/datasets.py.

This is because my Indonesian dataset has only 2 columns, while the original data format has 4 columns. Furthermore, I do not have the '-DOCSTART- -X- -X- O' line, as the data is a single document.

Could this change be a problem?

The other big issue I encountered is that the output .conllu file has fewer lines than the test set, on top of the sentences being randomized.

wangxinyu0922 commented 2 years ago

train.py reads the parse dataset with 2 columns by default, so I think you don't need to change flair/datasets.py. For the output .conllu file, have you checked whether the number of sentences is the same as in the input file? Maybe you can share some screenshots or examples of the issue.
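One way to run the check suggested above is to count sentences in both files. This is a minimal sketch, assuming both files separate sentences with blank lines and contain no comment lines; `count_sentences` is a hypothetical helper and not part of ACE or flair.

```python
def count_sentences(path):
    """Count blank-line-separated sentences in a CoNLL-style file."""
    count = 0
    in_sentence = False
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped:
                # A blank line closes the current sentence, if one is open.
                if in_sentence:
                    count += 1
                in_sentence = False
            elif stripped.startswith("-DOCSTART-"):
                # Document markers are not counted as sentences.
                continue
            else:
                in_sentence = True
    # The last sentence may not be followed by a blank line.
    if in_sentence:
        count += 1
    return count
```

Calling it on the test file and on the output .conllu file and comparing the two numbers shows whether sentences were dropped, independently of their order.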

tanpengshi commented 2 years ago

For the output .conllu file, the number of sentences is much smaller than in the input file. The screenshot of the beginning of the input file is:

The screenshot of the beginning of the output .conllu file is:

tanpengshi commented 2 years ago

In the flair/datasets.py, I have to copy/paste a new class:

I also have to change columns = {0: "text", 1: "pos", 2: "chunk", 3: "ner"} to columns = {0: "text", 1: "ner"}; otherwise data parsing throws an error when I run the command.
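To illustrate why a four-column map would fail on two-column data, here is a rough sketch of a column-map reader. It is not flair's actual reader; `read_column_sentences` is a hypothetical name, and the sketch assumes whitespace-separated columns with blank lines between sentences.

```python
def read_column_sentences(lines, columns):
    """Group token lines into sentences using a {column index: field name} map."""
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if len(fields) <= max(columns):
            # The map refers to a column that this line does not have.
            raise ValueError(
                f"expected at least {max(columns) + 1} columns, "
                f"got {len(fields)}: {line!r}"
            )
        current.append({name: fields[idx] for idx, name in columns.items()})
    if current:
        sentences.append(current)
    return sentences
```

With {0: "text", 1: "ner"} a line such as "Jakarta B-LOC" parses cleanly, while the four-column map raises an error because column 3 does not exist. Sentences come back in input order, since nothing here shuffles them.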

wangxinyu0922 commented 2 years ago

How did you change the code of train.py? Since you are using a different corpus class for your dataset, train.py should be changed accordingly. Moreover, I believe you do not need to create a new class for it, since ColumnCorpus in train.py reads the two-column format by default.

By the way, can I take a look at your config file?

tanpengshi commented 2 years ago

I actually did not change anything in train.py, although I did copy and paste a new class NERGRIT in flair/datasets.py, with changes only to the columns variable.

Interestingly, while debugging, I stumbled across a working solution. In the class ColumnCorpus in flair/datasets.py:

The train variable did have the right number and the right order of test sentences. Hence, I added a new attribute to the class:

Then in train.py, I changed corpus.train to corpus.training in the script below, and this solved the bug.

wangxinyu0922 commented 2 years ago

The last line of __init__() in class ColumnCorpus initializes the class Corpus in flair/data.py, which assigns the property self.train = train.

I'm not sure what happened in your environment, but I'm glad to know you have solved the bug.

junwei-h commented 2 years ago

I have the same problem. @tanpengshi's method works but is hacky. The problem is that when --parse is used, the code should not split off 10% of the train data for test or dev data, because this is inference, not training. See https://github.com/Alibaba-NLP/ACE/blob/main/flair/datasets.py#L107

Would be great if the author can clean up the logic.
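A random holdout split of the kind described above would explain both symptoms: the remaining sentences are fewer than the input and come back in a random order. The sketch below only illustrates that idea and a possible guard. It is written under the assumption that the linked code splits at random when no dev or test file is given, which was not verified here; `split_holdout`, `load_sentences`, and the `parse` flag are hypothetical names, not ACE's API.

```python
import random


def split_holdout(sentences, fraction=0.1, seed=None):
    """Randomly carve off a holdout set; the remainder comes back shuffled."""
    rng = random.Random(seed)
    order = list(range(len(sentences)))
    rng.shuffle(order)
    holdout_size = round(len(sentences) * fraction)
    holdout = [sentences[i] for i in order[:holdout_size]]
    remainder = [sentences[i] for i in order[holdout_size:]]
    return remainder, holdout


def load_sentences(sentences, parse=False):
    """Hypothetical loader: skip the holdout split entirely when parsing."""
    if parse:
        # Inference: keep every sentence, in the original order.
        return list(sentences), [], []
    train, test = split_holdout(sentences)
    train, dev = split_holdout(train)
    return train, dev, test
```

With parse=True every input sentence is returned in its original order. With parse=False two successive 10% splits leave about 81% of the sentences, in shuffled order, which matches the reported behaviour.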