Closed: tanpengshi closed this issue 2 years ago
It's strange. This line will let the code keep the order of the input data if --keep_order is activated. Have you changed the code of train.py?
I did change the code in flair/datasets.py.
This is because my Indonesian dataset has only 2 columns, while the original data format has 4 columns. Furthermore, I do not have the '-DOCSTART- -X- -X- O' line, as it is a single document.
Could this change be a problem?
The other big issue I encountered is that the output .conllu file contains fewer lines than the test set, on top of the sentences being randomized.
train.py reads the dataset to be parsed in two-column format by default, so I think you don't need to change flair/datasets.py. For the output .conllu file, have you checked that the number of sentences is the same as in the input file? Maybe you can share some screenshots or examples of the issue you met.
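The sentence-count check suggested above can be done with a few lines of Python. This is only a minimal sketch, assuming both files separate sentences with blank lines and put one token per line; the helper names (read_sentences, first_mismatch) and the token column indices are hypothetical and not part of ACE or flair, so adjust them to the actual file layout.

```python
from typing import Iterable, List, Optional


def read_sentences(lines: Iterable[str]) -> List[List[str]]:
    """Group token lines into sentences; blank lines end a sentence."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line)
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences


def first_mismatch(a: List[List[str]], b: List[List[str]],
                   col_a: int = 0, col_b: int = 0) -> Optional[int]:
    """Return the index of the first sentence whose tokens differ, or None.

    col_a / col_b are the (assumed) columns holding the token text.
    """
    for i, (sent_a, sent_b) in enumerate(zip(a, b)):
        tokens_a = [line.split()[col_a] for line in sent_a]
        tokens_b = [line.split()[col_b] for line in sent_b]
        if tokens_a != tokens_b:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one file simply has fewer sentences
    return None
```

To use it on real files, pass open file handles, e.g. read_sentences(open(path, encoding="utf-8")), then compare len() of both results and call first_mismatch to see where the order starts to diverge.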
For the output .conllu file, the number of sentences is much smaller than that in the input file. The screenshot of the beginning of the input file is:
The screenshot of the beginning of the output .conllu file is:
In the flair/datasets.py, I have to copy/paste a new class:
I also have to change columns = {0: "text", 1: "pos", 2: "chunk", 3: "ner"} to columns = {0: "text", 1: "ner"}; otherwise data parsing throws an error when I run the command.
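To illustrate why the column map has to match the file, here is a small standalone sketch. It is not flair's actual parser; parse_line is a hypothetical helper that only shows how a four-column map fails on a two-column line while a two-column map succeeds.

```python
from typing import Dict


def parse_line(line: str, columns: Dict[int, str]) -> Dict[str, str]:
    """Map whitespace-separated fields of one token line to named columns."""
    fields = line.split()
    needed = max(columns) + 1
    if len(fields) < needed:
        raise ValueError(
            f"expected at least {needed} columns, got {len(fields)}: {line!r}"
        )
    return {name: fields[idx] for idx, name in columns.items()}


two_col = {0: "text", 1: "ner"}
four_col = {0: "text", 1: "pos", 2: "chunk", 3: "ner"}

token = parse_line("Jakarta B-LOC", two_col)  # works on a two-column line
```

Calling parse_line("Jakarta B-LOC", four_col) raises ValueError in this sketch, which mirrors the kind of parsing error described above.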
How did you change the code of train.py? Since you are using a different corpus class for your dataset, train.py should be changed accordingly. Moreover, I believe you do not need to create a new class for it, since ColumnCorpus in train.py reads the two-column format by default.
By the way, can I take a look at your config file?
I actually did not change anything in train.py, although I did copy and paste a new class, NERGRIT, into flair/datasets.py, with changes only to the columns variable.
Interestingly, while debugging, I stumbled across a functional solution. In the class ColumnCorpus in flair/datasets.py:
The train variable did have the right number and the right order of test sentences. Hence, I added a new attribute to the class:
Then in train.py, I changed corpus.train to corpus.training in the script below, and this solved the bug.
In the last line of __init__() of the class ColumnCorpus, the code initializes the class Corpus in flair/data.py, which assigns the attribute self.train=train.
I'm not sure what happened in your environment, but I'm glad to know you have solved the bug.
I have the same problem. @tanpengshi's method works but is hacky. The problem is that when --parse is used, the code should not split off 10% of the train data for test or dev data, as this is inference rather than training. See https://github.com/Alibaba-NLP/ACE/blob/main/flair/datasets.py#L107
Would be great if the author can clean up the logic.
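A rough sketch of the cleaner logic being asked for, in plain Python. This is not the ACE or flair code; make_splits, parse_only, and holdout are hypothetical names, and the 10% figure is taken from the comment above. The idea is simply that a parse-only run returns the loaded sentences untouched, so both their number and their order are preserved.

```python
import random
from typing import List, Optional, Sequence, Tuple


def make_splits(
    train: Sequence[str],
    dev: Optional[Sequence[str]] = None,
    test: Optional[Sequence[str]] = None,
    parse_only: bool = False,   # hypothetical flag mirroring --parse
    holdout: float = 0.1,       # fraction carved out when a split is missing
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Return (train, dev, test); never touch train when only parsing."""
    train = list(train)
    if parse_only:
        # Inference: keep every sentence, in the original order.
        return train, list(dev or []), list(test or [])

    rng = random.Random(seed)

    def carve(pool: List[str]) -> Tuple[List[str], List[str]]:
        """Move a random `holdout` fraction of pool into a held-out list."""
        k = round(len(pool) * holdout)
        picked = set(rng.sample(range(len(pool)), k))
        held = [s for i, s in enumerate(pool) if i in picked]
        rest = [s for i, s in enumerate(pool) if i not in picked]
        return rest, held

    if test is None:
        train, test = carve(train)
    if dev is None:
        train, dev = carve(train)
    return train, list(dev), list(test)
```

With 100 input sentences and no dev or test files, the training branch leaves fewer sentences in train than were loaded, which matches the symptom of an output file shorter than the input; the parse_only branch returns all 100 in order.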
I have tried to run:
CUDA_VISIBLE_DEVICES=0 python train.py --config config/nergrit_indo.yaml --parse --target_dir $dir --keep_order
for an Indonesian NER dataset trained with ACE. However, in the output .conllu file, the sentence order is randomized each time I run the command and does not match the sentence order in my test set.
Kindly assist, thank you!