Closed · Dimiftb closed this issue 2 years ago
Hi, it seems that the dataset is not in the directory /root/.flair/datasets/conll_03. Have you put the dataset in that directory?
Hi @wangxinyu0922,
Thanks for your reply. I followed the instructions here: https://github.com/flairNLP/flair/blob/master/resources/docs/EXPERIMENTS.md
I installed flair with pip install, and using the command below I set the location of the dataset:
Unfortunately, I'm faced with the same error. I believe the folder /root/.flair/datasets/conll_03 is within the flair package; however, your model seems to use a modified version of flair.
In addition, the instructions don't specify anywhere that you need to download the CoNLL dataset, compile it, and put it in a specific place. Are there instructions missing, or am I just not finding them?
After installing the requirements, I'm following this https://github.com/Alibaba-NLP/ACE#pretrained-models.
Can you check whether the data is actually in this directory? For example, you may use the Linux command ls /root/.flair/datasets/conll_03. I suspect the dataset is not in the right place.
By the way, we also provide the dataset (ner_data.zip) on OneDrive in the guide. You may download the dataset again and put it at /root/.flair/datasets/conll_03.
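As a quick sanity check, the directory listing above can also be done from Python (a minimal sketch; the path comes from the thread, and on Colab it would be /content/.flair/datasets/conll_03 instead):

```python
# Minimal existence check for the CoNLL-03 dataset directory.
# The path follows the thread; on Colab it would be /content/.flair/datasets/conll_03.
import os

dataset_dir = "/root/.flair/datasets/conll_03"
exists = os.path.isdir(dataset_dir)
print(exists)  # False here means the dataset is not in the expected place
if exists:
    print(sorted(os.listdir(dataset_dir)))  # expect the CoNLL-03 split files
```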
I'm using Colab to replicate the results. The .flair folder is within the flair package, I believe. Flair is not even installed after running pip install -r requirements.txt. Do I have to install it in order to run the code?
You don't need to install flair. Again, please check whether the dataset is in the right directory.
If you cannot check that, you may optionally set the dataset folder manually by following this guide.
Hi @wangxinyu0922,
I managed to resolve the issue, thank you very much. I still haven't successfully replicated the results, but if I need more help I'll open another issue. Closing this now.
Hi @wangxinyu0922 @dimiftb Can you please tell how to resolve the issue ? I'm facing the same.
Hi @AtharvanDogra , if your dataset is missing, you may follow https://github.com/Alibaba-NLP/ACE#train-on-your-own-dataset to manually set the path to your conll_03 dataset.
Hi @wangxinyu0922
I have tried it:
ner:
  Corpus: CONLL_03
  ColumnCorpus-1:
    data_folder: resources/tasks/conll_03_english
    column_format:
      0: text
      1: pos
      2: chunk
      3: ner
    tag_to_bioes: ner
  tag_dictionary: resources/taggers/ner_tags.pkl
This is what I've written in the yaml file, and it's still giving the same error: unable to find the dataset.
Please tell me where I'm going wrong.
@wangxinyu0922 @Dimiftb I've created my own config file with a new directory, and it's still giving the same error. What shall I do?
@AtharvanDogra you need the config file like this:
ner:
  Corpus: ColumnCorpus-1
  ColumnCorpus-1:
    data_folder: resources/tasks/conll_03_english
    column_format:
      0: text
      1: pos
      2: chunk
      3: ner
    tag_to_bioes: ner
  tag_dictionary: resources/taggers/ner_tags.pkl
The change is from Corpus: CONLL_03 to Corpus: ColumnCorpus-1.
Hi @AtharvanDogra,
the way I resolved the issue is that I simply created the directory /.flair/datasets and put the CoNLL dataset there. If you're using Colab, it'd be /content/.flair/datasets/conll_03.
@Dimiftb Thanks for your reply. @wangxinyu0922 I was able to get rid of that error with what you suggested.
Right now I'm facing another issue, for which I've seen you made some changes in response to another reported issue.
Where am I going wrong? I'm using the provided config file as-is, just making these changes:
ner:
  Corpus: ColumnCorpus-1
  ColumnCorpus-1:
    data_folder: resources/tasks/conll_03_english
    column_format:
      0: text
      1: pos
      2: chunk
      3: ner
    tag_to_bioes: ner
  tag_dictionary: resources/taggers/ner_tags.pkl
The above happens when I run the code to test the model, i.e. with the --test flag.
When I try to train a model using
CUDA_VISIBLE_DEVICES=0 python train.py --config config/copyConfig.yaml
it keeps crashing due to lack of RAM.
I've reduced the mini-batch size down to 2 and was still facing the problem:
You need to download xlm-roberta-large-finetuned-conll03-english and put it at resources/.
@wangxinyu0922 ISSUE 1 [Prediction on the data set provided by you]:
The issue I'm facing while predicting is that the train/test/dev sets become None even though I've provided the correct path.
When the prediction starts there is no such issue; the dataset files are read correctly, but after the models are loaded, this error appears.
Line 867 in datasets.py, as given in the error message.
Line 93.
From line 93, the call goes to this class, and the Path variable here receives None.
ISSUE 2 [Training a model on my own set of embeddings and dataset]: The process gets killed after bert-large-uncased is shown in the training log. I've reduced the batch size to 1. I'm using Colab Pro+ with 50 GB of RAM and a V100 GPU.
ISSUE 3 [need BIO tagging instead of BIOES]:
While training I can see the tag dictionary, and although it has my tags, it also shows BIOES tags while my dataset only has BIO tags.
@wangxinyu0922
Resource usage
Batch size 1
@AtharvanDogra
Issue 1: The target_dir should be the dataset you want to parse, not the output dir. By the way, the default output dir is outputs.
Issue 2: It seems there is an OOM on system RAM. I suggest removing some of the embeddings, for example the backward flair and multilingual flair embeddings, since each embedding has a hidden size of 2048.
Issue 3: The code automatically converts BIO format into BIOES format, and it only accepts BIO formatting as input. Therefore you need not worry about that if the input dataset is in BIO format.
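For reference, the BIO-to-BIOES conversion described above can be sketched as follows (a minimal re-implementation for illustration, not ACE's actual function):

```python
# Sketch of a BIO -> BIOES conversion, mirroring what the training code
# is said to do internally (hypothetical helper, not ACE's own code).
def bio_to_bioes(tags):
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        continues = nxt == "I-" + label  # does the entity span keep going?
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:  # "I"
            out.append(("I-" if continues else "E-") + label)
    return out

print(bio_to_bioes(["B-PER", "I-PER", "O", "B-LOC"]))
# -> ['B-PER', 'E-PER', 'O', 'S-LOC']
```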
I'll give the suggestions for 1 and 2 a try right now and report back. Regarding the 3rd, I need the predictions in BIO only; will it produce the E-/S- tags while predicting? @wangxinyu0922
@wangxinyu0922
After changing the target_dir for ISSUE 1, the following error occurs:
Do I need to replace all these embeddings with my newly trained model,
like this:
When I do that, this error occurs:
Otherwise, keeping the embeddings the same, the 1st error occurs.
@AtharvanDogra
For Issue 1, as described in readme.md, you need to change line 232 to modify the column_format to follow your input dataset. I think your input dataset (target_dir) possibly has four columns; therefore you may change column_format={0: 'text', 1:'ner'} into column_format={0: 'text', 1:'pos', 2:'pos', 3:'ner'}.
For your second question, you need to replace the embedding model when you have newly trained embeddings, but you don't need to do that if you just used these embeddings to train an ACE model and just want to test the trained model.
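Before editing column_format, it can help to confirm how many columns the target file actually has. A small helper like this could do it (a hypothetical helper, not part of the ACE codebase):

```python
import os
import tempfile

# Count whitespace-separated columns in the first non-empty data line,
# so the column_format dict in the parsing code matches the file layout.
def count_columns(path):
    with open(path) as f:
        for line in f:
            if line.strip():
                return len(line.split())
    return 0

# Tiny demo with a four-column CoNLL-style line.
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("EU NNP B-NP B-ORG\n")
tmp.close()
print(count_columns(tmp.name))  # -> 4
os.unlink(tmp.name)
```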
The 2nd and 3rd columns in my dataset are just two underscores, so can I just remove the underscores and keep the column_format as it is? Because for the prediction I also just need the tags.
And is the change in line 232 required for training as well, or just for prediction? @wangxinyu0922
@AtharvanDogra Sure, you can remove the underscores for the prediction. Line 232 is just for prediction.
This is line 232
This is where the change might be required. I'll give it a try.
Sorry for asking so many questions, but I have only 10 hours left to somehow produce a good prediction, and I have tried a lot of things 😢
Oh sorry, it should be this line. Alternatively, you can use two-column formatting for prediction.
CUDA_VISIBLE_DEVICES=0 python train.py --config config/copyConfig.yaml --parse --target_dir resources/tasks/conll_03_english --keep_order
The command line I am running.
When no outputs folder is created:
When I've created an outputs folder:
@wangxinyu0922
@AtharvanDogra Possibly the file name is too long (Linux does not allow file names longer than 255 characters). I suggest modifying line 341; for example, remove config.config['model_name'] from the out_path. Note that if you remove this term, the output file name will be the same if you parse multiple datasets with the same model.
@wangxinyu0922 First of all, thanks a lot for helping me and bringing me this far 🥲
1st ISSUE
One remaining issue: the tagging is still BIOES while my dataset is BIO-tagged. I saw a lot of lines throughout the code that mention the BIOES tagger; in order to change the tagging scheme, will I have to change them all, or is there a simpler way?
The correct predictions are in the 2nd column. So an alternative I can see is to simply take out the 2nd column, which is what my final submission is, and replace the "E-" tags with "I-" and the "S-" tags with "B-". That's what I could observe. Please tell me which method I should proceed with, and whether the alternative approach would reduce the accuracy of the predictions. I don't have much knowledge about BIOES tagging.
2nd ISSUE The predictions I got were for the train set that I provided and not the testb set, which I suppose is the correct thing to happen. I want to train the model on the train set and the testa set (which is the dev set), and I finally need the predictions on the testb set. How should I proceed?
@AtharvanDogra For your first issue, please replace 'S-' with 'B-' and 'E-' with 'I-'. This process will not affect the model accuracy.
For the second issue, you may use the dev set as both the dev and test set during training. Then you can use the test set for prediction. Note that the model will use the training set for prediction, so you need to modify the name, for example, from test.conll to train.conll.
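The tag replacement suggested above can be done with a small post-processing step (a sketch; the function name is mine):

```python
# Map BIOES predictions back to BIO by replacing the S-/E- prefixes,
# as suggested in the thread. Labels after the dash are left untouched.
def bioes_to_bio(tags):
    mapping = {"S": "B", "E": "I"}
    out = []
    for tag in tags:
        if "-" not in tag:          # e.g. the "O" tag
            out.append(tag)
        else:
            prefix, label = tag.split("-", 1)
            out.append(mapping.get(prefix, prefix) + "-" + label)
    return out

print(bioes_to_bio(["B-PER", "E-PER", "S-LOC", "O"]))
# -> ['B-PER', 'I-PER', 'B-LOC', 'O']
```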
I'll do that for the 1st issue.
And yes, that's the problem for me while predicting (in the 2nd issue). In my case I only have a train and dev set right now, until the test phase begins, so I am using the dev set for both the dev and test sets. When making predictions, should I remove all the other datasets, or just rename the train and testa (dev) sets to testa and testb, and name trainb (the test set) train.txt?
@wangxinyu0922
@AtharvanDogra For prediction, make a new directory for the test set, and then copy your test.txt into the new directory as train.txt, dev.txt and test.txt.
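That directory setup can be sketched like this (all paths are placeholders; adjust to your layout):

```python
# Build a prediction-only dataset directory in which the test file also serves
# as train and dev, since the parser reads all three splits.
import os
import shutil

src = "resources/tasks/my_task/test.txt"     # placeholder: your held-out file
new_dir = "resources/tasks/my_task_predict"  # placeholder: directory for --target_dir
os.makedirs(os.path.dirname(src), exist_ok=True)
if not os.path.exists(src):                  # demo stub so the sketch runs standalone
    open(src, "w").close()
os.makedirs(new_dir, exist_ok=True)
for name in ("train.txt", "dev.txt", "test.txt"):
    shutil.copy(src, os.path.join(new_dir, name))
print(sorted(os.listdir(new_dir)))  # -> ['dev.txt', 'test.txt', 'train.txt']
```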
@wangxinyu0922 Thank you very much for all your help. I've finally made a successful submission. I'll keep tweaking it for the next few hours, trying to get a better result on my task. I had to ask so many questions since this was my first task in NER, and even NLP as a whole; I'm sorry for that.
(the screenshot is from bert-en-ner-finetune.yaml)
These individual files are not available in the config folder. I am looking at the fine-tuning part and trying to fine-tune bert-large. How should I proceed?
@wangxinyu0922
Hi,
I've been following the instructions in your tutorial to try to replicate your experimental results.
I've installed requirements.txt and also the transformers lib.
Then I downloaded the pretrained model conll_en_ner_model.zip from the OneDrive link you supplied and extracted it into the directory resources/taggers.
Then, using the command you supplied:
I get the following result:
I downloaded and extracted the CoNLL dataset as specified in the link in the error, but I keep getting the same error. Could you please tell me what I'm doing wrong?