Alibaba-NLP / ACE

[ACL-IJCNLP 2021] Automated Concatenation of Embeddings for Structured Prediction

Issue with replicating results on CoNLL #13

Closed · Dimiftb closed this 2 years ago

Dimiftb commented 3 years ago

Hi,

I've been following the instructions on your tutorial to try and replicate your experimental results.

I've installed requirements.txt and the transformers library.

Then I downloaded the pretrained model conll_en_ner_model.zip from the OneDrive link you've supplied and extracted it into the directory resources/taggers.

Then, running the command you've supplied:

[screenshot of the command]

gives me the following result:

[screenshot of the error]

I downloaded and extracted the CoNLL dataset as specified in the link in the error, but I keep getting the same error. Could you please tell me what I am doing wrong?

wangxinyu0922 commented 3 years ago

Hi, it seems that the dataset is not in the directory /root/.flair/datasets/conll_03. Have you put the dataset in that directory?

Dimiftb commented 3 years ago

Hi @wangxinyu0922,

Thanks for your reply. I followed the instructions in https://github.com/flairNLP/flair/blob/master/resources/docs/EXPERIMENTS.md:

I installed flair with pip install and set the location of the dataset using the command below:

[screenshot of the command]

Unfortunately, I'm faced with the same error. I believe the folder /root/.flair/datasets/conll_03 is within the flair package; however, your model seems to use a modified version of flair.

In addition, the instructions don't specify anywhere that I should download the CoNLL dataset, compile it, and put it in a specific place. Are those instructions missing, or am I just not finding them?

After installing the requirements, I'm following https://github.com/Alibaba-NLP/ACE#pretrained-models.

wangxinyu0922 commented 3 years ago

Can you check whether the data is actually in this directory? For example, you may use the Linux command ls /root/.flair/datasets/conll_03. I suspect the dataset is not in the right place.

By the way, we also provide the dataset (ner_data.zip) on OneDrive in the guide. You may download the dataset again and put it at /root/.flair/datasets/conll_03.
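
For anyone following along, a minimal sketch of that placement, assuming ner_data.zip has already been downloaded to the working directory (the archive layout is an assumption; you may need to move files so the CoNLL splits sit directly under conll_03):

```python
import os
import zipfile

# Target directory the error message refers to.
dataset_dir = "/root/.flair/datasets/conll_03"
os.makedirs(dataset_dir, exist_ok=True)

# Extract the downloaded archive into that directory. If the zip nests the
# splits inside a subfolder, move them up so they sit directly in conll_03.
with zipfile.ZipFile("ner_data.zip") as zf:
    zf.extractall(dataset_dir)
```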

Dimiftb commented 3 years ago

I'm using Colab to replicate the results. The .flair folder is within the flair package, I believe. Flair is not even installed after running pip install -r requirements.txt. Do I have to install it in order to run the code?

wangxinyu0922 commented 3 years ago

You don't need to install flair. Again, please check whether the dataset is in the right directory.

If you cannot check that, you may instead manually set the dataset folder following this guide.

Dimiftb commented 3 years ago

Hi @wangxinyu0922,

I managed to resolve the issue, thank you very much. I still haven't successfully replicated the results, but if I need more help I'll open another issue. Closing this now.

AtharvanDogra commented 2 years ago

Hi @wangxinyu0922 @Dimiftb, can you please tell me how to resolve the issue? I'm facing the same one.

wangxinyu0922 commented 2 years ago

> Hi @wangxinyu0922, can you please tell me how to resolve the issue? I'm facing the same one.

Hi @AtharvanDogra, if your dataset is missing, you may follow https://github.com/Alibaba-NLP/ACE#train-on-your-own-dataset to manually set the path to your conll_03 dataset.

AtharvanDogra commented 2 years ago

Hi @wangxinyu0922,
I have tried it:

ner:
  Corpus: CONLL_03
  ColumnCorpus-1:
    data_folder: resources/tasks/conll_03_english
    column_format:
        0: text
        1: pos
        2: chunk
        3: ner
    tag_to_bioes: ner
  tag_dictionary: resources/taggers/ner_tags.pkl

This is what I've written in the yaml file, and it's still giving the same error: unable to find the dataset.

Please tell me where I am going wrong.

AtharvanDogra commented 2 years ago

@wangxinyu0922 @Dimiftb I've created my own config file with a new directory, and it's still giving the same error. What shall I do?

wangxinyu0922 commented 2 years ago

> @wangxinyu0922 @Dimiftb I've created my own config file with a new directory, and it's still giving the same error. What shall I do?

@AtharvanDogra, you need a config file like this:

ner:
  Corpus: ColumnCorpus-1
  ColumnCorpus-1:
    data_folder: resources/tasks/conll_03_english
    column_format:
        0: text
        1: pos
        2: chunk
        3: ner
    tag_to_bioes: ner
  tag_dictionary: resources/taggers/ner_tags.pkl

The change is from Corpus: CONLL_03 to Corpus: ColumnCorpus-1.

Dimiftb commented 2 years ago

Hi @AtharvanDogra,

the way I resolved the issue is that I simply created the directory /.flair/datasets and put the CoNLL dataset there. If you're using Colab, it'd be /content/.flair/datasets/conll_03.
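
For reference, a one-liner that reproduces this fix on Colab (the path follows the comment above; adjust for other environments):

```python
import os

# Create the dataset cache directory Dimiftb describes, then place the
# CoNLL-03 files inside it.
os.makedirs("/content/.flair/datasets/conll_03", exist_ok=True)
```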

AtharvanDogra commented 2 years ago

@Dimiftb Thanks for your reply. @wangxinyu0922 I was able to get past that error with what you told me.

Right now I'm facing another issue, for which I've seen you've made some changes in response to another issue that was raised.

[screenshot of the error]

Where am I going wrong? I am using the config file provided, as is, with just these changes:

ner:
  Corpus: ColumnCorpus-1
  ColumnCorpus-1:
    data_folder: resources/tasks/conll_03_english
    column_format:
        0: text
        1: pos
        2: chunk
        3: ner
    tag_to_bioes: ner
  tag_dictionary: resources/taggers/ner_tags.pkl

AtharvanDogra commented 2 years ago

The above happens when I run the code to test the model, i.e. with the --test flag. When I try to train a model using CUDA_VISIBLE_DEVICES=0 python train.py --config config/copyConfig.yaml, it keeps crashing due to lack of RAM. I've reduced the mini-batch size down to 2 and still hit the problem: [screenshot of the crash]

wangxinyu0922 commented 2 years ago

> @Dimiftb Thanks for your reply bro. @wangxinyu0922 I was able to get past that error with what you told me.
>
> Right now I'm facing another issue, for which I've seen you've made some changes in response to another issue that was raised.
>
> [screenshot of the error]
>
> Where am I going wrong? I am using the config file provided, as is, with just these changes: [config as above]

You need to download xlm-roberta-large-finetuned-conll03-english and put it at resources/.
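
One way to fetch that checkpoint is through the Hugging Face Hub (a sketch; the repo id is the Hub copy of the model, and the local directory name is an assumption that should match your config):

```python
from huggingface_hub import snapshot_download

# Download the fine-tuned CoNLL-03 checkpoint into resources/.
snapshot_download(
    repo_id="xlm-roberta-large-finetuned-conll03-english",
    local_dir="resources/xlm-roberta-large-finetuned-conll03-english",
)
```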

AtharvanDogra commented 2 years ago

@wangxinyu0922 ISSUE 1 [Prediction on the data set provided by you]:

[screenshot of the error] Right now the issue I'm facing while predicting is this: the train, test, and dev sets become None even though I've provided the correct path. [screenshot of the log] When the prediction starts there is no such issue, the dataset files are read correctly, but after loading the models it shows this error.

[screenshot of the traceback]

Line 867 in datasets.py, as given in the error message: [screenshot]. Line 93: [screenshot]. From line 93, the call goes into this class, and the Path variable here receives None.

ISSUE 2 [Training a model on my own set of embeddings and dataset]: The process gets killed after bert-large-uncased appears in the training log. I've reduced the batch size to 1. I am using Colab Pro+ with 50 GB of RAM and a V100 GPU.

[screenshots of the training log and crash]

ISSUE 3 [need BIO tagging instead of BIOES]: While training I can see the tag dictionary, and although it has my tags, it also shows BIOES tags while my dataset only has BIO tags. [screenshot of the tag dictionary]

AtharvanDogra commented 2 years ago

@wangxinyu0922 Resource usage at batch size 1: [screenshot]

wangxinyu0922 commented 2 years ago

> Right now the issue I'm facing while predicting is this: the train, test, and dev sets become None even though I've provided the correct path.

@AtharvanDogra
Issue 1: The target_dir should be the dataset you want to parse, not the output dir. By the way, the default output dir is outputs.
Issue 2: It seems there is an OOM on system RAM. I suggest removing some of the embeddings, for example, the backward flair and multilingual flair embeddings, since each embedding has a 2048 hidden size.
Issue 3: The code automatically converts BIO format into BIOES format, and the code only accepts BIO formatting. Therefore you need not worry about it if the input dataset is in BIO format.
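
For Issue 3, the conversion works roughly like this; a sketch of the tagging scheme only, not ACE's actual implementation:

```python
def bio_to_bioes(tags):
    # Convert a BIO tag sequence to BIOES: single-token entities become S-,
    # and entity-final tokens become E-. Sketch only, not ACE's real code.
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        label = tag.split("-", 1)[1]
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label
        if tag.startswith("B-"):
            bioes.append(("B-" if continues else "S-") + label)
        else:  # an I- tag
            bioes.append(("I-" if continues else "E-") + label)
    return bioes

# ["B-PER", "I-PER", "B-LOC", "O"] -> ["B-PER", "E-PER", "S-LOC", "O"]
print(bio_to_bioes(["B-PER", "I-PER", "B-LOC", "O"]))
```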

AtharvanDogra commented 2 years ago

> Issue 1: The target_dir should be the dataset you want to parse, not the output dir. By the way, the default output dir is outputs. Issue 2: It seems there is an OOM on system RAM. I suggest removing some of the embeddings, for example, the backward flair and multilingual flair embeddings, since each embedding has a 2048 hidden size. Issue 3: The code automatically converts BIO format into BIOES format, and the code only accepts BIO formatting. Therefore you need not worry about it if the input dataset is in BIO format.

I'll try the suggestions for 1 and 2 right now and report back. Regarding the 3rd, I need the predictions in BIO as well; will it produce E- and S- tags while predicting? @wangxinyu0922

AtharvanDogra commented 2 years ago

@wangxinyu0922

After changing the target_dir for ISSUE 1, the following error occurs: [screenshot of the error]

Do I need to replace all these embeddings with my newly trained model: [screenshot of the config]

like this: [screenshot of the modified config]

When I do that, this error occurs: [screenshot of the error]

Otherwise, keeping the embeddings the same, the first error occurs.

wangxinyu0922 commented 2 years ago

@AtharvanDogra

For Issue 1, as described in the readme.md, you need to change line 232 to modify the column_format to follow your input dataset. I think your input dataset (target_dir) possibly has four columns; therefore you may change column_format={0: 'text', 1: 'ner'} into column_format={0: 'text', 1: 'pos', 2: 'pos', 3: 'ner'}.

For your second question: you need to replace the embedding model when you have newly trained embeddings, but you don't need to do that if you just used these embeddings to train an ACE model and only want to test the trained model.

AtharvanDogra commented 2 years ago

> For Issue 1, as described in the readme.md, you need to change line 232 to modify the column_format to follow your input dataset. I think your input dataset (target_dir) possibly has four columns; therefore you may change column_format={0: 'text', 1: 'ner'} into column_format={0: 'text', 1: 'pos', 2: 'pos', 3: 'ner'}.
>
> For your second question: you need to replace the embedding model when you have newly trained embeddings, but you don't need to do that if you just used these embeddings to train an ACE model and only want to test the trained model.

The 2nd and 3rd columns in my dataset are just underscores, so can I just remove the underscores and keep the column_format as it is? For the prediction I only need the tags.

And is the change on line 232 required for training as well, or just for prediction? @wangxinyu0922

wangxinyu0922 commented 2 years ago

@AtharvanDogra Sure, you can remove the underscores for the prediction. Line 232 is just for prediction.

AtharvanDogra commented 2 years ago

[screenshot] This is line 232.

[screenshot] This is where the change might be required. I'll give it a try. Sorry for asking so many questions, but I have the last 10 hours to somehow produce a good prediction and I have tried a lot of things 😢

wangxinyu0922 commented 2 years ago

> [screenshot] This is line 232.
>
> [screenshot] This is where the change might be required.

Oh sorry, it should be this line. Alternatively, you can use two-column formatting for prediction.

AtharvanDogra commented 2 years ago

This is the command I am running:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/copyConfig.yaml --parse --target_dir resources/tasks/conll_03_english --keep_order

[screenshot of the error] when no outputs folder has been created;

[screenshot of the error] when I've created an outputs folder. @wangxinyu0922

wangxinyu0922 commented 2 years ago

@AtharvanDogra Possibly the file name is too long (most Linux filesystems do not allow file names longer than 255 characters). I suggest you modify line 341, for example, remove config.config['model_name'] from the out_path. Note that if you remove this term, the output file name will be the same if you parse multiple datasets with the same model.
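
For illustration, the kind of edit meant here, with hypothetical stand-ins for the variables around line 341 (the real code in train.py differs):

```python
from pathlib import Path

# Hypothetical stand-ins; in train.py these come from the config and parse loop.
model_name = "en-bert_en-elmo_en-flair_multi-flair_word_char"  # long concatenation
file_name = "test.conll"

# Before: embedding the model name can push the file name past the
# 255-character limit most Linux filesystems enforce.
out_path = Path("outputs") / (model_name + "_" + file_name)

# After: drop the model name so the name stays short. Per the note above,
# parsing several datasets with the same model may then reuse one file name.
out_path = Path("outputs") / file_name
```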

AtharvanDogra commented 2 years ago

@wangxinyu0922 First of all, thanks a lot for helping me and bringing me this far 🥲

1st ISSUE: [screenshot] One final issue I'm facing right now is that the tagging is still BIOES while my dataset is BIO-tagged. I saw a lot of lines throughout the code that mention a BIOES tagger, so to change the tagging scheme would I have to change them all, or is there a simpler way?

The correct predictions are in the 2nd column. So an alternative I can see is to simply take out the 2nd column, which is what my final submission is, and replace the "E-" tags with "I-" and the "S-" tags with "B-". That's what I could observe. Please tell me which method I should proceed with, and whether the alternative approach would cause any drop in the accuracy of the predictions. I don't have much knowledge about BIOES tagging.

2nd ISSUE: The predictions I got were for the train set that I provided and not the testb set, which I suppose is the expected behavior. I want to train the model on the train set and the testa set (which is the dev set), and in the end I need the predictions on the testb set. How should I proceed?

[screenshot]

wangxinyu0922 commented 2 years ago

@AtharvanDogra For your first issue, please replace 'S-' with 'B-' and 'E-' with 'I-'. The process will not affect the model accuracy.

For the second issue, you may use the dev set as both the dev and test sets during training. Then you can use the test set for prediction. Note that the model will use the training set for prediction, so you need to modify the name, for example, test.conll to train.conll.
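
A minimal sketch of that S-/E- replacement over a CoNLL-style prediction file (the file names and the tag column position are assumptions):

```python
def bioes_to_bio(tag):
    # Map BIOES prefixes back to BIO: S- becomes B- (entity start),
    # E- becomes I- (entity continuation); B-, I-, and O pass through.
    if tag.startswith("S-"):
        return "B-" + tag[2:]
    if tag.startswith("E-"):
        return "I-" + tag[2:]
    return tag

# Assumed layout: whitespace-separated columns, predicted tag last.
with open("predictions.conll") as fin, open("predictions_bio.conll", "w") as fout:
    for line in fin:
        parts = line.split()
        if parts:  # blank lines separate sentences; pass them through
            parts[-1] = bioes_to_bio(parts[-1])
        fout.write(" ".join(parts) + "\n")
```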

AtharvanDogra commented 2 years ago

> @AtharvanDogra For your first issue, please replace 'S-' with 'B-' and 'E-' with 'I-'. The process will not affect the model accuracy.
>
> For the second issue, you may use the dev set as both the dev and test sets during training. Then you can use the test set for prediction. Note that the model will use the training set for prediction, so you need to modify the name, for example, test.conll to train.conll.

I'll do that for the 1st issue.

And yes, that's the problem for me while predicting (the 2nd issue). In my case I only have a train and a dev set right now, until the test phase begins, so I am using the dev set for both the dev and test sets. While making predictions, should I remove all the other datasets, or just rename the train and testa (dev) files to testa and testb, and rename the testb file (the test set) to train.txt?

@wangxinyu0922

wangxinyu0922 commented 2 years ago

@AtharvanDogra For prediction, make a new directory for the test set, and then copy the test.txt into your new directory as train.txt, dev.txt, and test.txt.
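
A minimal sketch of that setup (the paths and file names here are assumptions):

```python
import os
import shutil

src = "resources/tasks/conll_03_english/test.txt"  # the set you want parsed
new_dir = "resources/tasks/conll_03_predict"       # hypothetical new directory
os.makedirs(new_dir, exist_ok=True)

# The loader expects train/dev/test splits, and prediction runs on the
# training split, so copy the test file into all three slots.
for name in ("train.txt", "dev.txt", "test.txt"):
    shutil.copy(src, os.path.join(new_dir, name))
```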

AtharvanDogra commented 2 years ago

@wangxinyu0922 Thank you very much for all your help. I've finally made a successful submission. I'll keep tweaking it over the next few hours, trying to get a better result on my task. I had to ask so many questions since this was my first task on NER, and on NLP as a whole; I'm sorry for that.

AtharvanDogra commented 2 years ago

[screenshot from bert-en-ner-finetune.yaml] These individual files are not available in the config folder. I am looking at the fine-tuning part and trying to fine-tune bert-large. How should I proceed? @wangxinyu0922