Training of Mention Detection Model (BERT-NER)

gourango01 commented 2 years ago

I would like to train the mention detection model on the GrailQA dataset, which can be done with /entity_linker/BERT_NER/run_ner.py. However, that script expects the GrailQA dataset in CoNLL-2003 format (i.e., question tokens tagged with the labels ["B", "I", "O", "[CLS]", "[SEP]"]), and this processed data is not present in the repository. Could you please share the processed GrailQA dataset (train and valid splits) or the script for converting the dataset into the required format? Also, could you confirm that the BERT-NER model was trained with the parameters given in /entity_linker/BERT_NER/run_ner.py?

entslscheia commented 2 years ago

Hi @gourango01,

I have uploaded the file we used to train our NER model on GrailQA here. The logs for this part of the experiments are incomplete, and I am currently unable to find the data processing script. However, the conversion should be fairly straightforward: you can use the friendly_name fields in our dataset as gold mentions for entities. Meanwhile, I'll check the old server I used two years ago to see whether I can find the script and more logs. I'll let you know what I find.
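For readers attempting the conversion, here is a minimal sketch of the idea, not the script used by the authors. It assumes each GrailQA example has a `question` string and a `graph_query.nodes` list whose entries carry `node_type` and `friendly_name`; it also assumes whitespace tokenization, case-insensitive exact matching, and that only nodes of type `entity` count as mentions. The helper names `entity_mentions` and `bio_tags` are hypothetical. The sketch emits only B/I/O tags; how `[CLS]` and `[SEP]` are handled by run_ner.py is not covered here.

```python
from typing import Dict, List, Tuple


def entity_mentions(example: Dict) -> List[str]:
    """Collect friendly_name values of entity nodes (assumed field layout)."""
    nodes = example.get("graph_query", {}).get("nodes", [])
    return [n["friendly_name"] for n in nodes if n.get("node_type") == "entity"]


def bio_tags(question: str, mentions: List[str]) -> List[Tuple[str, str]]:
    """Tag whitespace tokens of `question` with B/I/O via exact mention matches."""
    tokens = question.split()
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    # Longest mentions first, so a short mention cannot claim part of a longer one.
    for mention in sorted(mentions, key=lambda m: -len(m.split())):
        m_tokens = mention.lower().split()
        n = len(m_tokens)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span_free = all(tag == "O" for tag in tags[start:start + n])
            if span_free and lowered[start:start + n] == m_tokens:
                tags[start] = "B"
                tags[start + 1:start + n] = ["I"] * (n - 1)
    return list(zip(tokens, tags))


# Toy example in the assumed layout (not taken from the dataset).
example = {
    "question": "which album did taylor swift release",
    "graph_query": {
        "nodes": [
            {"node_type": "class", "friendly_name": "Musical Album"},
            {"node_type": "entity", "friendly_name": "Taylor Swift"},
        ]
    },
}
for token, tag in bio_tags(example["question"], entity_mentions(example)):
    print(token, tag)
```

Mentions that differ from the question text in punctuation or spelling will not match under this scheme, so the number of tagged questions can be lower than the number of questions that have entity nodes.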

Feel free to let me know if you have any further concerns.

Best, Yu

gourango01 commented 2 years ago

Thanks for your quick response. The file you shared contains 37238 questions drawn from the train, dev, and test splits of the GrailQA dataset. But ground-truth entities are not available for the test questions, so how can test questions with entity mentions be present in that file? Was the NER model trained on questions from all three splits, or am I missing something?

Following your suggestion to use the friendly_name fields in the GrailQA dataset as gold mentions for entities, I found 36954 of the 44337 questions in the train set with at least one mentioned entity, and 5609 of the 6763 questions in the dev set. In total, that is 42563 questions from the train and dev sets with at least one mentioned entity, which is more than 37238. Am I missing something?
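A count like the one above could be reproduced along the following lines. This is only a sketch under assumptions: the field names (`question`, `graph_query.nodes`, `node_type`, `friendly_name`) are the assumed GrailQA layout, the helper `count_with_mentions` is hypothetical, and "at least one mentioned entity" is read here as "some entity's friendly_name occurs in the question, ignoring case". A different reading (for example, counting every question that has an entity node) would give different numbers.

```python
from typing import Dict, Iterable, Tuple


def count_with_mentions(examples: Iterable[Dict]) -> Tuple[int, int]:
    """Return (questions containing at least one entity friendly_name, total questions)."""
    with_mention = 0
    total = 0
    for ex in examples:
        total += 1
        question = ex["question"].lower()
        nodes = ex.get("graph_query", {}).get("nodes", [])
        names = [
            n["friendly_name"].lower()
            for n in nodes
            if n.get("node_type") == "entity"
        ]
        # Substring match; empty names are skipped so they never count as a hit.
        if any(name and name in question for name in names):
            with_mention += 1
    return with_mention, total


# Toy data in the assumed layout (not taken from the dataset).
toy = [
    {"question": "who founded apple inc.",
     "graph_query": {"nodes": [{"node_type": "entity", "friendly_name": "Apple Inc."}]}},
    {"question": "how many rivers are there",
     "graph_query": {"nodes": [{"node_type": "class", "friendly_name": "River"}]}},
]
print(count_with_mentions(toy))
```

To run it on the real splits, load each JSON file with `json.load` and pass the resulting list to the function; the per-split counts can then be summed and compared with the size of the shared file.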

entslscheia commented 2 years ago

> Thanks for your quick response. The file you shared contains 37238 questions drawn from the train, dev, and test splits of the GrailQA dataset. But ground-truth entities are not available for the test questions, so how can test questions with entity mentions be present in that file?

Thanks for pointing this out! The file I previously uploaded was indeed a mistake; it was based on training data generated from an earlier train/dev/test split, which differs from the version we finally released to the public. I've removed that file and replaced it with the right one.

> Following your suggestion to use the friendly_name fields in the GrailQA dataset as gold mentions for entities, I found 36954 of the 44337 questions in the train set with at least one mentioned entity, and 5609 of the 6763 questions in the dev set. In total, that is 42563 questions from the train and dev sets with at least one mentioned entity, which is more than 37238. Am I missing something?

I think the numbers are reasonable. You can also try the script I used to generate the NER training data and compare its output with yours.

Also, we kindly request that you do not distribute the previous file to others, and that you delete it from your side if possible. That file could potentially lead to data leakage issues. Really sorry about the mistake, and thank you for understanding!

Feel free to let me know if you have any further questions.

Best, Yu

gourango01 commented 2 years ago

Yes, the numbers match now. And I will not share the previous file with others. Thanks.

dki-lab / GrailQA

Training of Mention Detection Model (BERT-NER) #16