How to reproduce the results on MedMentions dataset?

dmis-lab / BioSyn

ACL'2020: Biomedical Entity Representations with Synonym Marginalization

https://arxiv.org/abs/2005.00239

MIT License


How to reproduce the results on MedMentions dataset? #6

Closed amirj closed 3 years ago

amirj commented 3 years ago

Could you please provide more information on how to train and evaluate BioSyn on the MedMentions dataset?

mjeensung commented 3 years ago

Hi amirj,

I'm sorry, but we don't plan to provide scripts for MedMentions yet. However, as long as you preprocess the dataset the way our scripts do for the other datasets (e.g., NCBI-disease), you will be able to train and evaluate on it. If you need help with this, please feel free to ask.

Best regards, Mujeen

amirj commented 3 years ago

Thanks @mjeensung. Could you provide a step-by-step guide that points to the corresponding scripts for preprocessing, training, and evaluating with UMLS and the MedMentions dataset? I appreciate your help.

mjeensung commented 3 years ago

I recommend following the documentation on how to preprocess the NCBI-disease dataset (https://github.com/dmis-lab/BioSyn/tree/master/preprocess).

As for the dictionary, you need to convert UMLS into a list of CUIs and their names, one entry per line, such as "D000853||anophthalmos". (Please see train_dictionary.txt in the NCBI-disease dataset for the expected output.)
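To make the dictionary format concrete, here is a minimal sketch of such a conversion. It is not part of the BioSyn repository. It assumes the UMLS names come from MRCONSO.RRF, which is pipe-delimited with the CUI in the first field, the language in the second, and the name string in the fifteenth. The function name `umls_to_dictionary`, the English-only filter, and the lowercasing are assumptions made for illustration.

```python
def umls_to_dictionary(mrconso_lines, language="ENG"):
    """Yield 'CUI||name' lines from MRCONSO.RRF rows, skipping duplicates.

    Hypothetical helper, not a BioSyn function.
    """
    seen = set()
    for line in mrconso_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 15:
            continue  # not a full MRCONSO row
        cui, lat, name = fields[0], fields[1], fields[14]
        if lat != language:
            continue  # assumption: keep names in one language only
        name = name.strip().lower()  # assumption: lowercase like the NCBI files
        entry = f"{cui}||{name}"
        if name and entry not in seen:
            seen.add(entry)
            yield entry


if __name__ == "__main__":
    # Made-up rows in MRCONSO layout, for illustration only.
    rows = [
        "C0000001|ENG|P|L1|PF|S1|Y|A1||||MSH|MH|D000001|Example Name|0|N||",
        "C0000001|FRE|P|L2|PF|S2|Y|A2||||MSH|MH|D000001|Nom Exemple|0|N||",
    ]
    for entry in umls_to_dictionary(rows):
        print(entry)
```

With a real UMLS release, the rows would come from iterating over the opened MRCONSO.RRF file, and the output lines would be written to the dictionary file.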

As for the train/test data, you need to convert the MedMentions dataset into a list of mention records such as "23402||77|94||SpecificDisease||neonatal jaundice||D007567". (Please see any file in the processed_train folder of the NCBI-disease dataset for the expected output.)
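A rough sketch of the mention conversion follows, again not taken from the BioSyn repository. It assumes MedMentions is in PubTator format, where each annotation line has six tab-separated fields (PMID, start, end, mention text, semantic type IDs, entity ID), and that the entity ID may carry a "UMLS:" prefix. The function name `medmentions_to_mentions` is hypothetical. The mention text is left unchanged here; lowercasing and abbreviation resolution would be a separate step.

```python
def medmentions_to_mentions(pubtator_lines):
    """Yield 'pmid||start|end||type||mention||cui' lines.

    Hypothetical helper; assumes PubTator-style annotation lines.
    """
    for line in pubtator_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 6:
            continue  # skips 'pmid|t|...', 'pmid|a|...' and blank lines
        pmid, start, end, mention, sem_types, entity_id = parts
        if entity_id.startswith("UMLS:"):
            entity_id = entity_id[len("UMLS:"):]  # assumption: optional prefix
        yield f"{pmid}||{start}|{end}||{sem_types}||{mention}||{entity_id}"


if __name__ == "__main__":
    # Made-up document in PubTator layout, for illustration only.
    sample = [
        "12345|t|Example title",
        "12345|a|Example abstract.",
        "12345\t0\t7\tExample\tT000\tUMLS:C0000001",
        "",
    ]
    for record in medmentions_to_mentions(sample):
        print(record)
```

The NCBI-disease files appear to be split into one file per document, so grouping the output by PMID before writing may be needed to match that layout.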

amirj commented 3 years ago

Thanks @mjeensung. It seems that there are three dictionaries for the NCBI-disease dataset: (train/dev/test)_dictionary.txt

As you mentioned, I need to convert the UMLS aliases into this format so that each alias maps to its corresponding CUI. What is the difference between train_dictionary.txt, dev_dictionary.txt, and test_dictionary.txt?

What would happen if I created one big dictionary from UMLS and reused it for train/dev/test?

amirj commented 3 years ago

Another question: should the dataset files (including dictionaries and query files) be pre-processed (lowercased, etc.) beforehand, or does BioSyn handle that?

It seems that both the dictionaries and the query files are pre-processed in the provided NCBI dataset. Judging from the source code, however, BioSyn pre-processes dictionaries but not query files.

mjeensung commented 3 years ago

The difference between train_dictionary.txt and dev_dictionary.txt is that dev_dictionary.txt contains the mentions from the train queries in addition to the original dictionary, which increases coverage. Likewise, test_dictionary.txt contains the mentions from both the train and dev queries.
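One way to picture this is the sketch below, which appends the mentions of already-seen queries to a base dictionary. It is an illustration under assumptions, not the script used to build the released files: the helper name `augment_dictionary` is hypothetical, and it assumes mention lines in the "pmid||start|end||type||mention||id" format and dictionary lines in the "id||name" format.

```python
def augment_dictionary(dictionary_lines, mention_lines):
    """Return dictionary entries plus 'id||mention' entries from queries.

    Hypothetical helper; keeps the original order and drops duplicates.
    """
    entries, seen = [], set()

    def add(entry):
        if entry and entry not in seen:
            seen.add(entry)
            entries.append(entry)

    for line in dictionary_lines:
        add(line.strip())
    for line in mention_lines:
        fields = line.strip().split("||")
        if len(fields) != 5:
            continue  # not a well-formed mention line
        mention, concept_id = fields[3], fields[4]
        add(f"{concept_id}||{mention}")
    return entries


if __name__ == "__main__":
    train_dictionary = ["D000853||anophthalmos"]
    train_mentions = ["23402||77|94||SpecificDisease||neonatal jaundice||D007567"]
    dev_mentions = []  # would hold the dev query lines

    dev_dictionary = augment_dictionary(train_dictionary, train_mentions)
    test_dictionary = augment_dictionary(dev_dictionary, dev_mentions)
    print(dev_dictionary)
```

Under this reading, the train dictionary is the plain concept list, and each later split sees the base dictionary plus the mentions from the splits before it.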

Query files should also be pre-processed (e.g., lowercasing, abbreviation resolution) before they are passed to BioSyn. (Please see https://github.com/dmis-lab/BioSyn/blob/master/preprocess/query_preprocess.py)