Request for preprocessing script

dmis-lab / BioSyn

ACL'2020: Biomedical Entity Representations with Synonym Marginalization

https://arxiv.org/abs/2005.00239

MIT License


Request for preprocessing script #4

Closed: ArnaudFerre closed this issue 3 years ago

ArnaudFerre commented 3 years ago

Hi,

I would like to reproduce the BioSyn results on the NCBI Disease Corpus (and to do an ablation study). I was able to use your core method (plus lowercasing) on this corpus (and some others), but without the resolution of composite mentions and acronyms, I only reach a top-1 accuracy of about 0.801, compared with your published 0.911.

Could you please send me a pre-processing script?

Kind regards, Arnaud

ArnaudFerre commented 3 years ago

I also have a minor question: how do you calculate the score for mentions that should be normalized to more than one concept? (I find that these represent only 0.4% of cases in the NCBI Disease Corpus.) It seems that you give 1 point if the predicted concept is one of the correct ones, right?

mjeensung commented 3 years ago

Hi, we're planning to upload preprocessing code soon.

As for mentions that have more than one concept, there are two cases.

Case 1) If it's a composite mention, we first split it into single mentions and consider it correct if the predictions for all single mentions are correct.

Case 2) If it's not a composite mention, yes, we consider it correct if one of the concepts is correctly predicted.
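The two cases above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the BioSyn evaluation code: the function names, the data layout (each mention as a list of `(prediction, gold_set)` pairs, one pair per single mention after splitting) and the placeholder concept IDs are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical layout: one (top-1 prediction, set of gold concept IDs) pair
# per single mention. A composite mention has several pairs after splitting;
# a non-composite mention has exactly one pair.
SubMention = Tuple[str, Set[str]]
Mention = List[SubMention]


def is_mention_correct(mention: Mention) -> bool:
    """Apply the two cases described in the thread.

    Case 1 (composite): every split single mention must be predicted correctly.
    Case 2 (non-composite, several gold concepts): the prediction only needs
    to match one of the gold concepts.
    """
    if not mention:
        # Assumption: a mention with no sub-mentions counts as incorrect.
        return False
    return all(pred in gold for pred, gold in mention)


def top1_accuracy(mentions: List[Mention]) -> float:
    """Fraction of mentions judged correct; 0.0 for an empty list."""
    if not mentions:
        return 0.0
    return sum(is_mention_correct(m) for m in mentions) / len(mentions)


# Placeholder concept IDs, not real identifiers from any ontology.
example_mentions: List[Mention] = [
    [("C1", {"C1"})],                    # single mention, correct
    [("C2", {"C2", "C3"})],              # case 2: matches one gold concept
    [("C4", {"C4"}), ("C9", {"C5"})],    # case 1: one part wrong, so incorrect
    [("C6", {"C6"}), ("C7", {"C7"})],    # case 1: all parts correct
]

print(top1_accuracy(example_mentions))
```

Representing a non-composite mention as a single pair with a multi-element gold set lets one `all(...)` check cover both cases.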

ArnaudFerre commented 3 years ago

Hi,

Thank you for your answer.

Yes, I did indeed have doubts about case 2.

Sorry to ask, but do you have a more precise estimate of when the preprocessing script will be uploaded? Given your results, I would like to include BioSyn in my study, but unfortunately I have deadlines to meet...

If you can't quickly provide this script in a state that suits you, is there another possibility? For example, what does query_preprocess.py do for the TAC-ADR-2017 data? Only acronym resolution (plus lowercasing and punctuation removal)? In the worst case, I can try to use Ab3P directly, and in that case I would just need to know what you used to resolve composite mentions.

Finally, I would like your feedback on the plausibility of the accuracies I gave you. Did you also observe such a gain from the acronym and composite-mention resolution you applied?

Kind regards, Arnaud

mjeensung commented 3 years ago

Hello, thank you for your patience.

I just uploaded the preprocessing scripts for the NCBI-disease dataset.

I've observed that abbreviation resolution and composite-mention resolution are very important steps for normalizing mentions well.

ArnaudFerre commented 3 years ago

Hi, Thank you very much for the scripts and for your answer. Kind regards, Arnaud