dmis-lab / biobert

Bioinformatics'2020: BioBERT: a pre-trained biomedical language representation model for biomedical text mining
http://doi.org/10.1093/bioinformatics/btz682

For a custom RE dataset with entities marked in advance #166

Open KennyNg-19 opened 3 years ago

KennyNg-19 commented 3 years ago

Hi, as a beginner I would like to ask some naive questions about fine-tuning on a custom RE dataset with entities marked in advance:

1) Do we need to constrain which entity markers or dummy words are used when marking the entities for the BioNLP task (e.g., @DISEASE$, [e] some disease [/e])?

2) When preprocessing, do we need to add code to help the model tokenize the entity markers? E.g., if we use [E1] to mark an entity, should we let the tokenizer know about it:

tokenizer.add_tokens(['[E1]', '[/E1]', '[E2]', '[/E2]', '[BLANK]'])

Hi Chloe,

Yes, you need to specify task_name. If your dataset is a binary classification task, you can use either of them; euadr and gad are processed in the same way (using BioBERTProcessor). https://github.com/dmis-lab/biobert/blob/37599fb978e3b584a6e9aa9abca1f38588bfff4f/run_re.py#L914-L917
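For context, run_re.py follows the DataProcessor pattern from Google's BERT run_classifier.py, which it is derived from. Below is a minimal sketch of what a binary RE processor in that style looks like; the class name, the import path, and the one-sentence-plus-label TSV layout are illustrative assumptions, not the repo's exact code:

import os

# DataProcessor and InputExample are the base classes from BERT's
# run_classifier.py; the import path below is an assumption.
from run_re import DataProcessor, InputExample

class MyBinaryREProcessor(DataProcessor):
    # Illustrative processor for a binary RE dataset stored as TSV
    # files with one "sentence<TAB>label" pair per line.

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        # Binary classification, as for euadr and gad
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        examples = []
        for i, line in enumerate(lines):
            guid = "%s-%d" % (set_type, i)
            examples.append(InputExample(
                guid=guid, text_a=line[0], text_b=None, label=line[1]))
        return examples

A processor like this is then registered in the processors mapping in run_re.py and selected via --task_name.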

Please note, however, that the chemprot dataset is a multi-class classification task. Hence it is processed differently, and the same holds for its evaluation script.
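On the processor side, the most visible difference for a multi-class task is the label set returned by get_labels(); something along these lines (the exact label strings in the repo may differ, these follow the ChemProt relation classes evaluated in the paper):

def get_labels(self):
    # Multi-class: five ChemProt relation classes plus a
    # "false" (no relation) class, instead of binary {0, 1}
    return ["cpr:3", "cpr:4", "cpr:5", "cpr:6", "cpr:9", "false"]

The evaluation script then has to score over these relation classes (typically micro-averaged precision/recall/F1) rather than a single positive class.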
Thank you for your interest in our work! Best, WonJin

wonjininfo commented 2 years ago

Hi Kenny, Thank you for your interest in our paper and my apologies for the delay in response.

You can use any entity markers or dummy words, but please refrain from using common words. In my case, I used synthetic words like ENToGENEoMK. You need to register these words in the vocab; see the vocab.txt instructions after the example below.
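As a concrete illustration of the marking step, here is a small helper that replaces two pre-identified entity spans in a token list with synthetic dummy words; the function name and marker strings are made up for this example:

def mark_entities(tokens, e1_span, e2_span,
                  e1_marker="ENToGENEoMK", e2_marker="ENToDISEASEoMK"):
    # e1_span / e2_span are (start, end) token offsets, end exclusive.
    # Each span is collapsed into one synthetic marker word that is
    # unlikely to collide with real vocabulary entries.
    spans = {e1_span[0]: (e1_span[1], e1_marker),
             e2_span[0]: (e2_span[1], e2_marker)}
    out, i = [], 0
    while i < len(tokens):
        if i in spans:
            end, marker = spans[i]
            out.append(marker)
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return out

# "mutations in BRCA1 are associated with breast cancer"
#   -> "mutations in ENToGENEoMK are associated with ENToDISEASEoMK"
mark_entities("mutations in BRCA1 are associated with breast cancer".split(),
              e1_span=(2, 3), e2_span=(6, 8))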

In order to add a custom token to the tokenizer (for this repo, i.e. the TensorFlow version), you need to modify vocab.txt. If you open the vocab.txt file, you will see reserved [unusedN] tokens at the beginning; you can replace these with your custom tokens. Your code tokenizer.add_tokens(['[E1]', '[/E1]', '[E2]', '[/E2]', '[BLANK]']) is for the HuggingFace framework; please check https://github.com/dmis-lab/biobert-pytorch for the PyTorch-HuggingFace version of the code.
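A minimal sketch of the vocab.txt route for the TensorFlow code in this repo (the file path is illustrative; BERT vocab files reserve a block of [unusedN] slots precisely for custom tokens):

custom_tokens = ["[E1]", "[/E1]", "[E2]", "[/E2]", "[BLANK]"]

with open("vocab.txt", encoding="utf-8") as f:
    vocab = f.read().splitlines()

# Overwrite the first reserved [unusedN] slots with the custom markers
replacements = iter(custom_tokens)
for i, token in enumerate(vocab):
    if token.startswith("[unused"):
        nxt = next(replacements, None)
        if nxt is None:
            break
        vocab[i] = nxt

with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")

For the HuggingFace route, add_tokens needs to be paired with resizing the model's embedding matrix; a sketch using one of the BioBERT checkpoints published on the Hub:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

tokenizer.add_tokens(["[E1]", "[/E1]", "[E2]", "[/E2]", "[BLANK]"])
model.resize_token_embeddings(len(tokenizer))  # grow embeddings to match the new vocab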

Thank you, and once again, sorry for the delayed response.