Open KennyNg-19 opened 3 years ago
Hi Kenny, Thank you for your interest in our paper and my apologies for the delay in response.
You can use any entity marker or dummy words, but please avoid common words that could also appear in ordinary text.
In my case, I used synthetic words like ENToGENEoMK. You need to register these words in the vocab. (Please see the next paragraph.)
To add a custom token to the tokenizer (for this repo, i.e. the TensorFlow version), you need to modify vocab.txt. If you open the vocab.txt file, you can see the reserved [unused1] tokens at the beginning. You can replace these tokens with your custom tokens.
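As a rough illustration of the vocab.txt replacement described above, here is a minimal sketch in plain Python. It is not code from this repo: the helper name replace_unused_tokens, the tiny demo vocabulary, and the second marker ENToDISEASEoMK are all made up for the example, and it assumes the file has one token per line with reserved entries of the form [unusedN].

```python
import re
import tempfile
from pathlib import Path

# Reserved placeholder entries look like "[unused1]", "[unused2]", ...
UNUSED = re.compile(r"^\[unused\d+\]$")


def replace_unused_tokens(vocab_path, custom_tokens):
    """Overwrite reserved [unusedN] lines with custom tokens, in file order.

    Hypothetical helper for illustration. Line positions (token ids) and the
    vocabulary size stay unchanged, so the pretrained embedding matrix still
    lines up with the vocabulary.
    """
    path = Path(vocab_path)
    vocab = path.read_text(encoding="utf-8").splitlines()

    # Skip tokens that are already in the vocabulary to avoid duplicates.
    existing = set(vocab)
    pending = [tok for tok in custom_tokens if tok not in existing]

    # Indices of the reserved slots that can be overwritten.
    slots = [i for i, tok in enumerate(vocab) if UNUSED.match(tok)]
    if len(pending) > len(slots):
        raise ValueError(
            f"need {len(pending)} free slots but found only {len(slots)}"
        )

    for slot, token in zip(slots, pending):
        vocab[slot] = token

    path.write_text("\n".join(vocab) + "\n", encoding="utf-8")
    return vocab


# Demo on a toy vocabulary written to a temporary directory
# (a real vocab.txt is far larger; this one is invented for the example).
demo_vocab = Path(tempfile.mkdtemp()) / "vocab.txt"
demo_vocab.write_text(
    "\n".join(["[PAD]", "[unused1]", "[unused2]", "[UNK]", "gene"]) + "\n",
    encoding="utf-8",
)
updated = replace_unused_tokens(demo_vocab, ["ENToGENEoMK", "ENToDISEASEoMK"])
print(updated)
```

Because the custom tokens take over existing lines, nothing else in the checkpoint has to change. One likely reason to prefer purely alphanumeric dummy words such as ENToGENEoMK is that the BERT basic tokenizer splits on punctuation before the WordPiece lookup, so a bracketed marker may be broken apart even if it is listed in vocab.txt. It is worth backing up the original vocab.txt before editing it.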
I think your code tokenizer.add_tokens(['[E1]', '[/E1]', '[E2]', '[/E2]', '[BLANK]']) is for the HuggingFace framework. Please check https://github.com/dmis-lab/biobert-pytorch for the PyTorch (HuggingFace) version of the code.
Thank you and once again, sorry for the delay in response.
Hi, as a beginner, I would like to ask some basic questions about fine-tuning on a custom RE dataset in which the entities are marked in advance:
1) Are there any constraints on the kind of entity markers or dummy words used for BioNLP when marking the entities (e.g. @DISEASE$, [e] some disease [/e])?
2) When preprocessing, do we need to add some code to help the model tokenize the entity markers? E.g. if we use [E1] to mark the entity, do we let the tokenizer know about it like this:
tokenizer.add_tokens(['[E1]', '[/E1]', '[E2]', '[/E2]', '[BLANK]'])