## Overview

This PR changes the way entity hints are inserted into the text. Briefly, they are now prepended to the source text, instead of being used to wrap entities within the source text.
## Example

Take the string:

> Variants in the estrogen receptor alpha (ESR1) gene and its mRNA contribute to risk for schizophrenia.
Before, we inserted entity hints like:

> Variants in the @START_GENE@ estrogen receptor alpha ; 0 @END_GENE@ ( @START_GENE@ ESR1 ; 0 @END_GENE@ ) gene and its mRNA contribute to risk for @START_DISEASE@ schizophrenia @END_DISEASE@ .

Now, we do the following:

> estrogen receptor alpha ; ESR1 @GENE@ schizophrenia @DISEASE@ @HINTS@ Variants in the estrogen receptor alpha (ESR1) gene and its mRNA contribute to risk for schizophrenia.
Here, `@HINTS@` is a new special token.
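As a rough sketch, the new-style prompt could be assembled like this (the helper name and the entity data structure here are illustrative, not the repo's actual API; only the hint format itself is taken from the example above):

```python
def build_hinted_input(source_text, entities):
    """Prepend entity hints to the source text.

    `entities` maps an entity type (e.g. "GENE") to a list of mention
    groups, where each group holds the aliases of one entity
    (e.g. ["estrogen receptor alpha", "ESR1"]).
    Hypothetical helper; names are illustrative.
    """
    hints = []
    for entity_type, groups in entities.items():
        for aliases in groups:
            # Aliases of the same entity are joined with " ; ",
            # followed by that type's special token, e.g. "@GENE@".
            hints.append(f"{' ; '.join(aliases)} @{entity_type}@")
    # The @HINTS@ special token separates the hints from the source text.
    return f"{' '.join(hints)} @HINTS@ {source_text}"


text = ("Variants in the estrogen receptor alpha (ESR1) gene and its mRNA "
        "contribute to risk for schizophrenia.")
entities = {
    "GENE": [["estrogen receptor alpha", "ESR1"]],
    "DISEASE": [["schizophrenia"]],
}
print(build_hinted_input(text, entities))
```

Run on the example sentence, this reproduces the new-style input shown above.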
This leads to a large boost (+2%) on the CDR corpus. I suspect this is due to some combination of the following reasons:

- It leaves the source text unadulterated.
- It gives the model a defined region to copy from.
- The entity prompts look more like the target string.
Note: I found that to obtain this performance I needed to reduce the number of training epochs by about 40% compared to the old style. This is good news, as training times are significantly reduced with the new prompts, but it is something to watch out for:

- BC5CDR: 50 -> 30 epochs.
- GDA: 20 -> 15 epochs.
## TODO

- [x] Confirm that gains are seen on GDA
## Other changes

- :fire: Remove all ADE corpus code.
- ♻️ Break a couple of sorting-related helper functions out of `util` into `sorting_utils.py`.
- ♻️ Move the special tokens from `util` to `special_tokens.py`.