## Overview

This PR changes the way entity hints are inserted into the text. Briefly, they are now prepended to the source text, instead of being used to wrap entities within the source text.
## Example

Take the string:

> Variants in the estrogen receptor alpha (ESR1) gene and its mRNA contribute to risk for schizophrenia.
Before, we inserted entity hints like:

> Variants in the @START_GENE@ estrogen receptor alpha ; 0 @END_GENE@ ( @START_GENE@ ESR1 ; 0 @END_GENE@ ) gene and its mRNA contribute to risk for @START_DISEASE@ schizophrenia @END_DISEASE@ .

Now, we do the following:

> estrogen receptor alpha ; ESR1 @GENE@ schizophrenia @DISEASE@ @HINTS@ Variants in the estrogen receptor alpha (ESR1) gene and its mRNA contribute to risk for schizophrenia.
Here, `@HINTS@` is a new special token.
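As a rough sketch, the new-style prompt could be assembled like this (the helper name and the entity data structure here are illustrative, not the repo's actual API; only the hint format itself is taken from the example above):

```python
def build_hinted_input(source_text, entities):
    """Prepend entity hints to the source text.

    `entities` maps an entity type (e.g. "GENE") to a list of mention
    groups, where each group holds the aliases of one entity
    (e.g. ["estrogen receptor alpha", "ESR1"]).
    Hypothetical helper; names are illustrative.
    """
    hints = []
    for entity_type, groups in entities.items():
        for aliases in groups:
            # Aliases of the same entity are joined with " ; ",
            # followed by that type's special token, e.g. "@GENE@".
            hints.append(f"{' ; '.join(aliases)} @{entity_type}@")
    # The @HINTS@ special token separates the hints from the source text.
    return f"{' '.join(hints)} @HINTS@ {source_text}"


text = ("Variants in the estrogen receptor alpha (ESR1) gene and its mRNA "
        "contribute to risk for schizophrenia.")
entities = {
    "GENE": [["estrogen receptor alpha", "ESR1"]],
    "DISEASE": [["schizophrenia"]],
}
print(build_hinted_input(text, entities))
```

Run on the example sentence, this reproduces the new-style input shown above.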
This leads to a large boost (+2%) on the CDR corpus. I suspect this is due to some combination of the following reasons:

- It leaves the source text unadulterated.
- It gives the model a defined region to copy from.
- The entity prompts look more like the target string.
Note: I found that to obtain this performance I needed to reduce the number of training epochs by about 40% compared to the old style. This is good news, as training times are significantly reduced with the new prompts, but it is something to watch out for:

- BC5CDR: 50 -> 30 epochs.
- GDA: 20 -> 15 epochs.
## TODO

- [x] Confirm that gains are seen on GDA
## Other changes

- :fire: Remove all ADE corpus code.
- ♻️ Break a couple of sorting-related helper functions out of `util` into `sorting_utils.py`.
- ♻️ Move the special tokens from `util` to `special_tokens.py`.