JohnGiorgi / seq2rel-ds

This is a companion repository to seq2rel (https://github.com/JohnGiorgi/seq2rel) which aims to make it easy to generate training data.

Entity hints as prompts #42

JohnGiorgi closed 3 years ago

JohnGiorgi commented 3 years ago

Overview

This PR changes the way entity hints are inserted into the text. Briefly, they are now prepended to the source text, instead of being used to wrap entities within the source text.

Example

Take the string:

Variants in the estrogen receptor alpha (ESR1) gene and its mRNA contribute to risk for schizophrenia.

Before, we inserted entity hints like:

Variants in the @START_GENE@ estrogen receptor alpha ; 0 @END_GENE@ ( @START_GENE@ ESR1 ; 0 @END_GENE@ ) gene and its mRNA contribute to risk for @START_DISEASE@ schizophrenia @END_DISEASE@ .

Now, we do as follows:

estrogen receptor alpha ; ESR1 @GENE@ schizophrenia @DISEASE@ @HINTS@ Variants in the estrogen receptor alpha (ESR1) gene and its mRNA contribute to risk for schizophrenia.

Here, @HINTS@ is a new special token that separates the entity hints from the source text.
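To make the new format concrete, here is a minimal sketch of how such a prompt could be assembled. It is not the code from this PR: the function name `prepend_entity_hints`, the `(mentions, entity_type)` input layout, and the behaviour when there are no entities are all assumptions made for illustration. Only the output format is taken from the example above.

```python
from typing import Sequence, Tuple

# Special token separating the hints from the source text (from the example above).
HINTS_TOKEN = "@HINTS@"


def prepend_entity_hints(
    text: str, entities: Sequence[Tuple[Sequence[str], str]]
) -> str:
    """Prepend entity hints to `text`.

    `entities` is a hypothetical input layout: one (mentions, entity_type)
    pair per entity, where `mentions` are the surface forms of that entity.
    """
    hints = []
    for mentions, entity_type in entities:
        # Mentions of one entity are joined with " ; " and followed by
        # the entity type as a special token, e.g. "@GENE@".
        hints.append(f"{' ; '.join(mentions)} @{entity_type.upper()}@")
    if not hints:
        # Assumption: with no entities, the text is returned unchanged.
        return text
    return f"{' '.join(hints)} {HINTS_TOKEN} {text}"


text = (
    "Variants in the estrogen receptor alpha (ESR1) gene and its mRNA "
    "contribute to risk for schizophrenia."
)
entities = [
    (["estrogen receptor alpha", "ESR1"], "GENE"),
    (["schizophrenia"], "DISEASE"),
]
print(prepend_entity_hints(text, entities))
```

Run on the example string, this reproduces the new-style input shown above, with the source text itself left untouched.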

This leads to a large boost (+2%) on the CDR corpus. I suspect this is due to some combination of the following reasons:

  1. It leaves the source text unadulterated.
  2. It gives the model a defined region to copy from.
  3. The entity prompts look more like the target string.

Note that to obtain this performance I needed to reduce the number of training epochs by roughly 25 to 40% compared to the old style. This is of course good news, as training times are significantly reduced with the new prompts, but it is something to watch out for.

BC5CDR: 50 -> 30 epochs. GDA: 20 -> 15 epochs.
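As a quick sanity check, the relative reduction for each corpus can be computed from the epoch counts quoted above. The helper name `relative_reduction` is made up for this sketch.

```python
def relative_reduction(old_epochs: int, new_epochs: int) -> float:
    """Fraction of training epochs saved when moving from old to new."""
    return (old_epochs - new_epochs) / old_epochs


# Epoch counts as quoted in this PR: (old style, new style).
epochs = {"BC5CDR": (50, 30), "GDA": (20, 15)}

for corpus, (old, new) in epochs.items():
    saved = relative_reduction(old, new)
    print(f"{corpus}: {old} -> {new} epochs ({saved:.0%} fewer)")
```

This gives a 40% reduction for BC5CDR and a 25% reduction for GDA.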

TODO

Other changes