JohnGiorgi / seq2rel-ds

This is a companion repository to seq2rel (https://github.com/JohnGiorgi/seq2rel) that aims to make it easy to generate training data.

Add DGM #47

Closed JohnGiorgi closed 2 years ago

JohnGiorgi commented 2 years ago

This PR adds a command, dgm, that preprocesses the drug-gene-mutation corpus from "Document-Level N-ary Relation Extraction with Multiscale Representation Learning".

We made a couple of simplifying decisions, outlined here:

  1. The dataset is provided as sentence-length, paragraph-length, and full-text-length documents. We use the paragraph-length text, though this could be changed later.
  2. There are no relation annotations at the paragraph level for the test set, so we use the validation set as the test set and hold out 10% of the train set as a new validation set.
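
The re-splitting described in point 2 can be sketched as follows. This is a minimal illustration, not the code added in this PR: the helper name resplit, its parameters, the fixed seed, and the toy document lists are all assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def resplit(
    train: Sequence[T],
    valid: Sequence[T],
    valid_fraction: float = 0.1,
    seed: int = 13,
) -> Tuple[List[T], List[T], List[T]]:
    """Return (new_train, new_valid, new_test).

    Hypothetical helper for illustration only; not part of seq2rel-ds.
    """
    if not 0.0 < valid_fraction < 1.0:
        raise ValueError("valid_fraction must be strictly between 0 and 1")
    # The original validation set is reused, unchanged, as the test set.
    new_test = list(valid)
    # Shuffle a copy of the train set with a fixed seed so the split is reproducible.
    shuffled = list(train)
    random.Random(seed).shuffle(shuffled)
    # Hold out a fraction of the shuffled train set as the new validation set.
    n_valid = int(len(shuffled) * valid_fraction)
    new_valid = shuffled[:n_valid]
    new_train = shuffled[n_valid:]
    return new_train, new_valid, new_test


# Toy stand-ins for the paragraph-length documents (assumed data).
train_docs = [f"train-{i}" for i in range(100)]
valid_docs = [f"valid-{i}" for i in range(20)]
new_train, new_valid, new_test = resplit(train_docs, valid_docs)
print(len(new_train), len(new_valid), len(new_test))  # 90 10 20
```

Seeding the shuffle keeps the held-out validation set stable across runs, which matters when results on the new splits are compared over time.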

These decisions are documented in the dgm command's help text, which is shown when you run seq2rel-ds preprocess dgm main --help.