This PR adds a command, `dgm`, that preprocesses the drug-gene-mutation corpus from Document-Level N-ary Relation Extraction with Multiscale Representation Learning.

We made a couple of simplifying decisions, outlined here:

- The dataset is provided as sentence-length, paragraph-length, and full-text-length documents. We use the paragraph-length text, but this could be updated.
- There are no relation annotations at the paragraph level, so we use the validation set as the test set and hold out 10% of the train set as a new validation set.

This is documented by the `dgm` command and will come up if you call `seq2rel-ds preprocess dgm main --help`.
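
For reviewers, the re-partitioning described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function name `hold_out_validation`, the fixed seed, and the toy data are hypothetical, and shuffling before the split is an assumption.

```python
import random
from typing import List, Sequence, Tuple


def hold_out_validation(
    train: Sequence[str], fraction: float = 0.1, seed: int = 13
) -> Tuple[List[str], List[str]]:
    """Split `train` into (new_train, new_valid), holding out `fraction` of it.

    Hypothetical helper for illustration only; not part of seq2rel-ds.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    examples = list(train)
    # Shuffle a copy with a fixed seed so the split is reproducible (assumption).
    random.Random(seed).shuffle(examples)
    num_valid = max(1, round(len(examples) * fraction))
    return examples[num_valid:], examples[:num_valid]


# Toy stand-ins for the paragraph-length partitions (not real corpus data).
original_train = [f"train paragraph {i}" for i in range(20)]
original_valid = [f"valid paragraph {i}" for i in range(5)]

# Hold 10% of the train set out as the new validation set.
train, valid = hold_out_validation(original_train)
# The original validation partition is reused as the test set.
test = list(original_valid)
```

With 20 toy training examples, this leaves 18 for training and 2 for validation, while the 5 original validation examples become the test set.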