JohnGiorgi / seq2rel-ds

This is a companion repository to seq2rel (https://github.com/JohnGiorgi/seq2rel) that aims to make it easy to generate training data.

Standardize corpus preprocessing #24

Closed by JohnGiorgi 2 years ago

JohnGiorgi commented 3 years ago

Every preprocess command should first convert its dataset to the PubTator format, so that all subsequent processing can reuse the functions and methods we have already written. This will simplify things like adding entity hints and computing corpus statistics. The general steps are:

  1. Rename the PubtatorAnnotation schema to something more general.
  2. For each preprocess command, first convert the corpus to PubTator format, then use the existing parse_pubtator function to convert it to the soon-to-be-renamed PubtatorAnnotation schema.
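
The two steps above could be sketched roughly as follows. This is a minimal illustration, not the repository's code: `Annotation`, `to_pubtator`, and `parse_pubtator_sketch` are hypothetical names, the entity IDs are made up, and the real `parse_pubtator` function and its schema may look quite different. It assumes the common PubTator layout of a `|t|` title line, an `|a|` abstract line, six-field tab-separated entity lines, and four-field relation lines.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Entity = Tuple[int, int, str, str, str]  # (start, end, mention, label, id)
Relation = Tuple[str, str, str]          # (label, head id, tail id)


@dataclass
class Annotation:
    """Hypothetical stand-in for the schema the issue proposes to rename."""
    text: str
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)


def to_pubtator(doc_id: str, title: str, abstract: str,
                entities: List[Entity], relations: List[Relation]) -> str:
    """Serialize one document from any corpus as a PubTator-style block."""
    lines = [f"{doc_id}|t|{title}", f"{doc_id}|a|{abstract}"]
    for start, end, mention, label, ent_id in entities:
        lines.append("\t".join([doc_id, str(start), str(end), mention, label, ent_id]))
    for rel_label, head_id, tail_id in relations:
        lines.append("\t".join([doc_id, rel_label, head_id, tail_id]))
    return "\n".join(lines)


def parse_pubtator_sketch(content: str) -> Dict[str, Annotation]:
    """Parse blank-line-separated PubTator blocks into Annotation objects."""
    parsed: Dict[str, Annotation] = {}
    for block in content.strip().split("\n\n"):
        lines = block.strip().split("\n")
        doc_id, _, title = lines[0].split("|", 2)
        _, _, abstract = lines[1].split("|", 2)
        # Offsets index into the title and abstract joined by one character.
        ann = Annotation(text=f"{title} {abstract}")
        for line in lines[2:]:
            fields = line.split("\t")
            if len(fields) == 6:  # entity line
                _, start, end, mention, label, ent_id = fields
                ann.entities.append((int(start), int(end), mention, label, ent_id))
            elif len(fields) == 4:  # relation line
                _, rel_label, head_id, tail_id = fields
                ann.relations.append((rel_label, head_id, tail_id))
        parsed[doc_id] = ann
    return parsed


# Toy document with made-up entity IDs, for illustration only.
example = to_pubtator(
    "1", "Aspirin-induced asthma.", "A short abstract.",
    entities=[(0, 7, "Aspirin", "Chemical", "C1"), (16, 22, "asthma", "Disease", "D1")],
    relations=[("CID", "C1", "D1")],
)
parsed = parse_pubtator_sketch(example)
```

With this shape, each preprocess command would only need its own `to_pubtator`-style converter, while entity hints and corpus statistics could be computed from the shared parsed representation.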

Commands to update