All the preprocess commands should first convert their datasets to the PubTator format, so that subsequent processing can all use the same functions and methods we have written. This will simplify things like adding entity hints, computing corpus statistics, etc. The general steps are:
Rename the PubtatorAnnotation schema to something more general.
For each of the preprocess commands, first convert the corpus to PubTator format. Then use the existing parse_pubtator function to convert the result into the soon-to-be-renamed PubtatorAnnotation schema.
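As a rough illustration of the two steps above, here is a minimal sketch of serializing a corpus document to PubTator format and reading it back. The `Document` and `Entity` classes, `to_pubtator`, and `parse_pubtator_sketch` are all hypothetical names invented for this example. In particular, `parse_pubtator_sketch` is only a stand-in and is not the project's actual parse_pubtator function, whose signature may differ. The sketch assumes the common PubTator layout: a `PMID|t|title` line, a `PMID|a|abstract` line, then one tab-separated annotation line per entity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """Hypothetical stand-in for one entity annotation."""
    start: int
    end: int
    text: str
    label: str
    uid: str


@dataclass
class Document:
    """Hypothetical stand-in for the generalized annotation schema."""
    pmid: str
    title: str
    abstract: str
    entities: List[Entity] = field(default_factory=list)


def to_pubtator(doc: Document) -> str:
    """Serialize one document to a PubTator-formatted string."""
    lines = [f"{doc.pmid}|t|{doc.title}", f"{doc.pmid}|a|{doc.abstract}"]
    for e in doc.entities:
        # Annotation lines are tab-separated: PMID, start, end, mention, type, ID.
        lines.append(
            "\t".join([doc.pmid, str(e.start), str(e.end), e.text, e.label, e.uid])
        )
    return "\n".join(lines) + "\n"


def parse_pubtator_sketch(text: str) -> List[Document]:
    """Parse PubTator text (documents separated by blank lines).

    Assumption: this only mimics what the real parse_pubtator might do.
    """
    docs = []
    for block in text.strip().split("\n\n"):
        if not block.strip():
            continue  # skip empty input or stray blank blocks
        lines = block.splitlines()
        pmid, _, title = lines[0].split("|", 2)
        _, _, abstract = lines[1].split("|", 2)
        doc = Document(pmid, title, abstract)
        for line in lines[2:]:
            _, start, end, mention, label, uid = line.split("\t")
            doc.entities.append(Entity(int(start), int(end), mention, label, uid))
        docs.append(doc)
    return docs
```

With something like this in place, each preprocess command would only need its own corpus-specific converter. Everything downstream (entity hints, corpus statistics) could then operate on the shared schema.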
Commands to update