This PR better encapsulates the linearization of entities and relations. It adds two new methods, both named `to_string`, which can be called on a `PubtatorCluster` or `PubtatorAnnotation` object to get the linearized entity or relation string, respectively.
There are several small changes that are not expected to impact performance:
- Coreferent mentions are now joined by a semicolon with a space on either side (` ; `), so `lithium; li` becomes `lithium ; li`.
- Entity hints are now lowercased to match the target strings.
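To make the new joining behaviour concrete, here is a minimal sketch. `ClusterSketch` and its `mentions` field are hypothetical stand-ins, not the actual `PubtatorCluster` implementation, and lowercasing the mentions inside `to_string` is an assumption based on the note that target strings are lowercase.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusterSketch:
    """Hypothetical stand-in for a cluster of coreferent mentions."""

    mentions: List[str] = field(default_factory=list)

    def to_string(self, sep: str = " ; ") -> str:
        # Assumption: mentions are lowercased and stripped before joining.
        # The separator has a space on both sides of the semicolon.
        return sep.join(m.strip().lower() for m in self.mentions)


cluster = ClusterSketch(mentions=["Lithium", "Li"])
linearized = cluster.to_string()
print(linearized)
```

Keeping the separator as a parameter of a single method means the old `; ` format can be reproduced for comparison without touching call sites.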
And there are changes that may have a small impact on performance:
- Duplicate relations are removed from the training data. Some of these are artifacts of the original dataset; others may have been introduced by us.
- Better sorting for relations that have identical entities but a different head/tail order or relation type. To handle this case, we first sort lexicographically and then sort by entity offset. This gives a consistent ordering across the train/dev/test sets for these examples.
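The deduplication and two-pass sort described above could look roughly like the sketch below. `RelationSketch`, its fields, and `dedupe_and_sort` are hypothetical names for illustration only. The sketch also assumes that "sort lexicographically, then by entity offset" means two successive stable sorts, so that offset is the primary key and the linearized string breaks ties.

```python
from typing import List, NamedTuple


class RelationSketch(NamedTuple):
    """Hypothetical stand-in for a relation annotation."""

    head: str
    tail: str
    label: str
    offset: int  # assumed: offset of the relation's first entity mention

    def to_string(self) -> str:
        return f"{self.head} {self.tail} {self.label}"


def dedupe_and_sort(relations: List[RelationSketch]) -> List[RelationSketch]:
    # dict.fromkeys drops exact duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(relations))
    # Pass 1: lexicographic order on the linearized string.
    unique.sort(key=lambda r: r.to_string())
    # Pass 2: Python's sort is stable, so relations with equal offsets
    # keep the lexicographic order established in pass 1.
    unique.sort(key=lambda r: r.offset)
    return unique


relations = [
    RelationSketch("mania", "li", "treats", 10),
    RelationSketch("li", "mania", "treats", 10),
    RelationSketch("li", "mania", "treats", 10),  # exact duplicate
    RelationSketch("aspirin", "pain", "treats", 3),
]
ordered = [r.to_string() for r in dedupe_and_sort(relations)]
print(ordered)
```

Because the tie-break depends only on the strings themselves, the result does not depend on the order in which the relations were read, which is what makes the ordering consistent across splits.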
TODO
- [x] Check the new sorting strategy's impact on performance