clulab / reach

Reach Biomedical Information Extraction

Output format: `training-output` to build supervised models #801

Open enoriega opened 11 months ago

enoriega commented 11 months ago

Summary

Added a new output format suitable for training classifiers in a Python pipeline. It "flattens" activations and regulations into a JSON array in which each event records the sentence tokens, the token indices of the event and its arguments, the label, and the polarity.

Example

[
  {
    "sentence_tokens" : [ "Notably", ",", "overexpressing", "MafB", "in", "human", "beta-cell", "lines", "(", "beta", "TC3", "cells", ")", "resulted", "in", "increased", "cell", "proliferation", "by", "upregulating", "important", "cell", "cycle", "regulators", ",", "like", "cyclin", "D2", "and", "cyclin", "B", "(", "28", ")", "." ],
    "event_indices" : [ 16, 17, 18, 19, 20, 21, 22, 23 ],
    "type" : "Positive_activation",
    "polarity" : true,
    "controller_indices" : [ 16, 17, 18 ],
    "controlled_indices" : [ 21, 22, 23 ],
    "trigger_indices" : [ 19, 20 ]
  }, {
    "sentence_tokens" : [ "In", "vivo", "glucose", "stimulated", "insulin", "secretion", "(", "GSIS", ")", "experiment", ".." ],
    "event_indices" : [ 2, 3, 4, 5, 6 ],
    "type" : "Positive_activation",
    "polarity" : true,
    "controller_indices" : [ 2, 3 ],
    "controlled_indices" : [ 4, 5, 6 ],
    "trigger_indices" : [ 3, 4 ]
  }
]
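
On the Python side, records in this format could be consumed roughly as follows. This is a minimal sketch, not part of the PR: `SAMPLE`, `tokens_at`, and `to_example` are hypothetical names, and the code assumes only that each `*_indices` list holds zero-based positions into `sentence_tokens`. It does not interpret what the spans mean beyond that.

```python
import json

# Second record from the example above, copied verbatim.
SAMPLE = """
[
  {
    "sentence_tokens" : [ "In", "vivo", "glucose", "stimulated", "insulin", "secretion", "(", "GSIS", ")", "experiment", ".." ],
    "event_indices" : [ 2, 3, 4, 5, 6 ],
    "type" : "Positive_activation",
    "polarity" : true,
    "controller_indices" : [ 2, 3 ],
    "controlled_indices" : [ 4, 5, 6 ],
    "trigger_indices" : [ 3, 4 ]
  }
]
"""

# Index fields present in every record of the example.
INDEX_FIELDS = (
    "event_indices",
    "controller_indices",
    "controlled_indices",
    "trigger_indices",
)


def tokens_at(record, field):
    """Return the tokens selected by the index list stored under `field`.

    Assumption: indices are zero-based positions into `sentence_tokens`.
    """
    tokens = record["sentence_tokens"]
    return [tokens[i] for i in record[field]]


def to_example(record):
    """Turn one record into a (features, label) pair.

    The label is the boolean `polarity`; the features are the event type
    plus the token lists looked up for each index field.
    """
    features = {
        field.replace("_indices", "_tokens"): tokens_at(record, field)
        for field in INDEX_FIELDS
    }
    features["type"] = record["type"]
    return features, record["polarity"]


records = json.loads(SAMPLE)
examples = [to_example(r) for r in records]
```

Each element of `examples` pairs a feature dictionary with the polarity label, which can then be passed to whatever vectorizer or model the pipeline uses.
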
kwalcock commented 11 months ago

@enoriega, this is being built for both Scala 2.11 and 2.12. The earlier version does not accept trailing (dangling) commas like the ones in TrainingDataExporter, so it doesn't compile there. To test, one can run `+compile` in sbt to build against every configured Scala version, or run `++2.11.12` and then `compile`.

kwalcock commented 11 months ago

TrainingDataExporter still needs a comma removed at line 76 in order to compile on Scala 2.11.