clulab / reach

Reach Biomedical Information Extraction
Other
96 stars 39 forks source link

Triggers aren't converted to Bio- or Coref-Mentions and don't survive serialization #785

Open kwalcock opened 1 year ago

kwalcock commented 1 year ago

It looks very much like any BioEventMention or CorefEventMention that is deserialized is forced to have a BioTextBoundMention or CorefTextBoundMention as a trigger: https://github.com/clulab/reach/blob/c7397a4b979454854f0c4a39098d2c2bb31363a3/main/src/main/scala/org/clulab/reach/mentions/serialization/json/JSONSerializer.scala#L91

https://github.com/clulab/reach/blob/c7397a4b979454854f0c4a39098d2c2bb31363a3/main/src/main/scala/org/clulab/reach/mentions/serialization/json/JSONSerializer.scala#L198

However, when the BioEventMention or CorefEventMention is created, the TextBoundMention in the trigger is not converted similarly:

https://github.com/clulab/reach/blob/c7397a4b979454854f0c4a39098d2c2bb31363a3/main/src/main/scala/org/clulab/reach/mentions/package.scala#L28

https://github.com/clulab/reach/blob/c7397a4b979454854f0c4a39098d2c2bb31363a3/main/src/main/scala/org/clulab/reach/mentions/package.scala#L65

This might cause many problems, but one is that a serialized and then deserialized Bio- or CorefEventMention will have its trigger change type from a simple TextBoundMention to a more specific one so that the round trip is essentially invalid. A newly enabled but old test confirms this.

Those lines should probably be changed to

          m.trigger.toBioMention.asInstanceOf[BioTextBoundMention],

and

          m.trigger.toCorefMention.asInstanceOf[CorefTextBoundMention],

It would be good for others who know more about this project to confirm the intention of the original design and consider whether the change would cause problems. Thanks.

FYI @enoriega, @MihaiSurdeanu

kwalcock commented 1 year ago

For instance, and this was a concern in Eidos, the original TextBoundMention may be shared between two EventMentions and something may be depending on that reference equality. That single trigger will be converted into two separate CorefTextBoundMention copies which might be manipulated independently, for instance, by serializing them both. They will have the same ID, leading to duplicate keys in some database, for example.

kwalcock commented 1 year ago

Similarly, it doesn't look like the paths are converted. I get something like a CorefEventMention but the paths it has are simple BioTextBoundMentions. This can mean that the IDs don't match up. It looks like sometimes the same mention is both an argument as a CorefTextBoundMention and in the path as a BioTextBoundMention.