clulab / tomcat-text

Natural language text processing code for the DARPA ASIST program

Compactify JSON serialization of extractions #273

Open adarshp opened 2 years ago

adarshp commented 2 years ago

I believe the JSON output is excessively verbose because the arguments of complex events are serialized redundantly, and that this is hurting the usability of our system. We should compactify our output by removing this redundancy.

Simple/low-level extractions act as the arguments for more complex extractions. For example, in the screenshot below, CriticalVictim and Deictic act as the exists and location arguments for KnowledgeSharing events.

[Screenshot]

The corresponding data.extractions field is quite verbose:

"extractions": [
      {
        "labels": [
          "CriticalVictim",
          "Victim",
          "Entity",
          "Concept"
        ],
        "span": "critical victim",
        "arguments": {},
        "attachments": [],
        "start_offset": 11,
        "end_offset": 26,
        "rule": "critical_victim"
      },
      {
        "labels": [
          "KnowledgeSharing",
          "SimpleActions",
          "Action",
          "EventLike",
          "Concept"
        ],
        "span": "There is a critical victim here",
        "arguments": {
          "exists": [
            {
              "labels": [
                "CriticalVictim",
                "Victim",
                "Entity",
                "Concept"
              ],
              "span": "critical victim",
              "arguments": {},
              "attachments": [],
              "start_offset": 11,
              "end_offset": 26,
              "rule": "critical_victim"
            }
          ],
          "location": [
            {
              "labels": [
                "Deictic",
                "Inferred",
                "Location",
                "EventLike",
                "Concept"
              ],
              "span": "here",
              "arguments": {},
              "attachments": [],
              "start_offset": 27,
              "end_offset": 31,
              "rule": "deictic_detection"
            }
          ]
        },
        "attachments": [],
        "start_offset": 0,
        "end_offset": 31,
        "rule": "existential"
      }
    ]

You can see in the above example that the CriticalVictim mention is serialized twice - once by itself, and once within the KnowledgeSharing event.

Can we have data.extractions be an object instead of an array? The object would have integer keys, and the values would be extractions. The integer keys can then be used in the serialization of the complex events, where they serve as pointers to the simple extractions. The current method of serialization obscures the fact that the CriticalVictim mention in the argument of the KnowledgeSharing event above is the same as the standalone CriticalVictim mention.
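
To make the proposal concrete, here is a minimal Python sketch of the idea, not an implementation from tomcat-text. The function name `compactify` is hypothetical, and the sketch assumes that two mentions with the same rule and offsets are the same mention. Each extraction is stored once under an integer key, and arguments hold lists of keys instead of nested copies.

```python
import json


def compactify(extractions):
    """Return an {int: extraction} table in which arguments are lists of keys."""
    table = {}  # integer key -> extraction with key-based arguments
    seen = {}   # (rule, start_offset, end_offset) -> integer key

    def add(ext):
        # Assumption: rule plus offsets is enough to identify a mention.
        key = (ext["rule"], ext["start_offset"], ext["end_offset"])
        if key in seen:
            return seen[key]
        # Register the arguments first so they can be referenced by key.
        args = {
            role: [add(mention) for mention in mentions]
            for role, mentions in ext["arguments"].items()
        }
        idx = len(table)
        seen[key] = idx
        table[idx] = {**ext, "arguments": args}
        return idx

    for ext in extractions:
        add(ext)
    return table


# Abbreviated versions of the extractions from the example above.
critical_victim = {
    "labels": ["CriticalVictim", "Victim", "Entity", "Concept"],
    "span": "critical victim",
    "arguments": {},
    "attachments": [],
    "start_offset": 11,
    "end_offset": 26,
    "rule": "critical_victim",
}
deictic = {
    "labels": ["Deictic", "Inferred", "Location", "EventLike", "Concept"],
    "span": "here",
    "arguments": {},
    "attachments": [],
    "start_offset": 27,
    "end_offset": 31,
    "rule": "deictic_detection",
}
knowledge_sharing = {
    "labels": ["KnowledgeSharing", "SimpleActions", "Action", "EventLike", "Concept"],
    "span": "There is a critical victim here",
    "arguments": {"exists": [critical_victim], "location": [deictic]},
    "attachments": [],
    "start_offset": 0,
    "end_offset": 31,
    "rule": "existential",
}

compact = compactify([critical_victim, knowledge_sharing])
print(json.dumps(compact, indent=2))
```

In this sketch the KnowledgeSharing event ends up with arguments that contain only integer keys, so the CriticalVictim mention appears once in the output. Note that `json.dumps` writes the integer keys as strings, since JSON object keys must be strings.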

It is likely too late to do this for Study 3, but we should seriously consider this for Study 4.

adarshp commented 2 years ago

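A couple of additional ideas for compactifying:

  • arguments and attachments fields should not be published if they are empty.
  • the separate start_offset and end_offset fields should be combined into one field, like so:

"offsets": [0, 31]

where the first number is the start offset and the second number is the end offset.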

jastier commented 2 years ago

Good ideas, I can put those in the Dialog Agent easily enough.

On Tue, Apr 5, 2022 at 1:55 PM Adarsh Pyarelal wrote:

A couple of additional ideas for compactifying:

  • arguments and attachments fields should not be published if they are empty.
  • the separate start_offset and end_offset fields should be combined into one field, like so:

"offsets": [0, 31]

where the first number is the start offset and the second number is the end offset.
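
As an illustration only, the two ideas could be sketched in Python as follows. The helper name `slim` is hypothetical and is not part of tomcat-text or the Dialog Agent. The sketch assumes each extraction is a dict with the fields shown in the original example.

```python
def slim(ext):
    """Drop empty arguments/attachments and merge the two offset fields."""
    # Copy every field except the two separate offset fields.
    out = {k: v for k, v in ext.items() if k not in ("start_offset", "end_offset")}
    # Combine the offsets into a single [start, end] pair.
    out["offsets"] = [ext["start_offset"], ext["end_offset"]]
    # Omit arguments and attachments when they are empty.
    for field in ("arguments", "attachments"):
        if not out.get(field):
            out.pop(field, None)
    # Apply the same treatment to any nested argument mentions.
    if "arguments" in out:
        out["arguments"] = {
            role: [slim(mention) for mention in mentions]
            for role, mentions in out["arguments"].items()
        }
    return out


victim = {
    "labels": ["CriticalVictim", "Victim", "Entity", "Concept"],
    "span": "critical victim",
    "arguments": {},
    "attachments": [],
    "start_offset": 11,
    "end_offset": 26,
    "rule": "critical_victim",
}
event = {
    "labels": ["KnowledgeSharing", "SimpleActions", "Action", "EventLike", "Concept"],
    "span": "There is a critical victim here",
    "arguments": {"exists": [victim]},
    "attachments": [],
    "start_offset": 0,
    "end_offset": 31,
    "rule": "existential",
}

slim_victim = slim(victim)
slim_event = slim(event)
```

Here `slim_victim` keeps only labels, span, rule, and offsets, while `slim_event` keeps its non-empty arguments field with the nested mention slimmed in the same way.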


adarshp commented 2 years ago

Thanks @jastier - let's not worry about it until after the Study 3 code freeze on April 20. I would also like to run the proposed changes by the testbed WG before we implement them.

jastier commented 2 years ago

👍

On Tue, Apr 5, 2022 at 2:44 PM Adarsh Pyarelal wrote:

Thanks @jastier - let's not worry about it until after the Study 3 code freeze on April 20. I would also like to run the proposed changes by the testbed WG before we implement them.
