Closed striebel closed 1 year ago
This looks great! I'm also not sure why the automated actions aren't running, but will look into that separately. Probably there's a setting somewhere that needs to be enabled. What are some example sentences with `<unk>` tokens that motivate this change?
I'm working with anonymized data that includes sentences like

```
The people you refer to (<PERSON>, <PERSON>, <PERSON>) were never involved.
```

The opening angle bracket `<` in each `<PERSON>` gets tokenized as `<unk>`. For trigger identification on this sentence the transformer returns

```
The * people you refer to (PERSON>, PERSON>, PERSON>) were never * involved.
```

With my changes I added this sentence as a test case for the `marked_string_to_locs` function.
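For reference, here is a minimal sketch of what a mark-parsing helper along the lines of `marked_string_to_locs` can do in the exact-match case; the actual signature and behavior in the library may differ:

```python
def marked_string_to_locs(marked: str, mark: str = "*") -> tuple[str, list[int]]:
    """Hypothetical sketch: strip the mark characters from a trigger-marked
    string and return the unmarked text together with the character offsets
    where each mark sat."""
    chars: list[str] = []
    locs: list[int] = []
    for ch in marked:
        if ch == mark:
            # Record where this mark lands in the mark-free text.
            locs.append(len(chars))
        else:
            chars.append(ch)
    return "".join(chars), locs
```

This only works when the marked output matches the input exactly once the marks are removed, which is precisely the assumption that `<unk>` tokens break.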
In my data there are also cases with a longer sequence of anonymized persons, like

```
The people you refer to (<PERSON>, <PERSON>, <PERSON>, <PERSON>, <PERSON>, <PERSON>) were never involved.
```

and in this case for trigger identification the transformer returns

```
The * people you refer to (PERSON>, PERSON>, PERSON>, PERSON>, PERSON>, PERSON>, PERSON>, PERSON>, PERSON>, PERSON>, PERSON>, PERSON>) were never * involved.
```
Thanks for making and sharing this library! This pull request contains a feature that I needed and wrote up, in case you care to merge it into the larger project:
The feature adds sophistication when parsing the trigger-word marks (`*`) in the output string generated by the trigger-identification task. Instead of skipping parsing altogether for sentences where the trigger-identification task doesn't produce exactly the same text as the input sentence, we find an alignment between the original text and the trigger-marked text and transfer as many trigger locations as possible to the original sentence. The alignment technique also correctly transfers all the trigger locations in the common case when the two texts are exactly the same.

This update allows you to avoid getting back a lot of sentences with an empty `DetectFramesResult` when the sentences you're parsing frequently contain substrings that are tokenized as `<unk>` by the T5 tokenizer.

I ran `black`, `flake8`, and `pytest` locally, but I'm not exactly sure how to run them as GitHub Actions connected to this pull request.
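To illustrate the alignment idea described above (this is not the pull request's actual code, and the function name is hypothetical), one way to transfer mark positions between mismatched strings is with the standard library's `difflib`:

```python
import difflib

def transfer_mark_locs(original: str, marked: str, mark: str = "*") -> list[int]:
    """Hypothetical sketch: map trigger-mark offsets from `marked` back
    into `original`.

    The marked text may not match the original exactly (e.g. `<` was lost
    as <unk>), so instead of requiring an exact match we align the two
    strings and carry each mark position across the alignment, recovering
    as many locations as possible.
    """
    # Strip the marks, remembering where each one sat.
    chars: list[str] = []
    positions: list[int] = []
    for ch in marked:
        if ch == mark:
            positions.append(len(chars))
        else:
            chars.append(ch)
    stripped = "".join(chars)

    # Align the stripped marked text against the original sentence.
    matcher = difflib.SequenceMatcher(a=stripped, b=original, autojunk=False)
    blocks = matcher.get_matching_blocks()

    # Project each mark position through the matching blocks; marks that
    # fall in a non-matching region are dropped rather than guessed.
    locs: list[int] = []
    for pos in positions:
        for a0, b0, size in blocks:
            if a0 <= pos <= a0 + size:
                locs.append(b0 + (pos - a0))
                break
    return locs
```

When the marked text equals the original (marks aside), the whole string is one matching block and every location transfers exactly; when the texts diverge, marks that fall inside a matching region still map to a sensible offset in the original.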