Closed striebel closed 1 year ago
This looks great! I'm also not sure why the automated actions aren't running, but will look into that separately. Probably there's a setting somewhere that needs to be enabled. What are some example sentences with `<unk>` tokens that motivate this change?
I'm working with anonymized data that includes sentences like

```
The people you refer to (<PERSON>, <PERSON>, <PERSON>) were never involved.
```

The opening angle bracket `<` in each `<PERSON>` gets tokenized as `<unk>`. For trigger identification on this sentence the transformer returns

```
The * people you refer to (PERSON>, PERSON>, PERSON>) were never * involved.
```

With my changes I added this sentence as a test case for the `marked_string_to_locs` function.
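For reference, here is a minimal sketch of what a mark-parsing helper along the lines of `marked_string_to_locs` can do in the exact-match case; the actual signature and behavior in the library may differ:

```python
def marked_string_to_locs(marked: str, mark: str = "*") -> tuple[str, list[int]]:
    """Hypothetical sketch: strip the mark characters from a trigger-marked
    string and return the unmarked text together with the character offsets
    where each mark sat."""
    chars: list[str] = []
    locs: list[int] = []
    for ch in marked:
        if ch == mark:
            # Record where this mark lands in the mark-free text.
            locs.append(len(chars))
        else:
            chars.append(ch)
    return "".join(chars), locs
```

This only works when the marked output matches the input exactly once the marks are removed, which is precisely the assumption that `<unk>` tokens break.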
In my data there are also cases with a longer sequence of anonymized persons, like

```
The people you refer to (<PERSON>, <PERSON>, <PERSON>, <PERSON>, <PERSON>, <PERSON>) were never involved.
```

and in this case for trigger identification the transformer returns

```
The * people you refer to (PERSON>, PERSON>, PERSON>, PERSON>, PERSON>, PERSON>, PERSON>, PERSON>, PERSON>, PERSON>, PERSON>, PERSON>) were never * involved.
```
Thanks for making and sharing this library! This pull request contains a feature that I needed and wrote up, in case you care to merge it into the larger project:
The feature adds sophistication when parsing the trigger-word marks (`*`) in the output string generated by the trigger-identification task. Instead of skipping parsing altogether for sentences where the trigger-identification task doesn't produce exactly the same text as the input sentence, we find an alignment between the original text and the trigger-marked text and transfer as many trigger locations as possible to the original sentence. The alignment technique also correctly transfers all the trigger locations in the common case when the two texts are exactly the same.

This update allows you to avoid getting back a lot of sentences with an empty `DetectFramesResult` when the sentences you're parsing frequently contain substrings that are tokenized as `<unk>` by the T5 tokenizer.

I ran `black`, `flake8`, and `pytest` locally, but I'm not exactly sure how to run them as GitHub Actions connected to this pull request.
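To illustrate the alignment idea described above (this is not the pull request's actual code, and the function name is hypothetical), one way to transfer mark positions between mismatched strings is with the standard library's `difflib`:

```python
import difflib

def transfer_mark_locs(original: str, marked: str, mark: str = "*") -> list[int]:
    """Hypothetical sketch: map trigger-mark offsets from `marked` back
    into `original`.

    The marked text may not match the original exactly (e.g. `<` was lost
    as <unk>), so instead of requiring an exact match we align the two
    strings and carry each mark position across the alignment, recovering
    as many locations as possible.
    """
    # Strip the marks, remembering where each one sat.
    chars: list[str] = []
    positions: list[int] = []
    for ch in marked:
        if ch == mark:
            positions.append(len(chars))
        else:
            chars.append(ch)
    stripped = "".join(chars)

    # Align the stripped marked text against the original sentence.
    matcher = difflib.SequenceMatcher(a=stripped, b=original, autojunk=False)
    blocks = matcher.get_matching_blocks()

    # Project each mark position through the matching blocks; marks that
    # fall in a non-matching region are dropped rather than guessed.
    locs: list[int] = []
    for pos in positions:
        for a0, b0, size in blocks:
            if a0 <= pos <= a0 + size:
                locs.append(b0 + (pos - a0))
                break
    return locs
```

When the marked text equals the original (marks aside), the whole string is one matching block and every location transfers exactly; when the texts diverge, marks that fall inside a matching region still map to a sensible offset in the original.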