chanind / frame-semantic-transformer

Frame Semantic Parser based on T5 and FrameNet
https://chanind.github.io/frame-semantic-transformer
MIT License

IndexError for some texts using t5-small #22

Open cbjrobertson opened 1 year ago

cbjrobertson commented 1 year ago

For some texts, using t5-small, detect_frames raises a cryptic IndexError.

MRE:

#the two texts below differ *only* in that text_2 has a trailing period

text_1 = "Well, I came out and put on these and run around. And uh I like to run. I see the car. And I don't know, I uh. Yeah they're over there is going to put up and have a party uh. It's a child getting water. And she's she's going to making something nice. That's my father. He goes everywhere on that one. And that's people a lot of people like to go out on those"

text_2 = "Well, I came out and put on these and run around. And uh I like to run. I see the car. And I don't know, I uh. Yeah they're over there is going to put up and have a party uh. It's a child getting water. And she's she's going to making something nice. That's my father. He goes everywhere on that one. And that's people a lot of people like to go out on those."

from frame_semantic_transformer import FrameSemanticTransformer
fst_small = FrameSemanticTransformer("small")
fst_base = FrameSemanticTransformer("base")

>>>fst_small.detect_frames(text_1)
>>> DetectFramesResult(sentence="Well, I came out and put on these and run around. And uh I like to run. I see the car. And I don't know, I uh. Yeah they're over there is going to put up and have a party uh. It's a child getting water. And she's she's going to making something nice. That's my father. He goes everywhere on that one. And that's people a lot of people like to go out on those", trigger_locations=[8, 38, 59, 67, 99, 182, 261, 335], frames=[FrameResult(name='Arriving', trigger_location=8, frame_elements=[FrameElementResult(name='Theme', text='I'), FrameElementResult(name='Goal', text='out')]), FrameResult(name='Self_motion', trigger_location=38, frame_elements=[FrameElementResult(name='Self_mover', text='I'), FrameElementResult(name='Goal', text='around')]), FrameResult(name='Likelihood', trigger_location=59, frame_elements=[FrameElementResult(name='Hypothetical_event', text='I'), FrameElementResult(name='Hypothetical_event', text='to run')]), FrameResult(name='Self_motion', trigger_location=67, frame_elements=[FrameElementResult(name='Self_mover', text='I')]), FrameResult(name='Awareness', trigger_location=99, frame_elements=[FrameElementResult(name='Cognizer', text='I'), FrameElementResult(name='Content', text='I')]), FrameResult(name='People_by_age', trigger_location=182, frame_elements=[FrameElementResult(name='Person', text='child')]), FrameResult(name='Kinship', trigger_location=261, frame_elements=[FrameElementResult(name='Ego', text='my'), FrameElementResult(name='Alter', text='father')]), FrameResult(name='Likelihood', trigger_location=335, frame_elements=[FrameElementResult(name='Hypothetical_event', text='people a lot of people'), FrameElementResult(name='Hypothetical_event', text='to go out on those')])])

>>>fst_small.detect_frames(text_2)
>>>---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_1767205/3259689296.py in <module>
----> 1 fst_small.detect_frames(text_2)

~/anaconda3/envs/ktrain_loo/lib/python3.7/site-packages/frame_semantic_transformer/FrameSemanticTransformer.py in detect_frames(self, sentence)
    167         base_sentence, trigger_locs = self._identify_triggers(sentence)
    168         # next detect frames for each trigger
--> 169         frames = self._classify_frames(base_sentence, trigger_locs)
    170 
    171         frame_and_locs = [

~/anaconda3/envs/ktrain_loo/lib/python3.7/site-packages/frame_semantic_transformer/FrameSemanticTransformer.py in _classify_frames(self, sentence, trigger_locs)
    126             frame_classification_tasks, chunk_size=self.max_batch_size
    127         ):
--> 128             batch_results = self._batch_predict([task.get_input() for task in batch])
    129             for preds, frame_task in zip(
    130                 chunk_list(batch_results, self.predictions_per_sample),

~/anaconda3/envs/ktrain_loo/lib/python3.7/site-packages/frame_semantic_transformer/FrameSemanticTransformer.py in <listcomp>(.0)
    126             frame_classification_tasks, chunk_size=self.max_batch_size
    127         ):
--> 128             batch_results = self._batch_predict([task.get_input() for task in batch])
    129             for preds, frame_task in zip(
    130                 chunk_list(batch_results, self.predictions_per_sample),

~/anaconda3/envs/ktrain_loo/lib/python3.7/site-packages/frame_semantic_transformer/data/tasks/FrameClassificationTask.py in get_input(self)
     25 
     26     def get_input(self) -> str:
---> 27         potential_frames = get_possible_frames_for_trigger_bigrams(self.trigger_bigrams)
     28         return f"FRAME {' '.join(potential_frames)} : {self.trigger_labeled_text}"
     29 

~/anaconda3/envs/ktrain_loo/lib/python3.7/site-packages/frame_semantic_transformer/data/tasks/FrameClassificationTask.py in trigger_bigrams(self)
     44         pre_trigger_tokens = self.text[: self.trigger_loc].split()
     45         trigger_and_after_tokens = self.text[self.trigger_loc :].split()
---> 46         trigger = trigger_and_after_tokens[0]
     47         post_trigger_tokens = trigger_and_after_tokens[1:]
     48         bigrams: list[list[str]] = []

IndexError: list index out of range

>>> fst_base.detect_frames(text_1)
>>> DetectFramesResult(sentence="Well, I came out and put on these and run around. And uh I like to run. And I don't know, I uh. Yeah they're over there is going to put up and have a party uh. It's a child getting water. And she's she's going to make something nice. That's my father. He goes everywhere on that one. And that's people a lot of people like to go out on those", trigger_locations=[8, 38, 59, 67, 84, 173, 228, 244, 295, 302, 304, 311, 318, 326], frames=[FrameResult(name='Arriving', trigger_location=8, frame_elements=[FrameElementResult(name='Theme', text='I'), FrameElementResult(name='Goal', text='out')]), FrameResult(name='Self_motion', trigger_location=38, frame_elements=[FrameElementResult(name='Self_mover', text='I'), FrameElementResult(name='Path', text='around')]), FrameResult(name='Experiencer_focus', trigger_location=59, frame_elements=[FrameElementResult(name='Experiencer', text='I'), FrameElementResult(name='Content', text='to run')]), FrameResult(name='Self_motion', trigger_location=67, frame_elements=[FrameElementResult(name='Self_mover', text='I')]), FrameResult(name='Awareness', trigger_location=84, frame_elements=[FrameElementResult(name='Cognizer', text='I')]), FrameResult(name='Getting', trigger_location=173, frame_elements=[FrameElementResult(name='Recipient', text='a child'), FrameElementResult(name='Theme', text='water')]), FrameResult(name='Stimulus_focus', trigger_location=228, frame_elements=[FrameElementResult(name='Stimulus', text='something')]), FrameResult(name='Kinship', trigger_location=244, frame_elements=[FrameElementResult(name='Ego', text='my'), FrameElementResult(name='Alter', text='father')]), FrameResult(name='People', trigger_location=295, frame_elements=[FrameElementResult(name='Person', text='people')]), FrameResult(name='Quantified_mass', trigger_location=302, frame_elements=[FrameElementResult(name='Quantity', text='a lot'), FrameElementResult(name='Individuals', text='of people')]), 
FrameResult(name='Quantified_mass', trigger_location=304, frame_elements=[FrameElementResult(name='Quantity', text='a lot'), FrameElementResult(name='Individuals', text='of people')]), FrameResult(name='People', trigger_location=311, frame_elements=[FrameElementResult(name='Person', text='people')]), FrameResult(name='Desiring', trigger_location=318, frame_elements=[FrameElementResult(name='Experiencer', text='people'), FrameElementResult(name='Event', text='to go out on those')]), FrameResult(name='Motion', trigger_location=326, frame_elements=[FrameElementResult(name='Theme', text='people'), FrameElementResult(name='Goal', text='out on those')])])

>>> fst_base.detect_frames(text_2)
>>>DetectFramesResult(sentence="Well, I came out and put on these and run around. And uh I like to run. And I don't know, I uh. Yeah they're over there is going to put up and have a party uh. It's a child getting water. And she's she's going to make something nice. That's my father. He goes everywhere on that one. And that's people a lot of people like to go out on those.", trigger_locations=[8, 38, 59, 67, 84, 173, 228, 244, 295, 302, 304, 311, 318, 326], frames=[FrameResult(name='Arriving', trigger_location=8, frame_elements=[FrameElementResult(name='Theme', text='I'), FrameElementResult(name='Goal', text='out')]), FrameResult(name='Self_motion', trigger_location=38, frame_elements=[FrameElementResult(name='Self_mover', text='I'), FrameElementResult(name='Path', text='around')]), FrameResult(name='Experiencer_focus', trigger_location=59, frame_elements=[FrameElementResult(name='Experiencer', text='I'), FrameElementResult(name='Content', text='to run')]), FrameResult(name='Self_motion', trigger_location=67, frame_elements=[FrameElementResult(name='Self_mover', text='I')]), FrameResult(name='Awareness', trigger_location=84, frame_elements=[FrameElementResult(name='Cognizer', text='I')]), FrameResult(name='Getting', trigger_location=173, frame_elements=[FrameElementResult(name='Recipient', text='a child'), FrameElementResult(name='Theme', text='water')]), FrameResult(name='Stimulus_focus', trigger_location=228, frame_elements=[FrameElementResult(name='Stimulus', text='something')]), FrameResult(name='Kinship', trigger_location=244, frame_elements=[FrameElementResult(name='Ego', text='my'), FrameElementResult(name='Alter', text='father')]), FrameResult(name='People', trigger_location=295, frame_elements=[FrameElementResult(name='Person', text='people')]), FrameResult(name='Quantified_mass', trigger_location=302, frame_elements=[FrameElementResult(name='Quantity', text='a lot'), FrameElementResult(name='Individuals', text='of people')]), 
FrameResult(name='Quantified_mass', trigger_location=304, frame_elements=[FrameElementResult(name='Quantity', text='a lot'), FrameElementResult(name='Individuals', text='of people')]), FrameResult(name='People', trigger_location=311, frame_elements=[FrameElementResult(name='Person', text='people')]), FrameResult(name='Desiring', trigger_location=318, frame_elements=[FrameElementResult(name='Experiencer', text='people'), FrameElementResult(name='Event', text='to go out on those')]), FrameResult(name='Motion', trigger_location=326, frame_elements=[FrameElementResult(name='Theme', text='people'), FrameElementResult(name='Source', text='out'), FrameElementResult(name='Goal', text='on those')])])

Seems like a bug to me!

chanind commented 1 year ago

I'm having trouble reproducing the exact error, but I suspect it's related to T5's 512-token limit, which is shared between the input text and the task prompt sent to T5. In general, it's not a good idea to pass more than a single sentence at a time in the sentence param, since the model is only ever trained on single sentences. I don't know how it will perform when given multiple sentences like that, since that's outside the domain of the training data. This is also a problem with the documentation, since I don't think this is highlighted properly.

cbjrobertson commented 1 year ago

Your point about distribution shift aside, I don't think it can be text length.

Firstly because:

len(text_2.split())
>>> 76

This is way under the T5 input limit.

Secondly because:

text_3 = "Well, I came out and put on these and run around. And uh I like to run. I see the car. And I don't know, I uh. Yeah they're over there is going to put up and have a party uh. It's a child getting water. And she's she's going to making something nice. That's my father. He goes everywhere on that one. And that's people a lot of people like to go out on those and other stuff like that."
fst_small.detect_frames(text_3)
>>> DetectFramesResult(sentence="Well, I came out and put on these and run around. And uh I like to run. I see the car. And I don't know, I uh. Yeah they're over there is going to put up and have a party uh. It's a child getting water. And she's she's going to making something nice. That's my father. He goes everywhere on that one. And that's people a lot of people like to go out on those and other stuff like that", trigger_locations=[8, 38, 59, 67, 99, 182, 261, 335, 363], frames=[FrameResult(name='Arriving', trigger_location=8, frame_elements=[FrameElementResult(name='Theme', text='I'), FrameElementResult(name='Goal', text='out')]), FrameResult(name='Self_motion', trigger_location=38, frame_elements=[FrameElementResult(name='Self_mover', text='I'), FrameElementResult(name='Goal', text='around')]), FrameResult(name='Likelihood', trigger_location=59, frame_elements=[FrameElementResult(name='Hypothetical_event', text='I'), FrameElementResult(name='Hypothetical_event', text='to run')]), FrameResult(name='Self_motion', trigger_location=67, frame_elements=[FrameElementResult(name='Self_mover', text='I')]), FrameResult(name='Awareness', trigger_location=99, frame_elements=[FrameElementResult(name='Cognizer', text='I'), FrameElementResult(name='Content', text='I')]), FrameResult(name='People_by_age', trigger_location=182, frame_elements=[FrameElementResult(name='Person', text='child')]), FrameResult(name='Kinship', trigger_location=261, frame_elements=[FrameElementResult(name='Ego', text='my'), FrameElementResult(name='Alter', text='father')]), FrameResult(name='Likelihood', trigger_location=335, frame_elements=[FrameElementResult(name='Hypothetical_event', text='people a lot of people'), FrameElementResult(name='Hypothetical_event', text='to go out on those and other stuff like that')]), FrameResult(name='Increment', trigger_location=363, frame_elements=[FrameElementResult(name='Class', text='stuff')])])

In other words, text_3 does not return an error, even though it is longer than text_2 (i.e. it's the same but "and other stuff like that." is added to the end).

It may be helpful to know that I get the same error in the same line of code when I pass an empty string to detect_frames:

fst = FrameSemanticTransformer("small")
fst.detect_frames("")
>>>---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_2486399/937327281.py in <module>
      1 fst = FrameSemanticTransformer("small")
----> 2 fst.detect_frames("")
      3 

~/anaconda3/envs/ktrain_loo/lib/python3.7/site-packages/frame_semantic_transformer/FrameSemanticTransformer.py in detect_frames(self, sentence)
    167         base_sentence, trigger_locs = self._identify_triggers(sentence)
    168         # next detect frames for each trigger
--> 169         frames = self._classify_frames(base_sentence, trigger_locs)
    170 
    171         frame_and_locs = [

~/anaconda3/envs/ktrain_loo/lib/python3.7/site-packages/frame_semantic_transformer/FrameSemanticTransformer.py in _classify_frames(self, sentence, trigger_locs)
    126             frame_classification_tasks, chunk_size=self.max_batch_size
    127         ):
--> 128             batch_results = self._batch_predict([task.get_input() for task in batch])
    129             for preds, frame_task in zip(
    130                 chunk_list(batch_results, self.predictions_per_sample),

~/anaconda3/envs/ktrain_loo/lib/python3.7/site-packages/frame_semantic_transformer/FrameSemanticTransformer.py in <listcomp>(.0)
    126             frame_classification_tasks, chunk_size=self.max_batch_size
    127         ):
--> 128             batch_results = self._batch_predict([task.get_input() for task in batch])
    129             for preds, frame_task in zip(
    130                 chunk_list(batch_results, self.predictions_per_sample),

~/anaconda3/envs/ktrain_loo/lib/python3.7/site-packages/frame_semantic_transformer/data/tasks/FrameClassificationTask.py in get_input(self)
     25 
     26     def get_input(self) -> str:
---> 27         potential_frames = get_possible_frames_for_trigger_bigrams(self.trigger_bigrams)
     28         return f"FRAME {' '.join(potential_frames)} : {self.trigger_labeled_text}"
     29 

~/anaconda3/envs/ktrain_loo/lib/python3.7/site-packages/frame_semantic_transformer/data/tasks/FrameClassificationTask.py in trigger_bigrams(self)
     44         pre_trigger_tokens = self.text[: self.trigger_loc].split()
     45         trigger_and_after_tokens = self.text[self.trigger_loc :].split()
---> 46         trigger = trigger_and_after_tokens[0]
     47         post_trigger_tokens = trigger_and_after_tokens[1:]
     48         bigrams: list[list[str]] = []

IndexError: list index out of range
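The two lines the traceback points at (45 and 46 of FrameClassificationTask.py) can be exercised on their own, without loading any model. The sketch below copies only that slicing and indexing; the function names are hypothetical and nothing here is the library's actual code. It shows that the IndexError appears whenever nothing but whitespace follows the trigger location, which covers both an empty string and a location at the very end of the text. Whether the small model really produces such a location for text_2 is an assumption, not something the traceback confirms.

```python
from typing import Optional


def first_token_at(text: str, trigger_loc: int) -> str:
    """Same slicing and indexing as lines 45-46 shown in the traceback."""
    trigger_and_after_tokens = text[trigger_loc:].split()
    # An empty list here means no token starts at or after trigger_loc.
    return trigger_and_after_tokens[0]


def first_token_at_or_none(text: str, trigger_loc: int) -> Optional[str]:
    """Hypothetical guarded variant: returns None instead of raising."""
    trigger_and_after_tokens = text[trigger_loc:].split()
    return trigger_and_after_tokens[0] if trigger_and_after_tokens else None


def raises_index_error(text: str, trigger_loc: int) -> bool:
    """Report whether the unguarded lookup fails for this input."""
    try:
        first_token_at(text, trigger_loc)
    except IndexError:
        return True
    return False


if __name__ == "__main__":
    print(first_token_at("I see the car.", 2))
    print(raises_index_error("", 0))
    tail = "go out on those."
    print(raises_index_error(tail, len(tail)))
```

A guard like first_token_at_or_none would only hide the symptom; the more useful question is why the trigger location points past the last token in the first place.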

It works as expected with the "base" model:

fst = FrameSemanticTransformer("base")
fst.detect_frames("")
>>> DetectFramesResult(sentence='TRIGGER:', trigger_locations=[0], frames=[FrameResult(name='Triggering', trigger_location=0, frame_elements=[FrameElementResult(name='', text='N/A')])])

chanind commented 1 year ago

It's strange that it's happening on an empty string as well. I still can't reproduce this error for some reason. It seems like the small model must be outputting some strange string that breaks the processing of subtasks; I wish I could see what string it is. I'm not sure if it's related, but the library currently only supports Python 3.8+, so I'm surprised it was possible to install it on Python 3.7.

As for the length of the input, the reason it could spill over the 512 token limit is that the subtask prompts can also be pretty long. The library tries to include any info it thinks might be helpful for T5 to make the classification as part of the task definition. So, for example, one of the subtask inputs for text_2 looks like the following when I try it locally:

ARGS Perception_experience | Perceiver_passive Phenomenon Body_part Location_of_protagonist Direction Depictive State Manner Degree Means Time Duration Contrastive_context Concessive Circumstances Frequency Place Ground Obscuring_medium : Well, I came out and put on these and run around. And uh I like to run. I * see the car. And I don't know, I uh. Yeah they're over there is going to put up and have a party uh. It's a child getting water. And she's she's going to making something nice. That's my father. He goes everywhere on that one. And that's people a lot of people like to go out on those.

Here, it's trying to include all the possible argument names that the Perception_experience frame might have as part of the text it sends to T5 to make the task easier for T5. The length of this prompt prefix depends on the frame that's predicted, so it's possible that it might spill over the 512 limit seemingly randomly depending on which frame it's predicting if the input text is long enough.
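To make the budget argument concrete, here is a minimal sketch that counts tokens for the whole prompt (prefix plus sentence) rather than for the sentence alone. The helper names are hypothetical, and the default whitespace counter is only a crude stand-in: a real subword tokenizer will usually report more tokens than a whitespace split, so a count from len(text.split()) should be read as a lower bound. A real tokenizer's counting function can be passed in through count_tokens.

```python
from typing import Callable

T5_TOKEN_LIMIT = 512  # the limit quoted in this thread


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer; typically undercounts subword tokens."""
    return len(text.split())


def prompt_token_count(
    prefix: str,
    sentence: str,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> int:
    """Count tokens for the full 'prefix : sentence' string, not just the sentence."""
    return count_tokens(f"{prefix} : {sentence}")


def fits_budget(
    prefix: str,
    sentence: str,
    limit: int = T5_TOKEN_LIMIT,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> bool:
    """True if the combined prompt stays within the limit under the given counter."""
    return prompt_token_count(prefix, sentence, count_tokens) <= limit


if __name__ == "__main__":
    prefix = "ARGS Perception_experience | Perceiver_passive Phenomenon"
    sentence = "I * see the car."
    print(whitespace_token_count(sentence), prompt_token_count(prefix, sentence))
```

The point of the split between the two counts is that the same sentence can fit for one predicted frame and overflow for another, because only the prefix changes.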

That being said, if this is happening on an empty string input then something is definitely wrong. Are you able to see what strings are being passed to the model and what it's returning? It might require hacking print statements into the library for debugging. Or maybe we can add a debug param to the library to make this sort of thing easier 🤔
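Short of editing the installed package, one way to see those strings is to wrap a method on the instance so that every call's inputs and outputs are printed and kept. The sketch below is generic Python: record_calls and FakeModel are hypothetical names for illustration. The only library detail it leans on is that the traceback shows a _batch_predict method being called on the transformer object; whether trigger identification also goes through that method is an assumption.

```python
import functools
from typing import Any, Callable, List, Tuple


def record_calls(obj: Any, method_name: str) -> List[Tuple[tuple, dict, Any]]:
    """Wrap obj.<method_name> so each call's arguments and result are printed and stored."""
    original: Callable[..., Any] = getattr(obj, method_name)  # bound method
    calls: List[Tuple[tuple, dict, Any]] = []

    @functools.wraps(original)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        result = original(*args, **kwargs)
        calls.append((args, kwargs, result))
        print(f"{method_name} in:  {args!r} {kwargs!r}")
        print(f"{method_name} out: {result!r}")
        return result

    # Shadow the method on this one instance only; the class is left untouched.
    setattr(obj, method_name, wrapper)
    return calls


class FakeModel:
    """Stand-in used only to demonstrate the wrapper; not part of the library."""

    def _batch_predict(self, inputs: List[str]) -> List[str]:
        return [text.upper() for text in inputs]


if __name__ == "__main__":
    model = FakeModel()
    calls = record_calls(model, "_batch_predict")
    model._batch_predict(["FRAME ... : some text"])
```

Applied to the real object, this would be record_calls(fst, "_batch_predict") followed by the failing detect_frames call; anything recorded before the exception is raised remains available in the returned list.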

chanind commented 1 year ago

Actually, it's possible that what's going on is that you're on an old version of the library, since old versions did support Python 3.7. Can you confirm what version of the library you have installed?