chanind / frame-semantic-transformer

Frame Semantic Parser based on T5 and FrameNet
https://chanind.github.io/frame-semantic-transformer
MIT License

Getting trigger word (not just location) #26

Open · ruckc opened 1 year ago

ruckc commented 1 year ago

Looking at the API, we get a trigger_location in the inference result. Any chance we can expand this to include the actual trigger word?

Along with this, is it possible to get the frame element locations? A sentence may have "brown" multiple times in different contexts, and it would be helpful to have those locations within the sentence.

Just trying to make a frame visualizer wrapped around this implementation.

dersuchendee commented 12 months ago

Following

chanind commented 11 months ago

Sorry for the slow reply!

I think the trigger word wouldn't be too hard to add; the difficulty is that, with the way this is implemented internally, only the start of the trigger is marked. If the trigger is multiple words long, it's not possible to tell that from the trigger start location alone. In most cases, though, the trigger is only one word anyway, so adding the word immediately following the trigger location to the output should be easy to implement.
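
In the meantime, a rough client-side workaround is possible (a minimal sketch, not part of the library's API, and assuming a single-word trigger): slice the sentence from trigger_location to the next whitespace.

def trigger_word(sentence: str, trigger_location: int) -> str:
    """Return the word starting at trigger_location.

    Any punctuation attached to the word is kept as-is, and a
    multi-word trigger is truncated to its first word.
    """
    rest = sentence[trigger_location:]
    return rest.split(maxsplit=1)[0] if rest else ""

# e.g. trigger_word("It was a bright cold day in April", 9) == "bright"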

Getting frame element locations is sadly a lot harder with the current implementation. T5 is trained to output just the frame element text, so if the sentence contains the same element word multiple times, it's not possible to tell which specific instance is being referred to. This was a poor design decision on my part.

Fixing both of these robustly should be possible but will require retraining the models. To get the trigger word when it's multiple words long, we'd need to mark the trigger end location as well as the start location, which shouldn't be too hard. To differentiate between repeated words as frame elements, we'd need something like the numbering scheme used in https://arxiv.org/abs/2010.10998, where the sentence is broken up with numbers between the words, and T5 then outputs the range of numbers corresponding to the span covered by the frame element. There are probably some edge cases to think about around escaping actual numbers in the sentence, but I think this should be doable. Neither of these fixes is trivial, sadly, but they may be worth it to improve the library's output for your use case.
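
To illustrate the numbering idea (a minimal sketch based on my reading of that paper, not how the library currently builds its inputs):

def add_markers(sentence: str) -> str:
    """Interleave index markers between words so the model can name a
    span unambiguously, even when the same word appears more than once."""
    words = sentence.split()
    markers = " ".join(f"{i} {word}" for i, word in enumerate(words))
    return f"{markers} {len(words)}"

# add_markers("the brown fox saw the brown dog")
# -> "0 the 1 brown 2 fox 3 saw 4 the 5 brown 6 dog 7"
# The model could then output "5 6" to pick out the second "brown",
# which is unambiguous even though "brown" appears twice.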

ruckc commented 11 months ago

This is the rabbit hole I've been digging into. I realized it all comes down to T5 just outputting a string. My next step is to look at the training and see if I can retrain with improved outputs.

I do like the iterative Tasks ... it makes the code fairly simple to follow once I realized what was going on.

chanind commented 11 months ago

If it's helpful, I did most of the training in Google Colab. The notebooks I used look like this one: https://colab.research.google.com/drive/1yoUBpqY1TwiqGCD1LuCy6UGs4cJ4pe-6?usp=sharing. If you get the improved frame element stuff working and open a pull request, I can try to get some new models trained and update the library if the performance looks good.

fatihbozdag commented 11 months ago

I've been working on a similar task and managed to align spaCy's token.idx with trigger_location so I can replace the location with the verb itself. However, it requires reconstructing sentences to insert the verbs at the correct positions. I wrote two scripts for the task, and it seems to be working, though the alignment depends on which function you use. In case it helps someone (and helps me improve my approach), it is basically something like the following. Meanwhile, I am having accuracy issues, particularly with passive structures and complex phrases. Has anyone tried training with T5-large?

import spacy
from frame_semantic_transformer import FrameSemanticTransformer

nlp = spacy.load("en_core_web_sm")
frame_transformer = FrameSemanticTransformer()

# verb_list: the set of verb lemmas of interest, defined elsewhere.
# df: assumed to be an iterable of (text, context) tuples, as required by
# nlp.pipe(..., as_tuples=True), e.g. zip(dataframe["text"], dataframe["id"]).

# Create a mapping from sentence to verb lemma and its
# sentence-relative character offset
sentence_to_verb_map = {}

for doc, context in nlp.pipe(df, as_tuples=True):
    for token in doc:
        if token.lemma_ in verb_list:
            sentence = token.sent.text
            # token.idx is document-relative; subtract the sentence's start
            # offset to make it comparable to trigger_location
            adjusted_token_idx = token.idx - token.sent.start_char
            sentence_to_verb_map[sentence] = {
                "lemma": token.lemma_,
                "adjusted_token_idx": adjusted_token_idx,
            }

# Initialize an empty list to store the results
extracted_sentences = []

# Detect frames in bulk for the collected sentences
sentences_to_analyze = list(sentence_to_verb_map.keys())
frame_results_list = frame_transformer.detect_frames_bulk(sentences_to_analyze)

for detect_frame_result in frame_results_list:
    sentence = detect_frame_result.sentence
    verb_info = sentence_to_verb_map.get(sentence, {})

    for frame_result in detect_frame_result.frames:
        # Keep only the frame whose trigger starts at the verb's offset
        if frame_result.trigger_location == verb_info.get("adjusted_token_idx", -1):
            # Collect frame elements by name
            frame_elements_with_names = {
                fe.name: fe.text for fe in frame_result.frame_elements
            }

            # Insert the verb lemma alongside the frame elements
            frame_elements_with_names["Verb"] = verb_info.get("lemma", "UNKNOWN_VERB")

            # Reconstruct the sentence as a flat "Name: text" listing
            reconstructed_sentence = ", ".join(
                f"{name}: {text}" for name, text in frame_elements_with_names.items()
            )

            # Store the result
            extracted_sentences.append({
                "Sentence": sentence,
                "FrameName": frame_result.name,
                "FrameElements": frame_elements_with_names,
                "ReconstructedSentence": reconstructed_sentence,
            })

fatihbozdag commented 11 months ago

An update on the issue: somehow, if you use the 'small' model, trigger_locations align well with Benepar's benepar_en3 model, since both are based on T5-small. If you use spaCy's tokenizer, the token offsets also align well with trigger_locations. However, if you prefer the 'base' model, things get more complicated and you may see occasional mismatches.
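
To make the alignment concrete, here is a minimal sketch of mapping a trigger_location back to a spaCy token (this assumes the trigger starts exactly on a spaCy token boundary, which is not guaranteed for every sentence):

import spacy
from frame_semantic_transformer import FrameSemanticTransformer

nlp = spacy.load("en_core_web_sm")
frame_transformer = FrameSemanticTransformer("small")

sentence = "The hallway smelt of boiled cabbage and old rag mats."
result = frame_transformer.detect_frames(sentence)
doc = nlp(sentence)

for frame in result.frames:
    # trigger_location is a character offset into the sentence; find the
    # spaCy token that starts at exactly that offset
    trigger_token = next(
        (tok for tok in doc if tok.idx == frame.trigger_location), None
    )
    if trigger_token is not None:
        print(frame.name, trigger_token.text, trigger_token.lemma_)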

yzc111 commented 5 months ago

So, is there a method to extract the lemma from the location?