chanind / frame-semantic-transformer

Frame Semantic Parser based on T5 and FrameNet
https://chanind.github.io/frame-semantic-transformer
MIT License

Can normal mortals use this too? #9

Closed sdspieg closed 1 year ago

sdspieg commented 1 year ago

This looks quite fascinating, but when I try to run it on some of our corpora (as opposed to just running some sentences through your demo), it always throws errors that neither your documentation nor ChatGPT can (so far) resolve. Therefore: could you share a Jupyter notebook or Google Colab that takes a text field/column from a JSON or CSV file as input and generates semantic frames? Thanks!

chanind commented 1 year ago

Could you give an example of the code you're running and the error you're getting? What are you doing to load the JSON or CSV?

chanind commented 1 year ago

I made a colab demonstrating how to use the library here: Open In Colab.

This colab is just a copy/paste of the contents of the docs page for this repo, but you can see that it works and is easy to use.

It sounds like your issue is that you're struggling to parse CSV or JSON with Python, which is out of the scope of this library. There are a lot of resources online about how to parse JSON or CSV with Python, so I won't try to diagnose here without seeing your specific code. Whatever you want to parse using frame-semantic-transformer, you just need to get it into a Python string, and then pass that string into this library.
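For example, if your data is in a JSON file, something like this is all that's needed (the file name and field name below are placeholders for whatever your data actually uses):

import json

from frame_semantic_transformer import FrameSemanticTransformer

# Hypothetical file shaped like: [{"text": "..."}, ...]
with open("corpus.json") as f:
    records = json.load(f)

frame_transformer = FrameSemanticTransformer()

for record in records:
    sentence = record["text"]  # pull out whichever field holds your sentence
    result = frame_transformer.detect_frames(sentence)
    print(f"Results found in: {result.sentence}")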

I'm going to close this issue now. If you continue to have problems, you can open a new issue.

sdspieg commented 1 year ago

I am sorry you took offense at my title. It really just referred to the fact that I am not a CS person, and I am deeply in awe of the NLP pros. But here is my Python code - and I did manage to (sort of) get it to work now:

import pandas as pd
from frame_semantic_transformer import FrameSemanticTransformer
from tqdm import tqdm

# Load the corpus into a DataFrame (file name here is a placeholder)
df = pd.read_csv("corpus.csv")

# Limit the DataFrame to the first 10 rows
df = df.head(10)

# Initialize the frame transformer
frame_transformer = FrameSemanticTransformer()

# Loop through each sentence in the DataFrame and display a progress bar
for sentence in tqdm(df['resolved_text_lemmatized']):
    result = frame_transformer.detect_frames(sentence)
    print(f"Results found in: {result.sentence}")
    for frame in result.frames:
        print(f"FRAME: {frame.name}")
        for element in frame.frame_elements:
            print(f"  {element.name}: {element.text}")

sdspieg commented 1 year ago

Ok - having played around with it a bit more, I can confirm that this is a) really usable even for non-experts (like myself - at least with the infinitely patient assistance of ChatGPT); and b) quite impressive. And that's just using the default settings. Now I'm not sure whether it's ok to ask questions like this in a 'closed' topic, but I'll run the risk. I have a dataframe of 167,413 sentences to which I applied (so far just intrasentential) coreference resolution in a resolved_text column. That's the column I fed into FrameSemanticTransformer. I let the code run on the first 1000 sentences, which took a little over 4 hours but really did generate very useful insights. At that rate, running the code on my entire corpus would - by my calculations - take close to a month...
So, in one word: help? I could possibly run this on a machine with an AMD Ryzen 7 3700X CPU, 96GB of RAM, and an Nvidia GeForce RTX 4090 GPU with 24GB of VRAM - would this code work on that? And how would I make sure that my code squeezes as much out of that system as possible? [And again - please forgive my clumsiness in expressing myself in the right terms, but I AM trying and I would genuinely love to get this to work!]

chanind commented 1 year ago

Your pandas code looks like it should work. 1000 sentences in 4 hours is extremely slow, though. Are you running this on a GPU or a CPU? If it's a CPU, I could imagine it being that slow, but any decent Nvidia GPU with CUDA should be much, much faster than that. A 4090 should easily handle this task with no problems; that's a super powerful GPU.

I also realize there's currently no method to batch-process multiple sentences in parallel, which would speed things up further on a GPU. I'll work on adding support for that as well.

sdspieg commented 1 year ago

Great! I have just purchased a 4090 (these GPU prices are just nuts!), but have also just realized that it won't fit in my computer. So I'll have to get myself another case tomorrow too :)

So will the code I have automatically use CUDA? Or do I have to add something?

At any rate - thanks much for that response and we're looking forward to batch processing as well, because we have some corpora that are quite a bit bigger than that...

chanind commented 1 year ago

You'll need to make sure you have CUDA and cuDNN installed, along with the PyTorch build that matches your CUDA version. I haven't done this myself, so you'll need to figure out the details, but there should be lots of tutorials online. Once that's set up, this library should automatically detect your GPU and use it.
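One quick sanity check, using plain PyTorch (nothing specific to this library), to confirm the GPU is visible:

import torch

# Should print True once CUDA, cuDNN, and a matching PyTorch build are installed
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the name of your RTX 4090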

chanind commented 1 year ago

v0.6.0 adds a detect_frames_bulk() method which you can use to batch-process sentences. There's also a batch_size param you can pass when creating a FrameSemanticTransformer instance to control how many sentences are processed per batch. If you have a GPU set up, you can experiment with the batch size to maximize the number of sentences the GPU can process in parallel before it runs out of memory. There's also a use_gpu param you can pass if you want to force it to use a GPU, raising an error if no GPU is found. For example:

# Force GPU usage and process 16 sentences per batch
parser = FrameSemanticTransformer(use_gpu=True, batch_size=16)
sentences = [
  "This is the first sentence.",
  "This is the next sentence.",
  ...
]
# Processes all sentences in batches of batch_size
results = parser.detect_frames_bulk(sentences)
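
Tying this back to your earlier pandas code, a sketch (assuming a df with a resolved_text_lemmatized column as above, and that detect_frames_bulk returns one result per input sentence, in order):

sentences = df['resolved_text_lemmatized'].tolist()
results = parser.detect_frames_bulk(sentences)
for result in results:
    print(f"Results found in: {result.sentence}")
    for frame in result.frames:
        print(f"FRAME: {frame.name}")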