Sepideh-Ahmadian opened this issue 1 month ago
Sounds great! I'll delve into the literature you've found, as well as any other papers that catch my eye, and summarize the points relevant to our goal in the Google Doc.
In reference to my report in the teams channel, here is my fork with my current implementation for you to take a look at:
Next week (or perhaps the week after, as I have a number of midterms next week) I'll be adding options to the DSG pipeline to evaluate LLM performance on explicit aspect reviews, like we discussed in today's meeting :)
By @CalderJohnson
Preliminary work on the DSG (Implicit Dataset Generation) pipeline

Hello all,
I've created a preliminary pipeline for generating an implicit aspect dataset. So far, I've done the following: I modified semeval.py and review.py so they can take optional arguments for reviews containing implicit aspects. The semeval loader can now load XML reviews that have NULL target values.
I modified the Review object to have an additional attribute, a boolean array named implicit. A value of True at implicit[i] indicates that the corresponding aos[i] refers to an implicit aspect. I then modified get_aos() to return an aspect term of "null" when retrieving the aos associated with an implicit aspect. I created a pipeline with a filtering stage that only keeps reviews with implicit aspects.
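Roughly, the Review change looks like this (a simplified sketch, not the actual LADy class; only `implicit`, `aos`, and `get_aos()` are real names from the code):

```python
# Sketch: a parallel boolean list marks which aos entries refer to implicit aspects.
class Review:
    def __init__(self, aos, implicit=None):
        self.aos = aos  # list of (aspect, opinion, sentiment) triples (illustrative layout)
        self.implicit = implicit or [False] * len(aos)  # implicit[i] flags aos[i] as implicit

    def get_aos(self):
        # Return "null" as the aspect term when the aspect is implicit
        # (later changed to None, per the discussion below).
        return [("null", o, s) if self.implicit[i] else (a, o, s)
                for i, (a, o, s) in enumerate(self.aos)]
```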
The pipeline then has a generation stage that leverages GPT-4o-mini to label each review with a term that fits the implied aspect.
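Roughly, the two stages look like this (a sketch assuming the official `openai` Python client; the prompt and helper names are placeholders, not the exact code in my fork):

```python
from openai import OpenAI  # assumes openai>=1.0

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def filter_implicit(reviews):
    """Keep only reviews that contain at least one implicit aspect."""
    return [r for r in reviews if any(r.implicit)]

def label_implicit_aspect(review_text):
    """Ask GPT-4o-mini for a single aspect term implied by the review."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You extract the implied aspect of a review."},
            {"role": "user", "content": f"Review: {review_text}\nReply with one aspect term."},
        ],
    )
    return response.choices[0].message.content.strip()
```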
Of course, this is a rough implementation, but it will serve as a baseline from which we can further tune/narrow our prompt structure, LLM choice, and other generation hyperparameters, as well as extend it to datasets other than semeval.
I've attached a screenshot of the labels generated from the toy dataset; the aos field contains the LLM-generated aspect term. So far, accuracy looks promising, but there is certainly room for improvement.
@CalderJohnson thank you very much, this is very nice. Just a quick note that in the code, there is an option where we specify how to treat a raw review in the dataset:
https://github.com/fani-lab/LADy/blob/c261acb313790cb53129947b4e889df18368eaa3/src/params.py#L12
Also, can we use None instead of "null"?
I've been modifying my preliminary pipeline to make components more modular and to incorporate multiple LLMs for the upcoming experiment. I've also resolved the None/"null" issue Dr. Fani mentioned above.
A challenge I've encountered is that the main Python interface to Google's Gemini (an LLM we planned to test for this task) requires Python 3.9 or newer (see its API reference).
I was wondering if there's a specific reason we are keeping LADy on Python 3.8. If there is, I can circumvent this by making the request directly against Google's REST API with Python's requests library. If not, I'll try updating and seeing if the pipeline still works. It would be nice to have modern Python features like the match statement as well.
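For reference, the REST fallback would look roughly like this (a sketch only; the model name is a placeholder, and the request/response shape follows Google's public generateContent endpoint):

```python
import os
import requests

def gemini_generate(prompt, model="gemini-1.5-flash"):
    """Call Gemini's generateContent endpoint directly, avoiding the 3.9-only client."""
    url = (f"https://generativelanguage.googleapis.com/v1beta/"
           f"models/{model}:generateContent")
    resp = requests.post(
        url,
        params={"key": os.environ["GEMINI_API_KEY"]},
        json={"contents": [{"parts": [{"text": prompt}]}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["candidates"][0]["content"]["parts"][0]["text"]
```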
Thank you, @CalderJohnson , for your update.
Currently, all the libraries used in LADy are based on Python 3.8. If we switch to Python 3.9, I think we will face a series of version compatibility issues.
This is true, although most libraries maintain backwards compatibility. I will try creating a new environment with the newest version of Python and of any libraries that need to be updated, and see if the pipeline still runs. If I run into compatibility issues, I'll just query the API manually for Gemini.
Sounds like a good plan!
Just an update: I've created the scaffolding for the experiment (coded the evaluation metrics, etc.), and ran it on the one model I have currently set up (gpt-4o-mini).
Graphed the results here (only tested on the toy example so far): results chart
They look poor, but I believe this is due to the way I checked whether the predictions were similar to the ground truth. My similarity threshold may have been too high, as it didn't pick up on pairs such as "food" and "chicken" (words that usually aren't very similar, but in the context of a restaurant should be treated as near-synonyms).
Next steps are to improve the way I measure similarity and of course integrate more models to test.
Also, let me know if I should use more/different evaluation metrics.
As you can see in the chart, "precision" and "exact matches" are the same, so my evaluator only counted a prediction as correct if the wording was identical. I'll be working on changing this and putting together an updated (more accurate) chart.
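For context, the softer matching I have in mind is roughly this (a sketch; the embedding model and threshold are placeholders to be tuned):

```python
from sentence_transformers import SentenceTransformer, util

# Cosine similarity between embeddings of the predicted and gold aspect terms,
# with a tunable threshold instead of an exact string match.
_model = SentenceTransformer("all-MiniLM-L6-v2")

def is_match(predicted, gold, threshold=0.5):
    emb = _model.encode([predicted, gold], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold
```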
Thank you @CalderJohnson for the update. We can also test the top-five results. I have an idea: I recently reviewed an article discussing data augmentation methods, particularly an alternative to synonym replacement. While synonym replacement may inadvertently shift sentiment, using hypernyms instead can help the model generalize terms without altering context. By providing examples like 'primate' as a hypernym for 'human' or 'food' as a hypernym for 'chicken,' we can encourage the model to recognize broader domain categories. This strategy could reduce instances where contextually accurate predictions are incorrectly marked as errors.
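A quick WordNet-based sketch of that hypernym check (assumes nltk with the WordNet corpus downloaded; the matching rule itself is just an illustration):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def is_hypernym_match(predicted, gold):
    """True if the gold term lies on any hypernym path of the predicted term,
    e.g. "food" is an ancestor of "chicken" in WordNet."""
    for p in wn.synsets(predicted, pos=wn.NOUN):
        for g in wn.synsets(gold, pos=wn.NOUN):
            if any(g in path for path in p.hypernym_paths()):
                return True
    return False

# is_hypernym_match("chicken", "food") -> True
```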
Good to know! I'll keep this in mind if I'm unable to get good results from comparing the embeddings alone.
We’re so happy to have you on board with the LADy project, Calder! We use the issue pages for many purposes, but we really enjoy noting good articles and our findings on every aspect of the project.
We can use this issue page to compile all our findings about LLMs for data generation. A great article to start with is "On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey", which you can also find in the team’s article repository.
The key questions we’re exploring are: Which language models perform best in data creation (considering the domain and the task at hand), and what are their advantages and disadvantages? As you go through the suggested paper and similar ones, feel free to add and suggest articles in both the Google Doc and here.
Once we've covered the research, we’ll dive into Q1, as mentioned by Hossein in today’s session, where we’ll test the LLMs on our gathered dataset.
If you have any questions, feel free to ask here and mention either me or Hossein!