UChicago-Computational-Content-Analysis / Readings-Responses-2024-Winter


6. Large Language Models (LLMs) to Predict and Simulate Language - [E1] Palakodety, Shriphani, Ashiqur R. KhudaBukhsh, Jaime G. Carbonell. #26

lkcao commented 8 months ago

Post questions here for this week's exemplary readings:

  1. Palakodety, Shriphani, Ashiqur R. KhudaBukhsh, Jaime G. Carbonell. “Mining Insights from Large-Scale Corpora Using Fine-Tuned Language Models.” Frontiers in Artificial Intelligence and Applications, Volume 325: ECAI 2020.

bucketteOfIvy commented 7 months ago

Palakodety, KhudaBukhsh, and Carbonell (2020) leverage a version of BERT fine-tuned on YouTube comments to mine deeper insights into larger trends of political perception and opinion in India. In particular, their study uses BERT's predictions for cloze questions to gauge these sentiments. While they highlight that their corpus is very messy -- hence demonstrating the robustness of the fine-tuning approach -- one aspect they do not seem to discuss is the representativeness of this corpus. It is well demonstrated that BERT can make robust predictions about the opinions of those captured in the corpus on which it is trained, but it is less clear that BERT's insights generalize to the larger population of people who do not comment on YouTube videos about Indian politics, yet who still might be relevant actors in the social world of interest.
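For concreteness, this kind of cloze probing can be reproduced with off-the-shelf tooling. Below is a minimal sketch using the Hugging Face fill-mask pipeline; it is not the authors' code, and `bert-base-multilingual-cased` merely stands in for their fine-tuned model, whose weights I do not believe are public.

```python
# Minimal sketch of cloze-style probing (not the authors' code).
# "bert-base-multilingual-cased" is a stand-in for their fine-tuned BERT.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# A cloze probe in the spirit of the paper's election questions.
for prediction in fill_mask("[MASK] will win the election."):
    print(prediction["token_str"], round(prediction["score"], 4))
```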

Are there established approaches for clawing back some of this generalizability to other groups when using a fine-tuned model for social insights, or is this a situation where the available corpora limit what we can do?

donatellafelice commented 7 months ago

How has the decay of the dataset changed since publication? For example, there have been many developments in the field since this paper came out: has the shift to large language models changed how quickly fine-tuned models and their underlying datasets decay?

beilrz commented 7 months ago

I think this is a very interesting study. I am wondering whether it is possible to use the probability of the predicted masked token as a measurement of differences in the corpus's opinion. For example, when asked "[MASK] will win," the model assigns a higher probability to Modi than to Gandhi; can the difference between these probabilities tell us how much more likely commenters think Modi is to win compared to Gandhi? Also, it is known that BERT has pre-existing opinions and biases; how can we make sure the fine-tuned model reflects the corpus's opinion, rather than BERT's own, without a validation dataset?
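To make the proposal concrete, here is a rough sketch of how one could read off and compare those masked-token probabilities directly. Everything here is my own assumption: `bert-base-cased` stands in for the fine-tuned model, and the comparison only works if each candidate name is a single token in the model's vocabulary (multi-token names would need a different scoring scheme).

```python
# Sketch: compare the probabilities a masked LM assigns to two candidates.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

inputs = tokenizer("[MASK] will win.", return_tensors="pt")
# Locate the [MASK] position in the input sequence.
mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits

probs = logits[0, mask_idx].softmax(dim=-1)
for name in ["Modi", "Gandhi"]:  # assumes each name is a single vocab token
    token_id = tokenizer.convert_tokens_to_ids(name)
    print(name, probs[token_id].item())
```

Even if the probability gap is stable, whether it tracks the share of commenters holding each opinion is exactly the validation question above.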

joylin0209 commented 7 months ago

The study mentions the challenge of code-mixing and grammatical inconsistencies in social media text. I'm curious about the specific preprocessing steps and modifications made to the corpus to address these issues. How do these steps impact the model's performance and insight generation?
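As far as I can tell the paper does not publish its cleaning pipeline, so to make the question concrete, the following is purely my guess at the kind of light normalization one might apply to code-mixed YouTube comments before fine-tuning:

```python
# Hypothetical cleanup for code-mixed comments; not from the paper.
import re

def clean_comment(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"(.)\1{3,}", r"\1\1", text)  # squash "sooooo" -> "soo"
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

comments = ["Modi will win!!! soooooo sure https://youtu.be/xyz", "  acha   hai  "]
cleaned = [c for c in (clean_comment(c) for c in comments) if len(c.split()) > 1]
print(cleaned)
```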

Brian-W00 commented 7 months ago

How does model accuracy change when applied to live social media data? What methods could adapt it to new trends and behaviors?