Open lkcao opened 8 months ago
Palakodety, KhudaBukhsh, and Carbonell (2020) leverage a version of BERT fine-tuned on YouTube comments to mine deeper insights into larger trends of political perception and opinion in India. In particular, their study uses BERT's predictions for cloze questions to gauge these sentiments. While they highlight that their corpus is very messy -- hence demonstrating the robustness of the fine-tuning approach -- one aspect they do not seem to discuss is the representativeness of this corpus. It is well demonstrated that BERT can make robust predictions about the opinions of those captured in the corpus on which it is trained, but it is less clear that BERT's insights generalize to the larger population of people who do not comment on YouTube videos about Indian politics, yet who might still be relevant actors in the social world of interest.
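To make the probing setup concrete, here is a minimal sketch of the cloze-question approach described above: feed prompts containing a `[MASK]` token to a masked language model and tally its top fills across prompts. The `predict_fn` argument stands in for the fine-tuned model (e.g. a Hugging Face `fill-mask` pipeline over the authors' BERT); the stub predictor and its numbers below are purely illustrative assumptions, not the paper's actual templates or outputs.

```python
from collections import Counter

def aggregate_cloze_fills(prompts, predict_fn, top_k=1):
    """Tally the model's top fill(s) for each [MASK] prompt.

    predict_fn(prompt) should return a list of (token, probability) pairs
    sorted by descending probability -- the shape a fill-mask pipeline
    typically yields.
    """
    tally = Counter()
    for prompt in prompts:
        for token, _prob in predict_fn(prompt)[:top_k]:
            tally[token] += 1
    return tally

# Stub standing in for the fine-tuned model; the scores are hypothetical.
def stub_predict(prompt):
    return [("modi", 0.62), ("gandhi", 0.21), ("nobody", 0.05)]

print(aggregate_cloze_fills(["[MASK] will win the election."], stub_predict))
```

Swapping `stub_predict` for a real pipeline call would let the same aggregation run over many comment-derived prompts, which is where the representativeness question bites: the tally only reflects whoever wrote the comments.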
Are there established approaches for clawing back some of this generalizability to other groups when using a fine-tuned model for social insights, or is this a situation where the available corpora limit what we can do?
How has the decay of the dataset changed since publication? For example, there have been many developments since this was published -- has our advance to LLMs changed the rate at which fine-tuning datasets like this one decay?
I think this is a very interesting study. I am wondering whether it is possible to use the probability of the predicted masked token as a measurement of differences in the corpus's opinion. For example, when asked "[MASK] will win," the model assigns a higher probability to Modi than to Gandhi; can the difference in probability tell us how much more likely the commenters think Modi is to win, compared to Gandhi? Also, it is known that BERT has its own pre-existing opinions and biases. How can we make sure the fine-tuned model reflects the corpus's opinion, rather than BERT's own, without a validation dataset?
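The comparison proposed above can be sketched as follows: at the `[MASK]` position a masked LM produces one raw score (logit) per vocabulary token, softmax turns these into probabilities, and the log-odds `log(p_a / p_b)` gives a signed measure of how strongly the model prefers one candidate fill over another. The logits below are made-up numbers for illustration, not real model output, and whether this gap actually tracks commenters' aggregate opinion (rather than BERT's priors) is exactly the open validation question raised here.

```python
import math

def relative_preference(logits, a, b):
    """Compare two candidate fills for a single [MASK] slot.

    logits maps candidate token -> the model's raw score at the mask
    position. Returns (p_a, p_b, log-odds of a over b).
    """
    z = sum(math.exp(v) for v in logits.values())          # softmax normalizer
    p = {tok: math.exp(v) / z for tok, v in logits.items()}
    return p[a], p[b], math.log(p[a] / p[b])

# Hypothetical logits at the mask position of "[MASK] will win."
logits = {"modi": 4.1, "gandhi": 2.9, "congress": 1.5}
p_modi, p_gandhi, log_odds = relative_preference(logits, "modi", "gandhi")
```

One property worth noting: under softmax, the log-odds between two tokens is just the difference of their logits, so the comparison is stable even though the probabilities themselves depend on the whole vocabulary.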
The study mentions the challenge of code-mixing and grammatical inconsistencies in social media text. I'm curious about the specific preprocessing steps and modifications made to the corpus to address these issues. How do these steps impact the model's performance and insight generation?
How does model accuracy change with live social media data? What methods could adjust it for new trends and behaviors?
Post questions here for this week's exemplary readings: