UChicago-Computational-Content-Analysis / Readings-Responses-2024-Winter


9. Large Multi-Modal Models (LMMMs) to Incorporate Images, Art & Video - Challenge #1

lkcao opened this issue 6 months ago

lkcao commented 6 months ago

Post your response to our challenge questions.

What image or audio data could be relevant to your course/thesis/life research question(s)? What intuitions do you have about the broad content patterns you would express across and between text, image, and/or audio data? Could pre-trained models allow you to unleash image or audio features that would allow you to parsimoniously test this to support your project?

yuzhouw313 commented 4 months ago
  1. Considering that my corpus consists of comments scraped from YouTube news videos, one piece of audio data extremely relevant to classifying the sentiment, emotion, or topic embedded in the comment section is the transcript obtained from the video itself via speech recognition.
  2. One intuition I have is that there is a correlation between the topic discussed in the news video, as recognized from the audio, and the discussion focus within the comment section. For example, if speech recognition shows the news reporter discussing claims that China released the COVID-19 virus as a bioweapon, I would expect the comments to contain negative sentiment and perhaps angry emotions.
  3. Pre-trained models like BERT or RoBERTa can help me extract and analyze features from both the audio (via transcripts) and the text data, while Speech2Text models can handle the speech recognition task.
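
A minimal sketch of how this could look with Hugging Face pipelines (the Whisper checkpoint, the default sentiment model, and the file name are illustrative assumptions, not the poster's actual setup):

```python
# Hypothetical sketch: transcribe audio extracted from a YouTube news video, then score sentiment.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")  # assumed checkpoint
sentiment = pipeline("sentiment-analysis")  # library-default sentiment checkpoint

transcript = asr("news_clip.wav")["text"]   # "news_clip.wav" is a placeholder file name
print(sentiment(transcript[:512]))          # truncate long transcripts before classification
```
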
XiaotongCui commented 4 months ago
  1. Relevant Image Data: For the research question on how wealth influences the sentiment and tone of messages on OkCupid, dating profile photos would indeed be highly relevant. These images can reveal a lot about the demographics, lifestyle, and perhaps even the perceived social status of the individuals involved. Here are some aspects of dating profile images that could be explored:

Perceived Wealth Indicators: This includes attire, accessories, and background settings that might suggest wealth or status.

Expression and Pose: Confidence might be reflected in how individuals present themselves in photos.

Lifestyle Clues: Images might show travel to exotic locations, attendance at expensive events, or possessions that imply affluence.

  2. Intuitions: Individuals with higher reported incomes may post images featuring expensive clothing, luxury items, or glamorous settings.

  3. Use of Pre-Trained Models: Pre-trained models in computer vision could be immensely helpful for extracting features from these dating profile images. For instance:

Image Classification: models trained on large image datasets could classify photos into categories such as "high wealth indicators" and "low wealth indicators."

Object Detection: detecting specific items like luxury cars, designer clothing, or exclusive venues in the images.

Facial Expression Analysis: understanding the emotions or confidence levels expressed by individuals in their profile pictures.
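
One hedged way to operationalize the "wealth indicator" classification without labeled training data would be zero-shot scoring with CLIP; the candidate labels and file path below are illustrative assumptions, not a validated coding scheme:

```python
# Sketch: score a profile photo against candidate wealth-related labels with CLIP (zero-shot).
from transformers import pipeline

clf = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")
labels = ["luxury clothing or accessories", "casual everyday setting", "travel to an exotic location"]
print(clf("profile_photo.jpg", candidate_labels=labels))  # placeholder image path
```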

Marugannwg commented 4 months ago

My dataset on game/drama plots is intrinsically linked to images: the major characters engaged in dialogues all have respective profile pictures, and some stories have art or video accompanying them, not to mention the enormous amount of fan art.

I'm trying to capture the archetypes presented in the dialogue interactions, and there are probably similar archetypes to be observed in the character pictures (as interpreted by a pre-trained image model).

Using pre-trained computer vision models, the character profile pictures could be grouped into different categories. I'm curious whether such a categorization corresponds to the archetypes detected from their dialogues, e.g., protagonist vs. antagonist.
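
A rough sketch of that grouping idea, assuming the portraits are local image files: embed each character picture with CLIP's image encoder, cluster the embeddings, and check whether the clusters line up with dialogue-based archetypes. The paths, number of clusters, and checkpoint are assumptions:

```python
# Sketch: cluster character profile pictures by CLIP image embeddings.
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = ["char_01.png", "char_02.png", "char_03.png"]   # placeholder portrait files
images = [Image.open(p).convert("RGB") for p in paths]
inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    embeddings = model.get_image_features(**inputs)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings.numpy())
print(dict(zip(paths, clusters)))   # do the clusters track protagonist vs. antagonist?
```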

floriatea commented 4 months ago

In the context of my telehealth research, images of telehealth platforms’ interfaces can be analyzed to understand usability and accessibility features. Patterns in layout design, iconography, and navigation elements can offer insights into how user-friendly these platforms are for diverse patient populations. Images used in patient education (infographics, instructional images...) can reveal the types of information prioritized by telehealth services and how effective they are. In some telehealth fields, especially dermatology, quality images are central to the accurate diagnosis process.

Audio recordings of telehealth sessions can provide insights into communication dynamics, language use, and patient-provider rapport in a remote setting. Audio data from interactions with voice-assisted telehealth services or virtual health assistants can be analyzed to understand user queries, the effectiveness of AI responses, and areas for improvement in voice recognition accuracy.

Intuition: such integration could indicate a high-quality telehealth service that accommodates various patient needs, including those of patients with visual or hearing impairments. It could help improve accessibility, inclusivity, and the overall quality of care provided through telehealth platforms, but it also raises substantial privacy concerns.

Pre-trained models like convolutional neural networks (CNNs) can be used to analyze telehealth interface screenshots for usability studies or to identify features in diagnostic images, and transfer learning can adapt these models to specific telehealth contexts with limited dataset sizes. Speech recognition and natural language processing (NLP) can transcribe patient-provider conversations, analyze sentiment and medical terminology, identify common themes or concerns raised during telehealth sessions, and improve accuracy in clinical contexts.
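
A minimal transfer-learning sketch for the screenshot idea, assuming only a small labeled set of interface screenshots (say, "accessible" vs. "not accessible"); the backbone is frozen and only a new classification head is trained:

```python
# Sketch: adapt a pretrained ResNet to a small, hypothetical telehealth-screenshot dataset.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                    # freeze the pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 2)      # new head for two assumed usability labels

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# ...then train for a few epochs on (screenshot_tensor, label) batches from a DataLoader.
```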

sborislo commented 4 months ago

Images of videogame lootbox items would be useful for examining the influence of the "busyness" of these items on their desirability. I would expect the broad pattern that "busier" lootbox items are more desirable, as reflected either in online text (e.g., in discussion forums) or in the listed categorical rarity of the items in the lootbox. I believe pre-trained models could help enormously with answering this question, or with revealing that some other visual feature is driving rarity ratings and desire for the items, by examining which image features are correlated with these outcomes. This could be the number of distinct "sub-images" within an image of a lootbox item (as predicted), or certain colors, texture complexity, and so on.
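
Two crude proxies for how "busy" an item image is could even be computed before reaching for a full pre-trained model; edge density and color variety below are illustrative features, not a validated complexity measure, and the file name is a placeholder:

```python
# Sketch: rough visual-complexity features for a lootbox item image.
import numpy as np
from PIL import Image, ImageFilter

gray = Image.open("lootbox_item.png").convert("L")
edges = np.array(gray.filter(ImageFilter.FIND_EDGES))
edge_density = (edges > 50).mean()                      # share of pixels sitting on an edge

rgb = np.array(Image.open("lootbox_item.png").convert("RGB"))
color_variety = len(np.unique(rgb.reshape(-1, 3), axis=0))

print(edge_density, color_variety)   # correlate these with rarity labels or forum desirability
```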

h-karyn commented 4 months ago

My research is in psychometrics and LLM agents. One possible use of LMMMs is to simulate the profiles of individuals with different psychological traits, which would help us obtain a more direct perception of different kinds of people. To stretch this further, we could use agents and LMMMs to simulate characters in video games, complete with audio and images.

donatellafelice commented 4 months ago

What image or audio data could be relevant to your course/thesis/life research question(s)? I could use video data and audio data in my research. Text is actually somewhat limiting in conversational research, because conversations are, in themselves, special, especially face-to-face, non-text-based conversations. There is so much richness in video and audio recordings that is often not captured in text: eye movements, sighs, or pauses.

What intuitions do you have about the broad content patterns you would express across and between text, image, and/or audio data? I imagine that one interesting test would be to evaluate how much eye contact is made during a video call and how distracted or how much the person is multitasking. I would imagine that the attention given would be a very interesting marker of various social and psychological games. Specifically, I can see eye contact in a debate being a sign of escalation; someone is unlikely to be very distracted if the subject is something they care about. I am also curious whether muting and unmuting during Zoom calls, and thus removing backchannel and subtle modes of agreement, changes a conversation significantly.

Could pre-trained models allow you to unleash image or audio features that would allow you to parsimoniously test this to support your project? I think audio analysis would be a great place for me to start, as there are various conversations I have had with different people disagreeing about the same topic while using many similar words. It would be interesting to see how the different audio qualities of their voices, saying those words, would be vectorized or embedded.
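
One hedged way to get at the "audio qualities of their voices" would be to mean-pool wav2vec 2.0 embeddings of each recording and compare them; the file names, resampling step, and checkpoint below are assumptions:

```python
# Sketch: compare two speakers' recordings via pooled wav2vec 2.0 embeddings.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

def embed(path):
    waveform, sr = torchaudio.load(path)
    mono = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)
    inputs = extractor(mono.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # shape: (1, time, 768)
    return hidden.mean(dim=1).squeeze()

a, b = embed("speaker_a.wav"), embed("speaker_b.wav")    # placeholder recordings
print(torch.nn.functional.cosine_similarity(a, b, dim=0))
```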

Caojie2001 commented 4 months ago

Since my research is about newspaper articles, a possible approach is to include the pictures printed in the newspapers in the analysis. As most official newspapers in China have been digitized, these pictures can be scraped in a similar way to the texts. Based on this dataset, we could further analyze patterns in newspaper pictures, such as the figures who appear in them, to further support conclusions drawn from the text dataset.

bucketteOfIvy commented 4 months ago

My project focuses on 4chan, which is an imageboard. So far, I've lacked any analysis of the images posted themselves, which image description methods might help me analyze.

Images on 4chan are sometimes the topic of discussion themselves (in posts that include only an image) and at other times are used alongside text to set up a discussion. Intuitively, I think the three most salient categories of images on the site (i.e., images that capture 90% of cases) are selfies (common on /passgen/), wojak-style memes, and screenshots of other content (e.g., tweets, news articles, etc.), although I could be very off base on this.

The main limitation of pre-trained models when applied to 4chan is that the site is filled with wojak memes, which the models are unlikely to have encountered before. Given that these memes are fairly unique, it would likely take quite a bit of fine-tuning to ensure that the models perform correctly. While this is true broadly, the issue matters less for text-extraction models, which might be able to extract useful amounts of text from some of the images posted on the site.
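
For the text-extraction route, a minimal sketch with Tesseract (the file name is a placeholder; wojak-style memes would still need fine-tuned vision models, as noted above):

```python
# Sketch: pull text out of screenshot-style images (tweets, headlines) posted in threads.
from PIL import Image
import pytesseract

extracted = pytesseract.image_to_string(Image.open("thread_image.png"))
print(extracted)   # feed this into the same text pipeline used for post bodies
```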

ethanjkoz commented 4 months ago

My research mainly focuses on discussion-board-like interactions and representations (i.e., interactions with users on targeted subreddits). Image data relevant to my topic would be images that themselves contain text, since these are a common occurrence on other social media platforms like Instagram. Broadly, I would suspect that on platforms like Instagram, the content of these posts might skew much more negative towards adoption. After a brief perusal of Hugging Face, there are definitely a handful of useful models for accomplishing this task.

michplunkett commented 4 months ago
  1. For our final project, I think we could make use of (presuming the data exist) audio files containing anti-reproductive-rights chants and images of signage supporting the same cause.
  2. Using these non-text forms of media, we could perform an analysis similar to what we're currently doing, see how they have changed over time, and check whether there are any similarities between the messaging in the chants/images and SCOTUS decisions or Congressional legislation.
  3. Similar to @yuzhouw313's project, pre-trained models could be used to parse speech out of both audio and video files to make them easier to analyze.
volt-1 commented 4 months ago

My project involves analyzing datasets from real dating app users, including fields such as age, sex, self-written bios, and religion. My intuition is that merging textual data with visual or auditory information could reveal nuanced user dynamics and preferences. Specifically, by employing pre-trained models such as DALL-E 3, Stable Diffusion, or Midjourney to generate images from clustered user profiles, we aim to scrutinize and visualize the stereotypes these AI systems might hold. By masking sensitive information like race in our inputs, we can critically assess how these models perpetuate or challenge societal stereotypes.
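
A minimal sketch of the generation step, assuming prompts are assembled from clustered profile fields with race masked out; the checkpoint, prompt wording, and the GPU assumption are illustrative:

```python
# Sketch: generate a portrait from a profile-derived prompt with Stable Diffusion.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")   # assumes a GPU is available

prompt = "portrait of a 29-year-old engineer whose profile lists hiking and 'agnostic'"  # assumed prompt
image = pipe(prompt).images[0]
image.save("cluster_3_portrait.png")   # compare generated images across clusters for stereotype cues
```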

runlinw0525 commented 4 months ago

Since I am analyzing responses to generative AI technologies in selected course syllabi from a particular public university in the United States, the majority of my data will be text. However, if I only extract the text, the overall structure of each course syllabus may be neglected in the analysis. If I instead treat the different course syllabi as images with different sizes and structures, it might be more helpful for studying the underlying patterns in attitudes towards generative AI. There are also some great image classification models available in the Hugging Face community.

ana-yurt commented 4 months ago

For my research, I am very interested in the cultural representation of minority groups in China. My intuition is that cultural representations of Turkic minorities (Uyghurs) overwhelmingly emphasize exotic and "Eurasian" features in female faces. I will explore whether pre-trained models can detect those traits.

QIXIN-ACT commented 4 months ago

In my research on fan fiction, incorporating fan art and fan videos is crucial for a comprehensive analysis. These visual and auditory elements could reveal trends similar to those in fan fiction, potentially including darker and more extreme themes. Pre-trained models, particularly for image analysis, offer a promising avenue for efficiently extracting thematic information from fan art. By comparing themes across text, images, and videos, I aim to uncover broader content patterns within fan communities. This approach will enhance our understanding of fan culture, allowing for a nuanced exploration of how themes manifest across different media forms.

yunfeiavawang commented 4 months ago

I am curious about what features make a short video go viral on TikTok. Empirically, we can tell that a short video consists of elements such as genre, characters, subtitles, and music, among which I suspect the music is the key component that drives a video to spread far and fast. It will be interesting to explore the shared features of viral short videos and see whether they predominantly use the same handful of songs, and what the shared characteristics of these songs are. Furthermore, we can inspect correlations between specific songs and the genre of the short videos; for example, "New Horizon" is always the background music for travel videos.
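
A hedged sketch of how the shared-song intuition could be tested, assuming the audio tracks have already been extracted from the videos (e.g., with ffmpeg); the feature choices and file name are illustrative:

```python
# Sketch: extract tempo and a rough harmonic fingerprint from a viral clip's audio track.
import librosa
import numpy as np

y, sr = librosa.load("viral_clip_audio.mp3")            # placeholder extracted audio file
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
chroma = librosa.feature.chroma_cqt(y=y, sr=sr).mean(axis=1)

print(tempo, np.round(chroma, 2))   # cluster these vectors to see whether viral videos reuse a few songs
```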

Twilight233333 commented 4 months ago

In my research, I studied the impact of presidential speeches on aid to Mexico. If I could incorporate video of the president's press conferences, expression analysis through computer vision might help me better grasp the president's attitude. I could do this using existing facial recognition and expression analysis code.

chenyt16 commented 4 months ago

In my course project, I'm examining abortion-related news, some of which are presented in video and audio formats. Analyzing these multimedia sources would enable me to expand my dataset.

I anticipate that by extracting captions, taglines, and conversations from these videos, we may observe a similar pattern (e.g. wording preference) as seen in other news articles from the same source.

I believe leveraging pre-trained models could be beneficial. Since directly transcribing video data into text might overlook important nuances, I'm curious if pre-trained models can aid in sentiment analysis by interpreting human facial and gesture expressions.

naivetoad commented 4 months ago

Data: photos of research facilities and laboratories, used to analyze how infrastructural development correlates with funding levels and academic output.

Patterns: image data could visually represent the growth or limitations experienced by research projects due to funding.

Pre-trained models: for images, convolutional neural networks (CNNs) could be used to identify and classify visual patterns related to research infrastructure or outputs. They could help identify correlations between funding levels and the scope or focus of research projects.

HamsterradYC commented 4 months ago

Data: images posted by users on social media platforms can be analyzed to understand emotional expressions in different contexts. These might include selfies, or pictures of surroundings, activities, or events that carry emotional significance for the user. For a comparative study, images depicting various environmental settings (e.g., urban vs. rural, indoor vs. outdoor) could be relevant; such images can help analyze how different environments influence online emotional expression.

Hypothesized patterns: images in natural settings (parks, beaches) are associated with positive emotions, while images in crowded or chaotic environments might correlate with negative emotions or stress.

Pre-trained models: models like ResNet or Inception could help identify specific objects, settings, or activities in images that are associated with certain emotional expressions. This can automate the process of categorizing images based on environmental elements. We could also use Google's cloud APIs or other cloud services to classify image elements. For analyzing captions and comments, pre-trained NLP models (e.g., BERT, GPT) can be used to understand the sentiment and emotional tone of the textual content accompanying images.

erikaz1 commented 4 months ago

One significant challenge I encountered in my current project was sourcing properly digitized content for my corpus, particularly older newspapers and books (dating two to three centuries back). While some books may have been scanned, they are often not readily available in text format, which is necessary for text analysis. To address this issue, I have considered leveraging pre-trained OCR models to extract text from scanned images of handwritten or printed pages. This would be highly useful for processing old diaries and journals. There are many OCR models with friendly user interfaces available for public use (though it seems some require more training than others). For my use case, these models would primarily serve data collection; applying them would not directly contribute to the analytical results.
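
A sketch of the OCR step for handwritten pages with a pre-trained TrOCR checkpoint; the assumption of one text line per image is a simplification (full pages would need line segmentation first), and the file name is a placeholder:

```python
# Sketch: transcribe a single handwritten line image with TrOCR.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("diary_line.png").convert("RGB")      # placeholder cropped line image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```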

Carolineyx commented 4 months ago

I would like to utilize image and audio data collected from various cities in the US to examine the ambiance of urban environments. I believe these data can complement geolocation social media text data, allowing me to understand the living or staying experience in different parts of a city. By analyzing objects and audio data, I can directly identify environmental factors that influence these experiences. There are ongoing global urban projects that collect and analyze sounds, images, and city sentiments. I intend to leverage these data to pre-train my model. Subsequently, by combining it with my nuanced emotion dataset, I can conduct more detailed hypothesis testing.

joylin0209 commented 4 months ago

My research materials are the content of posts in online forums. For image data, the following are a few intuitions:

  1. Text and picture association: we can analyze what kinds of pictures posters attach to their posts. For example, are there images that correspond to the content of the post, or images that might carry a hidden meaning, such as a sarcastic meme?

  2. Picture content and emotion: we can explore whether the emotional expression presented in the picture echoes or contradicts the remarks in the post. This helps me understand the role of the image in the post and its emotional impact on the reader.

To apply these data in my research, I can use pre-trained models, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), to extract image and audio features. These features can be used to build models that analyze the relationship between text, images, and audio in posts and further our understanding of misogyny in online forums.
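
A rough sketch of the text-picture association check: score how well an attached image matches the post text with CLIP. The checkpoint and placeholder inputs are assumptions, and sarcasm or hidden meanings will not be captured well by a raw similarity score:

```python
# Sketch: CLIP similarity between a post's text and its attached image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(text=["text of the forum post"], images=Image.open("attached_image.jpg"),
                   return_tensors="pt", padding=True)
with torch.no_grad():
    score = model(**inputs).logits_per_image.item()   # higher = image and text are more aligned
print(score)
```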

Brian-W00 commented 4 months ago

Our research question is about language differences between different types of communities. There might be some audio information; we could use a pre-trained model to transcribe the audio into text, or use a pre-trained model to analyze the speaker's sentiment directly.

Dededon commented 4 months ago

I'm curious about the implications of CV techniques such as image inpainting and deepfakes. We could conduct social-psychology experiments with images generated from text, or with edited real-life images, to see whether there is an effect on political psychology.

JessicaCaishanghai commented 4 months ago

Computer vision is a very hot topic. How can we construct robustness tests for these relatively new and cutting-edge methods to further validate them? Compared with other economic studies, the field lacks certain testing procedures.

cty20010831 commented 4 months ago

For me, I am interested in examining the dialogues between scholars during conference meetings, for instance, and in how dialogues flow to facilitate knowledge generation. I do not know much about pre-trained models for audio features, but I think they could definitely provide more insight than the textual data (of dialogues) alone.

beilrz commented 4 months ago
  1. The image and audio data relevant to my final project are the news images and audio I find in news media. I believe news bias can manifest not only in the text of news reporting, but also in the choice of images and audio (video) of political figures.
  2. My intuition is that news media will choose negative images of their political opponents.
  3. I plan to use GPT-4V for this task. It is pre-trained, albeit not cheap; I feel I need a comprehensive model to conduct a more model-free analysis of the images. I could also consider using a framework to extract the emotions and expressions of political figures, at a cheaper cost.
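
A minimal sketch of the GPT-4V route via the OpenAI API (the model name, prompt, and image URL are illustrative placeholders; this is the paid API, so costs apply):

```python
# Sketch: ask GPT-4V to describe the framing of a news photo of a political figure.
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe the depicted politician's expression and whether the framing reads positive or negative."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/news_photo.jpg"}},
        ],
    }],
    max_tokens=200,
)
print(response.choices[0].message.content)
```
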
Vindmn1234 commented 4 months ago

The research would collect image data of various urban areas, capturing the presence and quality of green spaces. Audio recordings of ambient sounds in these areas could also be collected to gauge levels of noise pollution. By applying pre-trained neural networks, the study could efficiently analyze visual elements of greenery and audio indicators of tranquility or noise. The combined image and audio features, such as the amount of green space and the decibel levels, could then be correlated with survey data on community well-being, providing a multi-faceted view of the urban environment's impact on social health. This design integrates diverse data modalities to explore a complex social science question.

YucanLei commented 4 months ago

One relevant source would be image or video data, perhaps footage of the games or the games' trailers. Trailers can effectively shape consumers' expectations of a game, thus heavily influencing the sentiment expressed when the game launches.