Thinking-with-Deep-Learning-Spring-2024 / Readings-Responses

You can post your reading responses in this repository.

Week 8. May. 10: Multi-Modal Learning - Possibilities #16

Open JunsolKim opened 3 months ago

JunsolKim commented 3 months ago

Pose a question about one of the following articles:

“Online images amplify gender bias.” 2024. Guilbeault, Douglas, Solène Delecourt, Tasker Hull, Bhargav Srinivasa Desikan, Mark Chu, and Ethan Nadler. Nature.

“Using sequences of life-events to predict human lives.” 2024. Savcisens, Germans, Tina Eliassi-Rad, Lars Kai Hansen, Laust Hvas Mortensen, Lau Lilleholt, Anna Rogers, Ingo Zettler, and Sune Lehmann. Nature Computational Science.

“Color associations in abstract semantic domains.” 2020. D. Guilbeault, E. O. Nadler, M. Chu, D. R. Lo Sardo, A. A. Kar, B. Srinivasa Desikan. Cognition 201: 104306.

“UniDoc: Unified Pretraining Framework for Document Understanding” (2021)
“Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” (2020)
“TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data” (2020)
“The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes” (2020)
“ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks” (2019)
“A Review on Explainability in Multimodal Deep Neural Nets” (2020)
“A unified model of human semantic knowledge and its disorders” (2017)
“Deep fusion of multimodal features for social media retweet time prediction” (2021)

kceeyang commented 1 month ago

300-400 word reflection: We are exploring multi-modal transformers this week, and I found the article “VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text” to be a great supplementary reading for learning how transformers can be used for multimodal self-supervised learning and what advantages they offer over other methods. In this article, the authors first explain the two main challenges facing large-scale supervised training of Transformers. The first is that this strategy can produce biased systems, since it rules out vast amounts of unlabeled visual data. The second is that transformer performance is limited by this computationally intensive supervised training strategy, since it is costly to collect enough labeled data for training and for correcting biases.

Thus, the authors propose a convolution-free Video-Audio-Text Transformer (VATT) that takes large-scale, unlabeled visual data as input and learns vision, audio, and language representations. The architecture of VATT borrows from BERT and ViT; however, each modality keeps its own tokenization and linear-projection layer. In their experiments, a modality-agnostic variant shows that it is possible to share one backbone across all modalities. The authors also show that VATT outperforms ConvNet-based architectures in video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, they present DropToken, an efficient technique that significantly reduces training complexity with only a minor impact on the models' performance.

The authors’ introduction of the modality-agnostic, single-backbone VATT Transformer in this paper presents a valuable tool for social science research projects dealing with vast amounts of unlabeled, raw image or video datasets. This transformer can aid in learning semantic video/audio/text representations, thereby enhancing the understanding of complex social phenomena. The multi-modal self-supervised strategy it employs also reduces the model’s reliance on large-scale labeled data, making it a cost-effective solution. The DropToken technique, with its ability to support high-resolution input by randomly dropping “a portion of the video and audio tokens from each input sequence during training,” further enhances the practicality of the VATT Transformer in social science research.
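To make DropToken concrete, here is a minimal sketch of the idea as described in the paper (randomly dropping a portion of tokens from each input sequence during training). The function name, shapes, and drop rate are my own illustration, not the authors' code; in the actual model, positional encodings are applied before the transformer, which this sketch omits.

```python
import torch

def drop_tokens(tokens: torch.Tensor, drop_rate: float = 0.5) -> torch.Tensor:
    """Randomly keep a subset of tokens from each sequence (DropToken-style).

    tokens: (batch, seq_len, dim) video/audio tokens before the transformer.
    drop_rate: fraction of tokens to drop during training.
    """
    batch, seq_len, _ = tokens.shape
    num_keep = max(1, int(seq_len * (1.0 - drop_rate)))
    # Independent random permutation per example; keep the first `num_keep` indices.
    keep_idx = torch.rand(batch, seq_len, device=tokens.device).argsort(dim=1)[:, :num_keep]
    keep_idx, _ = keep_idx.sort(dim=1)  # preserve temporal order of surviving tokens
    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

# Example: a batch of 2 video token sequences, 196 tokens of dimension 768.
video_tokens = torch.randn(2, 196, 768)
reduced = drop_tokens(video_tokens, drop_rate=0.5)  # -> (2, 98, 768)
```

Because self-attention cost grows quadratically in sequence length, halving the tokens roughly quarters the attention compute, which is what makes high-resolution inputs affordable.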

guanhongliu2000 commented 1 month ago

I would recommend the article Multi-Model Fusion Framework Using Deep Learning for Visual-Textual Sentiment Classification written by Israa K. Salman Al-Tameemi, Mohammad-Reza Feizi-Derakhshi, Saeed Pashazadeh and Mohammad Asadpour in 2023.

The article presents a novel Multi-Model Fusion (MMF) framework for visual-textual sentiment classification that addresses the complexities of handling multimodal data in sentiment analysis. Traditional approaches often struggle with the nuanced relationships between different modalities like text and images, resulting in subpar classification accuracy. The proposed MMF framework integrates three deep neural networks to process and merge these modalities effectively, leveraging the strengths of each to improve overall sentiment analysis.

The MMF model consists of two separate neural networks designed to extract emotionally relevant features from both text and images. These networks focus on identifying the most discriminative features crucial for accurate sentiment classification. A third component, a multichannel joint fusion model, then combines these features using a self-attention mechanism. This model is designed to capitalize on the intrinsic correlations between the textual and visual data, assembling a more comprehensive sentiment analysis framework.

Further enhancing the MMF model's capabilities, a decision fusion strategy is employed to integrate the outputs of the three neural networks. This approach not only boosts the robustness of the model but also improves its ability to generalize across different datasets. The MMF framework is also designed to be interpretable, incorporating the Local Interpretable Model-agnostic Explanations (LIME) technique, which adds a layer of transparency by explaining the model's decision-making process.

Empirical evaluations of the MMF model on four real-world sentiment datasets demonstrate its superior performance compared to both single-modality models and other state-of-the-art multimodal approaches. The results show remarkably high classification accuracy across these datasets, confirming the effectiveness of the MMF framework in handling the complexities of multimodal sentiment analysis. The integration of deep learning techniques with advanced fusion strategies marks a significant advancement in the field, suggesting that the MMF model could be a valuable tool for applications requiring nuanced understanding of user sentiments across multiple data types.
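For intuition about the decision-fusion step, here is a minimal sketch of late fusion as a weighted average of the three sub-models' class probabilities. The weights and probability arrays are hypothetical, and this is a generic illustration rather than the MMF paper's exact strategy.

```python
import numpy as np

def decision_fusion(text_probs: np.ndarray,
                    image_probs: np.ndarray,
                    joint_probs: np.ndarray,
                    weights=(0.3, 0.3, 0.4)) -> np.ndarray:
    """Late (decision-level) fusion: weighted average of per-model class probabilities.

    Each array has shape (n_samples, n_classes); in practice the weights would be
    tuned on a validation set rather than fixed as here.
    """
    w_t, w_i, w_j = weights
    fused = w_t * text_probs + w_i * image_probs + w_j * joint_probs
    return fused.argmax(axis=1)  # predicted sentiment class per sample

# Hypothetical outputs of the three sub-models for 4 samples and 3 sentiment classes.
rng = np.random.default_rng(0)
text_probs = rng.dirichlet(np.ones(3), size=4)
image_probs = rng.dirichlet(np.ones(3), size=4)
joint_probs = rng.dirichlet(np.ones(3), size=4)
print(decision_fusion(text_probs, image_probs, joint_probs))
```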

XueweiLi1027 commented 1 month ago

For this week's reading, I recommend “A Novel Deep Learning Multi-Modal Sentiment Analysis Model for English and Egyptian Arabic Dialects Using Audio and Text.”

Reflection: The paper introduces a novel sentiment analysis model named Audio-Text Fusion (ATFusion), which leverages both audio and text data to detect emotions. This multi-modal approach employs local classifiers for each input type, followed by a fusion technique known as Group Gated Fusion (GGF). The model utilizes Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, and transformers as its building blocks. The ATFusion model was evaluated using the IEMOCAP dataset for English and the EYASE dataset for the Egyptian Arabic dialect, achieving high accuracy rates in both cases. The ATFusion model could significantly extend social science analysis by providing a more nuanced understanding of human interactions and emotional responses in various social contexts. For instance, in political science, it could be used to analyze public sentiment towards policies or leaders by processing speeches and social media texts.

A rough plan for implementing the ATFusion model could be: collect annotated text and audio data from sources like social media and customer service logs; preprocess the data, extracting audio features and normalizing text; train the model, adjusting for dataset imbalances with data augmentation; and evaluate the model's accuracy and robustness before integrating it into tools for real-time sentiment analysis. Social media APIs, customer interaction recordings (with consent), and political speech archives could provide diverse data.
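To make the fusion step more concrete, here is a minimal sketch of a generic gated fusion of audio and text embeddings in PyTorch. It is a simplified stand-in for the paper's Group Gated Fusion (GGF), not its exact formulation, and all feature dimensions are assumptions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Generic gated fusion of audio and text utterance embeddings.

    A learned gate decides, per dimension, how much to trust each modality.
    This is a simplified stand-in for the paper's GGF, not its exact formulation.
    """
    def __init__(self, audio_dim: int, text_dim: int, hidden_dim: int, n_classes: int):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, audio_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        a = torch.tanh(self.audio_proj(audio_feat))
        t = torch.tanh(self.text_proj(text_feat))
        g = torch.sigmoid(self.gate(torch.cat([a, t], dim=-1)))  # per-dimension gate in [0, 1]
        fused = g * a + (1.0 - g) * t
        return self.classifier(fused)

# Hypothetical batch: 8 utterances with 128-d audio features and 768-d text features.
model = GatedFusion(audio_dim=128, text_dim=768, hidden_dim=256, n_classes=4)
logits = model(torch.randn(8, 128), torch.randn(8, 768))  # -> (8, 4) emotion logits
```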

Question: The paper mentions the application of the model to both English and the Egyptian Arabic dialect, implying a need for the model to be sensitive to cultural nuances in emotional expression. Ensuring that a sentiment analysis model is effective across cultures is crucial, as the same words or vocal tones might convey different emotions in different cultural settings. Therefore, I was wondering how the ATFusion model handles the complexity and variability of emotional expressions in different cultural contexts, and what steps were taken to ensure the model's cross-cultural applicability.

erikaz1 commented 1 month ago

The Savcisens et al. (2024) paper is really fascinating! The level of granularity in their demographic data (down to wrist fractures, chemical engineering technicians, etc.) is incredible. From my understanding, all demographic/health data is converted into a “synthetic language” or some kind of text-based narrative summarizing the data, which is then used to train transformer models (44). From these models they then extract concept dimensions and tokens. My first question is, what might that underlying synthetic language look like? Is it as bland of a biography of each individual as possible?

I also found a couple of other things curious. (2) Why did the authors truncate their early-mortality prediction at just 4 years out? (3) Why is the dimension-reduced income “concept” shaped like a question mark, and not a line like the birth-year concept?

HongzhangXie commented 1 month ago

The study “Using sequences of life-events to predict human lives” draws on labor and health record data from the Danish National Registry for the years 2008-2016 to predict the likelihood of death between 2016 and 2020 for individuals aged 35-65. The authors employ a tool called life2vec to embed life-events in a single vector space, demonstrating that this embedding space is robust and highly structured. In terms of predictive performance, deep learning models such as life2vec, Recurrent Neural Networks (RNN), and feed-forward Neural Networks (NN) significantly outperform traditional methods like logistic regression and life tables. The authors trained both life2vec and an RNN on the same data and found that life2vec performed better. This may be due to the self-attention mechanism, which allows each token to interact with the entire sequence and thereby capture subtle long-term influences. Furthermore, the concept space of life2vec is entirely general, making it an interesting subject of analysis in its own right.

I suspect that some variables might leak short-term mortality information. For example, if a person's health records indicate a severe illness, the outcome might be relatively easy to predict. I am curious about the variation in predictive accuracy across different years when making predictions. For instance, if data from 2008-2016 were used to predict the mortality rates separately for 2017, 2018, 2019, and 2020, how significant would the variation in model performance be?
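To make the year-by-year comparison concrete, here is a minimal sketch assuming one already has per-person predicted mortality probabilities and observed outcomes labeled by outcome year. The column names and toy values are invented for illustration only.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical frame: one row per person, with the model's predicted mortality
# probability, the observed outcome, and the year the outcome refers to.
preds = pd.DataFrame({
    "year": [2017, 2017, 2018, 2018, 2019, 2019, 2020, 2020],
    "p_death": [0.10, 0.80, 0.20, 0.70, 0.05, 0.90, 0.30, 0.60],
    "died": [0, 1, 0, 1, 0, 1, 1, 0],
})

# AUC computed separately for each outcome year shows how much performance drifts
# as the prediction horizon moves further from the 2008-2016 training window.
per_year_auc = preds.groupby("year").apply(
    lambda g: roc_auc_score(g["died"], g["p_death"])
)
print(per_year_auc)
```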

Xtzj2333 commented 1 month ago

Using sequences of life-events to predict human lives

This paper is fascinating. Personality is indeed a nuanced concept, and I am surprised that the model could achieve high accuracy on a 5-point scale. I wonder whether there are ways to interpret the model. It would add to the psychology literature if we knew which life-event factors, according to the deep learning model, influence people's personality.

CYL24 commented 1 month ago

I recommend the article "Word embeddings quantify 100 years of gender and ethnic stereotypes."

This article proposed a novel approach to studying the temporal dynamics of gender and ethnic stereotypes using word embeddings. The authors leverage this technique to analyze changes in stereotypes over time by quantifying the associations between words related to gender, ethnicity, and neutral terms such as adjectives and occupations.

At first, the study demonstrates that word embeddings effectively capture historical biases and societal trends, as validated by comparisons with census data, historical surveys, and sociological literature. Further, by applying this framework, the authors uncover how stereotypes towards women and ethnic groups have evolved in the United States over the 20th and 21st centuries. They find significant correlations between embedding biases and demographic shifts, as well as changes in the portrayal of genders and ethnicities in literature and culture.

The article also demonstrates that Asian stereotypes in word embeddings over the 20th century actually reveal significant shifts in attitudes. Initially, negative descriptors like "barbaric" and "cruel" were associated with Asians, but by the 1980s, stereotypes shifted to more passive and complacent terms like "sensitive" and "passive." These changes coincided with major immigration waves and the emergence of second-generation Asian-Americans, which further shows that trends in embeddings reflect broader global shifts, such as decreasing associations between words related to outsiders and Asian stereotypes over time. Similar analyses were also conducted for stereotypes related to Islam and other ethnic groups, showing consistent patterns.

Overall, this article shows the great potential of applying word embeddings to complex social research questions. It offers a quantitative, data-driven approach to studying stereotypes, providing insights into how language reflects and shapes societal attitudes over time, which might be inspiring for us.
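As a small illustration of the kind of association measurement the paper builds on, the sketch below compares occupation words' average cosine similarity to female versus male word lists in an off-the-shelf embedding. The word lists are short and illustrative, and this is a simplified stand-in for the paper's relative-norm metric and decade-specific historical embeddings.

```python
import numpy as np
import gensim.downloader as api

# Small pretrained embedding for illustration (not the historical per-decade
# embeddings the paper uses); downloading it takes a minute the first time.
wv = api.load("glove-wiki-gigaword-50")

female_words = ["she", "her", "woman", "female"]
male_words = ["he", "his", "man", "male"]
occupations = ["nurse", "engineer", "librarian", "carpenter", "teacher"]

def mean_similarity(word, group):
    return np.mean([wv.similarity(word, g) for g in group])

# Positive score: closer to the female word list; negative: closer to the male list.
for occ in occupations:
    bias = mean_similarity(occ, female_words) - mean_similarity(occ, male_words)
    print(f"{occ:>10s}: {bias:+.3f}")
```

Repeating such a measurement on embeddings trained per decade is what lets the authors track how the associations shift over the 20th and 21st centuries.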

beilrz commented 1 month ago

Using Sequences of Life-events to Predict Human Lives

I think this is a very interesting approach. Representing the life course as vectors using life2vec indeed has many possible social science applications. One possible extension of this project is to consider an individual's social network, if such data are available. This would allow us to construct event sequences that also embed relevant people's life events (for example, the death of a relative). Another interesting direction is to examine where the model fails: if the model fails to predict a specific life event for a person, this suggests it was an unexpected event that may have a long-lasting impact on their life course.

HamsterradYC commented 1 month ago

I recommend FusionTransNet for Smart Urban Mobility: Spatiotemporal Traffic Forecasting Through Multimodal Network Integration

The study investigates FusionTransNet, a framework for spatiotemporal traffic forecasting in multimodal urban transportation systems. It aims to tackle the complex challenges in urban traffic resulting from the interactions among different modes such as taxis, buses, and bike-sharing systems. FusionTransNet integrates these diverse data streams to improve Origin-Destination (OD) flow prediction accuracy. The framework's three main components are the Intra-modal Learning Module, which analyzes spatial correlations within a single transportation mode; the Inter-modal Learning Module, which reveals interactions across modes; and the Prediction Decoder, which synthesizes insights and generates accurate flow predictions. Evaluated on datasets from major cities like Shenzhen and New York, FusionTransNet demonstrates superior predictive performance compared to existing state-of-the-art methods, leveraging its local-global fusion strategy.

This approach is particularly useful for understanding urban dynamics and human travel patterns. For instance, the framework could be used to predict social gatherings within urban spaces, aiding the sociological study of urban life and its dynamic changes under varying conditions. It can also provide insights into the social-equity implications of urban planning decisions. Particularly for large events or disaster response, understanding and predicting the flow of people can significantly enhance the effectiveness of management and response strategies.

Public transportation and mobile data can be used to analyze and predict urban social gatherings, which is especially useful for significant events like festivals, protests, or spontaneous assemblies. Public transportation data can be obtained from local transit agencies, including GPS coordinates of buses and subways, ticket information, and passenger boarding data. Mobile data can be sourced from telecommunications providers, recording location data related to calls and data usage. By aggregating and anonymizing this data, it is possible to identify patterns of travel and gathering for specific events or periods. Integrating this data provides innovative empirical research opportunities in the social sciences, offering new insights and solutions in urban planning, public safety, and social event management.
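As a rough illustration of the data structure such a framework consumes, here is a minimal sketch that stacks per-mode origin-destination (OD) matrices into a single spatiotemporal tensor. The modes, zone count, and random counts are all assumptions, not FusionTransNet's actual inputs.

```python
import numpy as np

# Hypothetical setup: 3 modes (taxi, bus, bike), hourly OD counts over one week,
# for a city partitioned into 50 zones.
modes = ["taxi", "bus", "bike"]
n_zones, n_hours = 50, 24 * 7

rng = np.random.default_rng(42)
# One OD matrix per mode per hour: od[mode, hour, origin_zone, destination_zone]
od = rng.poisson(lam=2.0, size=(len(modes), n_hours, n_zones, n_zones))

# Intra-modal view: the taxi sub-tensor on its own.
taxi_od = od[0]                                # (n_hours, n_zones, n_zones)

# Inter-modal view: stack modes as channels so a model can learn cross-mode interactions.
inter_modal_input = od.transpose(1, 0, 2, 3)   # (n_hours, n_modes, n_zones, n_zones)
print(inter_modal_input.shape)
```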

maddiehealy commented 1 month ago

I read Augmenting Social Science Research with Multimodal Data Collection: The EZ-MMLA Toolkit, which introduces an innovative web-based toolkit that augments social science research through deep learning by analyzing multimodal data. The toolkit facilitates the collection and analysis of complex data types (e.g., eye-tracking, heart rate, body posture, facial expression) without the need for specialized hardware, making advanced research tools more accessible to educators and social scientists.

I was intrigued by the deep learning aspect of the toolkit, as it was a data-collection method I had not encountered before. It offers a way to process and interpret diverse data efficiently and accurately. With this data, social scientists can gain insights into human behaviors and interactions that are often missed by traditional observational methods. For example, deep learning algorithms can detect subtle patterns in physiological data that indicate psychological shifts, something inaccessible to traditional observation. By leveraging deep learning, social scientists are able to gather more data than ever before, ultimately providing a more holistic understanding of human responses in social contexts.

I see potential for this toolkit in a range of social science applications. One area of interest could be analyzing communication patterns across different cultures. Are there subtle body-posture changes, eye-movement shifts, or facial expressions that only deep learning technology can detect? This could answer questions about how different cultures communicate non-verbally. I would also like to understand the gap between traditional observational methods and this new, augmented approach. How significant are these subtle changes, especially in the context of human communication and cross-cultural interaction strategies? Would applying a deep learning tool to detect subtle human movements during conversations be impactful, even if these movements are not perceptible to other participants in real time?
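As a concrete illustration of the kind of hardware-free multimodal capture such a toolkit automates, here is a minimal sketch that extracts body-posture landmarks from a single video frame with MediaPipe. The frame path is a placeholder, and this is my own example rather than code from the EZ-MMLA paper.

```python
import cv2
import mediapipe as mp

# Placeholder path; any frame extracted from recorded interaction video would do.
frame_bgr = cv2.imread("frame_0001.jpg")
frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)

# Body-posture landmarks from a single image, no specialized hardware required.
with mp.solutions.pose.Pose(static_image_mode=True) as pose:
    result = pose.process(frame_rgb)

if result.pose_landmarks:
    # 33 landmarks with normalized (x, y) coordinates and a visibility score;
    # aggregated over time, these become posture features for downstream analysis.
    for i, lm in enumerate(result.pose_landmarks.landmark[:5]):
        print(i, round(lm.x, 3), round(lm.y, 3), round(lm.visibility, 3))
```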

Pei0504 commented 1 month ago

The article "Online images amplify gender bias" delves into how online images, especially from platforms like Google and Wikipedia, accentuate gender bias more significantly than text. It underscores that images inherently convey and reinforce gender stereotypes more effectively, affecting both explicit and implicit perceptions of gender roles. This raises critical questions: What specific properties of online images make them more potent in transmitting gender biases? How can technological and societal interventions be designed to mitigate the impact of these biases as visual communication becomes increasingly dominant?

La5zY commented 1 month ago

I read “Using sequences of life-events to predict human lives.” The article discusses the development and refinement of predictive models capable of handling the complexities and variability inherent in human life-event data. This includes addressing challenges like the sparsity of events (life events are infrequent but significant) and the need for models to generalize well to unseen data while avoiding overfitting.

One of the key aspects covered is the selection and preparation of relevant datasets. The article notes that life events are typically documented across various online platforms and administrative records, which provides a rich but often unstructured source of data. The process of data cleaning and feature engineering is therefore crucial, as it involves transforming raw data into a format suitable for training predictive models. This might include encoding categorical data, normalizing numerical inputs, and handling missing values to improve the robustness and accuracy of the models.
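As a generic illustration of the preprocessing steps mentioned above (imputing missing values, encoding categorical variables, normalizing numerical inputs), here is a minimal scikit-learn sketch on invented columns. It is not the life2vec paper's actual pipeline, only a tabular example of the operations described.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical life-event records; column names and values are illustrative only.
events = pd.DataFrame({
    "event_type": ["diagnosis", "job_change", np.nan, "relocation"],
    "income": [42000.0, np.nan, 51000.0, 38000.0],
    "age_at_event": [34, 41, 29, np.nan],
})

categorical = ["event_type"]
numerical = ["income", "age_at_event"]

preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numerical),
])

X = preprocess.fit_transform(events)  # matrix ready for a downstream model
print(X.shape)
```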

The methodology section of the article delves into the specifics of the machine learning algorithms employed. It explains the use of neural network architectures like recurrent neural networks (RNNs) and convolutional neural networks (CNNs), which are particularly adept at processing sequential and spatial-temporal data, respectively. The choice of model often depends on the nature of the data and the specific types of life events being predicted.

An interesting point raised in the article is the application of techniques such as transfer learning and data augmentation to enhance model performance. Transfer learning allows models trained on one task to be repurposed for another related task, leveraging pre-learned patterns to overcome the scarcity of labeled data in specific domains. Data augmentation artificially increases the size and diversity of training datasets by creating modified versions of existing data points, which helps in improving the model's ability to generalize.

MarkValadez commented 1 month ago

This week I read and would recommend "Kolmogorov-Arnold Networks."

I found this article interesting on multiple fronts:

  1. They are presented as an improvement on MLPs. This is highly relevant because, regardless of the limited use cases for bare-bones MLP implementations, the notion of dense layers as part of more complex architectural designs is still very relevant.

  2. It introduced me to the Kolmogorov-Arnold Representation Theorem which I enjoyed as furthering of my mathematical understanding and applications of analytic mathematics onto the design choices of data pipelines.

  3. One of the main arguments the article presents is the improvement of model explainability by building B-Spline based representations in order to approximate "underlying functions" inherent in sample data.

In particular, the third component is what makes these networks relevant to this week's topic.
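For readers curious what a spline-parameterized edge looks like in practice, here is a minimal numpy/scipy sketch of a single KAN-style unit that sums learnable univariate B-spline functions of its inputs. The coefficients are random rather than trained, and this is my own toy illustration, not the authors' implementation.

```python
import numpy as np
from scipy.interpolate import BSpline

def make_spline(knots: np.ndarray, coefs: np.ndarray, degree: int = 3) -> BSpline:
    """One learnable univariate function phi(x); in a real KAN the coefficients
    would be optimized by gradient descent instead of drawn at random."""
    return BSpline(knots, coefs, degree, extrapolate=True)

rng = np.random.default_rng(0)
degree, n_coef = 3, 8
# Clamped knot vector on [-1, 1]; its length must be n_coef + degree + 1.
inner = np.linspace(-1, 1, n_coef - degree + 1)
knots = np.concatenate([[-1] * degree, inner, [1] * degree])

n_inputs = 2
# One spline per input edge of a single KAN "neuron": y = sum_p phi_p(x_p)
edge_splines = [make_spline(knots, rng.normal(size=n_coef)) for _ in range(n_inputs)]

x = rng.uniform(-1, 1, size=(5, n_inputs))        # 5 samples, 2 features
y = sum(edge_splines[p](x[:, p]) for p in range(n_inputs))
print(y)
```

Because each edge is an explicit univariate function, one can plot or symbolically approximate it after training, which is the explainability argument the article makes.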

While there are edge cases in representing non-smooth functions or time series (which can be thought of as fractal spaces and are of high research interest), certain domains such as image processing, or even smoothed time series (or series transformed by something like spectral analysis or an FFT), could be analyzed with KANs.

In the end, if we can clearly describe a phenomenon as a series of function compositions, then we can say we have reached the closest thing to an explainable solution within an analytic domain.

In other words, we are not left with a representation map of, say, a pool table that we then have to explain after the fact: that the scattered red pixels on a white canvas are actually a pool table because certain segments of proximal points along the vertical axis of the 2-D plane are likely the legs, and the proximal points that can be traced horizontally and diagonally are in fact the edges of the table and the cues.

It seems other representation methods are still missing that conceptual or semantic representation to be considered an "explanation."

mingxuan-he commented 1 month ago

For the life2vec paper, I'm particularly curious about the impact of missing or truncated data on model bias. Even if we trust the quality of Denmark's census process and assume everyone is recorded, wouldn't early mortality inevitably introduce truncation bias due to the lack of subsequent records?

00ikaros commented 1 month ago

I recommend "Empathy Through Multimodality in Conversational Interfaces" https://arxiv.org/abs/2405.04777

The document recognizes the necessity for stringent floodplain management due to the historical and potential future flooding which poses risks to life, property, and economic activities. The regulations are designed to comply with the National Flood Insurance Program (NFIP). Moreover, the document reveals some detailed descriptions of the floodplain management process are provided, including requirements for buildings and constructions in flood hazard areas. It specifies the roles and responsibilities of the Floodplain Administrator in ensuring compliance and managing development in these areas. Regulations specify the requirements for the construction and modification of buildings in flood hazard areas. This includes elevating buildings to specified heights, using flood-resistant materials, and ensuring that any development is robust against flood conditions. It addresses zoning appeals and variances, providing guidelines under which exceptions to standard floodplain management practices may be granted. This includes considerations of hardship, community impact, and overall safety. By enforcing these regulations, the community enhances its resilience against flooding. This means less disruption to daily life and economic activities following flood events, as the infrastructure is better prepared to handle such situations. Nevertheless, compliance with these regulations affects the cost of flood insurance. Buildings that adhere to or exceed these standards may benefit from lower insurance premiums due to reduced risk, whereas non-compliance could lead to higher costs.

kangyic commented 1 month ago

Using sequences of life-events to predict human lives

How do you envision the findings of this research being applied in real-world settings, such as healthcare, education, or social policy? What are the ethical considerations and implications associated with using predictive models based on detailed life-event data? How can this methodology be extended or adapted to explore other populations or datasets, and what new insights could be gained?

hantaoxiao commented 1 month ago

  1. In the article "Using sequences of life-events to predict human lives," how might the implementation of the life2vec model vary when applied to different cultural contexts? What specific life events or societal factors should be considered when adapting this model for diverse populations?
  2. The paper "Online images amplify gender bias" explores how visual content can reinforce stereotypes more effectively than text. What innovative strategies could be developed to counteract this bias in digital media, and how could AI and machine learning be used to promote gender equity in visual representations?

anzhichen1999 commented 1 month ago

Given that the pretraining tasks include Masked Sentence Modeling (MSM), Visual Contrastive Learning (VCL), and Vision-Language Alignment (VLA), all designed to enhance the model's ability to understand and represent documents in a multimodal context, I have the following question: based on the unique challenges and innovations introduced by UDoc in the realm of document understanding, particularly its integration of visual and textual information and the use of multimodal embeddings, what are the potential advantages and limitations of this approach in practical applications such as form understanding and receipt processing? How does the hierarchical transformer encoder contribute to handling complex document structures, and what are the key considerations in ensuring the effective alignment of visual and textual data within this framework?

00ikaros commented 1 month ago

How does the Implicit Association Test (IAT) measure mental associations between target pairs and category dimensions, and what is the theoretical basis for expecting faster response times for consistent mental associations? In the context of your study on judge decisions about whom to jail, how did you design the IAT using the iatgen tool, and what specific target pairs and categories were used? Additionally, how do you standardize reaction times to calculate the D score, and what does this score indicate about implicit biases? Finally, how did you measure the strength of gender associations in Google images and textual descriptions encountered by participants, and how were these associations evaluated against participants' explicit and implicit biases?
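As background for the D-score part of the question, here is a minimal sketch of the conventional scoring idea: the latency difference between incompatible and compatible blocks, scaled by the pooled variability. It omits the error-penalty and trial-exclusion steps of the full scoring algorithm, and the reaction times are invented.

```python
import numpy as np

def iat_d_score(compatible_rt: np.ndarray, incompatible_rt: np.ndarray) -> float:
    """Simplified IAT D score: difference in mean response latencies between the
    incompatible and compatible blocks, divided by the pooled standard deviation
    of latencies from both blocks."""
    pooled_sd = np.std(np.concatenate([compatible_rt, incompatible_rt]), ddof=1)
    return (incompatible_rt.mean() - compatible_rt.mean()) / pooled_sd

# Hypothetical reaction times in milliseconds for one participant.
compatible = np.array([620, 580, 710, 650, 600], dtype=float)
incompatible = np.array([840, 790, 900, 760, 820], dtype=float)
print(round(iat_d_score(compatible, incompatible), 2))  # positive -> faster on compatible pairings
```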

icarlous commented 1 month ago

The article “Online Images Amplify Gender Bias” examines how platforms like Google and Wikipedia reinforce gender stereotypes through images more than text. What makes images so effective in transmitting biases, and how can we mitigate this impact as visual media grows?

Carolineyx commented 1 month ago

For this week, I would like to recommend: "Attention and Meta-Heuristic Based General Self-Efficacy Prediction Model From Multimodal Social Media Dataset"

The article presents a comprehensive approach to predicting General Self-Efficacy (GSE) from social media data. The study employs both tool-based and deep learning-based methods to extract features from Facebook statuses and profile photos. In the tool-based approach, LIWC and BERT are used for text feature extraction, while Mediapipe and DeepFace are used for image features. The deep learning-based method incorporates BERT, 1D-CNN for text, and UNet++, VGG16, and ResNet-152 for image feature extraction, with features fused via Canonical Correlation Analysis (CCA) and co-attention mechanisms. The models demonstrate high accuracy in predicting GSE scores, with the hybrid model combining text and image features showing the best performance.
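To illustrate the CCA fusion step, here is a minimal scikit-learn sketch that projects hypothetical text and image feature matrices into a shared correlated space and concatenates the canonical variates. The dimensions and random data are invented, and the article's full pipeline (specific encoders, co-attention) is not reproduced here.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_users = 300
text_feats = rng.normal(size=(n_users, 128))   # e.g., pooled text embeddings per user
image_feats = rng.normal(size=(n_users, 64))   # e.g., CNN profile-photo embeddings

# Project both modalities into a shared 16-dimensional maximally correlated space.
cca = CCA(n_components=16)
text_c, image_c = cca.fit_transform(text_feats, image_feats)

# A simple fused representation: concatenate the canonical variates, then feed
# them to a downstream regressor predicting the GSE score.
fused = np.concatenate([text_c, image_c], axis=1)   # (n_users, 32)
print(fused.shape)
```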

Extending Social Science Analysis:

The methodologies described in the article can extend social science analysis by providing deeper insights into how individuals' self-efficacy can be predicted from their online behavior and interactions. By leveraging multimodal data from social media, researchers can explore the relationship between digital behavior and psychological attributes, such as self-efficacy. This approach can be applied to various domains, such as understanding the impact of social media on mental health, identifying at-risk individuals, and tailoring interventions to improve well-being. Additionally, the integration of multimodal data allows for a more nuanced analysis of human behavior, capturing the complexity of social interactions and self-perception.

Pilot Use of Social Data:

To pilot the use of this approach in extending social science analysis, I propose a study focusing on predicting job performance and career success based on self-efficacy inferred from social media data. The social data required would include:

Textual Content: Posts, comments, and status updates from professional social media platforms like LinkedIn.
Visual Content: Profile photos, images shared in professional contexts, and visual elements of user profiles.
Engagement Data: Interaction metrics such as likes, shares, comments, and endorsements.
Demographic Information: Age, gender, education level, industry, and job role.

Brian-W00 commented 1 month ago

The article "Online images amplify gender bias" explores the relationship between the distribution of images and text in online platforms and gender bias. What specific technical or policy measures can be implemented to reduce this image-enhanced gender bias? Furthermore, how do these biases manifest themselves differently across cultural and national contexts?

erikaz1 commented 3 weeks ago

New possibility reading: “InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding” (Wang et al. 2024) introduces a video foundation model called InternVideo2. As of March 2024, it is the best model for video captioning, dialogue extraction, action recognition, and various other tasks related to understanding and reasoning across long contexts. The model was trained through a three-stage “progressive learning scheme” involving transfer learning and masked token estimation/reconstruction using vision, audio, and text encoders. In addition to this “cross modal contrastive learning”, InternVideo2 offers significant advancement by scaling the training process with a large-scale multimodal dataset. This enhances the model's spatiotemporal perception, semantic alignment across modalities, world modeling capabilities, and results in a state-of-the-art performance across a wide range of video and audio tasks.

InternVideo2 can be used to significantly extend social science analysis by offering advanced tools for analyzing and interpreting video data. By leveraging its superior performance in tasks such as video retrieval, captioning, and multi-choice video question answering, social scientists can gain further insights into human behavior, communication patterns, and social interactions as captured through video content. Its ability to align and fuse video with audio and text modalities enhances the accuracy and richness of video content analysis, enabling more nuanced studies in fields such as media studies, psychology, and sociology. Additionally, the model's strength in long video understanding and temporal grounding can support research into dynamic social phenomena or longitudinal studies, thus presenting a robust foundation for analyzing complex social processes over time.

I plan to experiment with InternVideo2 (clip-1B, s2-1B)’s action recognition and multimodal annotation capabilities. This will be tested on video footage of police interrogation/interviews. Specifically, the model will be tasked to produce a transcript of the text and behaviors within conversations captured in the footage. The pretrained InternVideo2 models are available on HuggingFace and accessible using an account and API token.
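A minimal sketch of the checkpoint-access step, assuming a Hugging Face account and access token. The repo_id below is a placeholder, not the actual InternVideo2 repository name, which should be looked up on the model's Hub page.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id: substitute the actual InternVideo2 checkpoint repository.
local_dir = snapshot_download(
    repo_id="OpenGVLab/InternVideo2-placeholder",
    token="hf_...your_access_token...",
)
print("Checkpoint files downloaded to:", local_dir)
```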