Thinking-with-Deep-Learning-Spring-2024 / Readings-Responses

You can post your reading responses in this repository.

Week 7. May. 3: Sound & Image Learning - Possibilities #14

JunsolKim opened 3 months ago

JunsolKim commented 3 months ago

Pose a question about one of the following articles:

“Machine Learning as a Tool for Hypothesis Generation”, Jens Ludwig, Sendhil Mullainathan. The Quarterly Journal of Economics 2024.

“Machine learning approaches to facial and text analysis: Discovering CEO oral communication styles.” 2019. P. Choudhury, D. Wang, N. Carlson, T. Khanna. Strategic Management Journal 40(11):1705-1732.

Also see:

“Image Representations Learned with Unsupervised Pre-Training Contain Human-Like Biases” (2021)
“Towards real-time photorealistic 3D holography with deep neural networks” (2021)
“Sixteen facial expressions occur in similar contexts worldwide” (2020)
“Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States” (2017)
“Computer vision uncovers predictors of physical urban change” (2017)
“Nrityabodha: Towards understanding Indian classical dance using a deep learning approach” (2016)
“Galaxies, Human Eyes, and Artificial Neural Networks” (1995)

maddiehealy commented 2 months ago

300-400 word summary: I wanted to select an article that discussed sound learning and generation this week. I landed on an article published in November 2023 titled Child-to-Adult Voice Style Transfer, which examines the application of current voice style transfer models to children's voices, a previously unexplored area. The study demonstrates that while these models handle adult voices well, they face significant challenges with child voices. This insight suggests potential advancements for accessibility, voice style transfer, and audio preservation within deep learning.

Voice style transfer technology has extensive applications, from enhancing privacy and security by disguising voices to improving voice acting in the entertainment industry. The models tested, such as AutoVC, were primarily developed for adults, leaving a gap in their applicability to children's voices. The research employed three methods for child-to-adult voice conversion: traditional voice cloning, zero-shot AutoVC, and many-to-many voice style transfer with AutoVC trained on child-and-adult data. The voice cloning method preserved speech content but struggled with style transformation, whereas both AutoVC methods excelled in style transformation at the expense of losing the clarity and comprehensibility of the speech content.
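
To make the zero-shot setup concrete, here is a minimal PyTorch sketch of the AutoVC-style decomposition into a content encoder, a speaker encoder, and a decoder; all module choices, dimensions, and tensors are illustrative placeholders rather than the actual AutoVC implementation.

```python
import torch
import torch.nn as nn

class ToyVoiceConverter(nn.Module):
    """Illustrative AutoVC-style converter: content encoder + speaker embedding -> decoder."""
    def __init__(self, n_mels=80, content_dim=64, spk_dim=32):
        super().__init__()
        # Content encoder: squeezes speaker identity out of the source mel-spectrogram.
        self.content_enc = nn.GRU(n_mels, content_dim, batch_first=True)
        # Speaker encoder: summarizes a reference utterance into a fixed-size style vector.
        self.speaker_enc = nn.GRU(n_mels, spk_dim, batch_first=True)
        # Decoder: reconstructs a mel-spectrogram from content plus target speaker embedding.
        self.decoder = nn.GRU(content_dim + spk_dim, n_mels, batch_first=True)

    def forward(self, src_mel, ref_mel):
        content, _ = self.content_enc(src_mel)                      # (B, T, content_dim)
        _, spk_state = self.speaker_enc(ref_mel)                    # (1, B, spk_dim)
        spk = spk_state[-1].unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return out                                                   # converted mel-spectrogram

# Zero-shot use: the source is a child utterance, the reference is an unseen adult speaker.
child_mel = torch.randn(1, 120, 80)   # placeholder mel frames
adult_ref = torch.randn(1, 200, 80)
converted = ToyVoiceConverter()(child_mel, adult_ref)
print(converted.shape)  # torch.Size([1, 120, 80])
```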

These results highlight a crucial challenge: balancing content preservation with effective style transformation, especially considering the differences in speech patterns like pitch, cadence, and rhythm between children and adults. There is a pressing need to develop new models or adapt existing ones to handle the unique characteristics of children's speech more effectively.

For social science applications, this technology could revolutionize studies on communication dynamics and age-related biases. For example, altering the perceived age of a speaker without changing their speech content could help researchers study how perceived age affects authority, credibility, and empathetic responses. Implementing such studies would involve collecting diverse speech recordings from children and adults, maintaining consistent content across voice ages. Participants would provide feedback on these recordings through surveys or interviews, offering insights into their perceptions and decision-making processes. Collaborating with AI specialists to refine voice transformation techniques will be essential, pushing the boundaries of our understanding of social interactions and communication.

Pei0504 commented 2 months ago

The paper “Image Representations Learned with Unsupervised Pre-Training Contain Human-Like Biases.” discusses the presence of biases in unsupervised learning models. What are the current methods or techniques used to correct or mitigate these biases in image representation models? How effective are these techniques in practice, especially when dealing with complex biases like intersectionality? Considering that biases are embedded in unsupervised image models as shown in this study, what could be the long-term effects on societal perceptions if these biased models are widely used in applications like surveillance, advertising, or social media?
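
For context, the bias measurements in that paper build on embedding association tests. Below is a minimal sketch of the effect-size computation with random placeholder embeddings; in the study itself the embeddings come from unsupervised image models such as iGPT and SimCLRv2 applied to images of the target and attribute concepts.

```python
import numpy as np

def association(w, A, B):
    """Mean cosine similarity of embedding w to attribute set A minus attribute set B."""
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def eat_effect_size(X, Y, A, B):
    """WEAT/iEAT-style effect size between target sets X, Y and attribute sets A, B."""
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    pooled = np.std(sx + sy, ddof=1)
    return (np.mean(sx) - np.mean(sy)) / pooled

# Random placeholder embeddings; real audits use encoder features of concept images.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(8, 512)), rng.normal(size=(8, 512))
A, B = rng.normal(size=(8, 512)), rng.normal(size=(8, 512))
print(eat_effect_size(X, Y, A, B))
```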

kceeyang commented 2 months ago

300-400 word reflection:

After reading Chapter 20 on the DDPM, I found the paper “Improved Denoising Diffusion Probabilistic Models,” which suggests modifications to the DDPM that obtain better log-likelihood while maintaining high sample quality. In particular, its emphasis on the noise schedule also helped clear up my confusion about the use of schedulers, which I raised in the orientation reading post. The authors compare the noise schedule from the original DDPM paper with their own design, showing that the variance schedule can take different forms, such as linear (in the original paper) and cosine (in this paper). They found that the linear noise schedule used in the original design works well for high-resolution images but is suboptimal for 64 × 64 and 32 × 32 images. Also, based on the figure they show, the latent samples produced with the linear schedule are almost pure noise in the last quarter of the forward diffusion process. This visualization indicates that nearly all information from the original image is destroyed well before the end of the forward noising process, so those final steps contribute little to sample quality.

Thus, the authors propose a cosine schedule as a solution to these issues. In their design, the cumulative noise level (the coefficient that varies with the diffusion step t) falls off nearly linearly in the middle of the forward noising process. They also clip the per-step variance parameter to be no greater than 0.999 to prevent singularities near the end of the diffusion process (t = T), and use a small offset so that the noise level does not change abruptly near t = 0. As a result, the cosine schedule adds noise and destroys information more slowly than the linear schedule, offering a more gradual and controlled process.
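
A small sketch comparing the two schedules, following the formulas in the paper with illustrative hyperparameters:

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level under the original DDPM linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T=1000, s=0.008, max_beta=0.999):
    """Improved-DDPM cosine schedule: alpha_bar follows a squared cosine,
    with betas clipped at 0.999 to avoid singularities near t = T."""
    f = lambda t: np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = np.array([f(t) for t in range(T + 1)]) / f(0)
    betas = np.clip(1 - alpha_bar[1:] / alpha_bar[:-1], 0, max_beta)
    return np.cumprod(1.0 - betas)

lin, cos_ = linear_alpha_bar(), cosine_alpha_bar()
# The linear schedule destroys nearly all signal well before the end of the forward
# process, while the cosine schedule degrades the image more gradually.
print(lin[750], cos_[750])  # roughly 0.003 vs 0.14
```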

Considering the improvements made to the DDPM by the cosine noise schedule, they can be applied to social science research projects that work with image datasets of varying quality and resolution. The enhanced model promises a more effective noising process and a reduction in the number of sampling steps required, thereby increasing efficiency and improving results.

HongzhangXie commented 2 months ago

In "Machine learning approaches to facial and text analysis: Discovering CEO oral communication styles," the authors coded and analyzed interview text sentiment and static facial images to predict the likelihood of business mergers and acquisitions, as well as corporate growth.

This is a very interesting study. I noticed that the authors conducted independent analyses of both text and facial images. However, for video data, we can simultaneously observe the CEO's facial expressions when uttering specific statements. Could constructing a deep learning model that combines text and facial data based on the video timeline improve predictive ability? Additionally, with video data, we can analyze more aspects such as the CEO's tone of voice during conversation and the dynamic changes in facial expressions. I believe this would be very intriguing.
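
To make the suggestion concrete, here is a toy sketch of the kind of fusion model I have in mind: per-frame facial features and aligned text features combined along the video timeline and fed to a recurrent layer. Every component, dimension, and input is hypothetical rather than drawn from the paper.

```python
import torch
import torch.nn as nn

class TimelineFusion(nn.Module):
    """Toy fusion model: aligns per-frame facial features with utterance-level
    text features on a shared timeline and predicts an outcome (e.g. M&A activity)."""
    def __init__(self, face_dim=128, text_dim=300, hidden=64):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        # A recurrent layer over the fused timeline captures dynamics that
        # separate analyses of text and static images would miss.
        self.temporal = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, face_seq, text_seq):
        fused = torch.cat([self.face_proj(face_seq), self.text_proj(text_seq)], dim=-1)
        _, h = self.temporal(fused)
        return torch.sigmoid(self.head(h[-1]))  # predicted probability of the outcome

# Placeholder inputs: 50 aligned time steps of facial and text features for one CEO video.
faces = torch.randn(1, 50, 128)
texts = torch.randn(1, 50, 300)
print(TimelineFusion()(faces, texts))
```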

guanhongliu2000 commented 2 months ago

I would recommend the article "Image and audio caps: automated captioning of background sounds and images using deep learning", written by M. Poongodi, Mounir Hamdi and Huihui Wang in 2023.

This article presents a significant contribution to the field of Sound & Image Learning by demonstrating an innovative approach to automated captioning of background sounds and images using deep learning technologies. This topic is of particular relevance for several reasons that make the article an excellent study subject for those interested in multimedia systems, computer vision, and machine learning applications.

Firstly, the interdisciplinary nature of the study, combining elements from both image and sound analysis, is crucial as it reflects the growing trend in AI and machine learning towards creating more holistic and integrated systems. Traditional methods often tackled these domains separately, but the ability of the proposed model to simultaneously interpret and generate descriptions for both images and sounds represents a leap forward. This not only showcases the advancements in deep learning architectures—such as the integration of CNNs and RNNs—but also highlights the article's potential applications in creating more accessible technologies for the visually impaired, thereby underscoring its social impact.
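
As a rough illustration of that CNN-plus-RNN pattern (not the authors' implementation), here is a minimal encoder-decoder captioner with placeholder sizes and vocabulary:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ToyCaptioner(nn.Module):
    """Generic CNN encoder + RNN decoder captioner, in the spirit of the pipeline
    described in the article (all sizes and the vocabulary are placeholders)."""
    def __init__(self, vocab_size=1000, embed=256, hidden=256):
        super().__init__()
        cnn = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # global image feature
        self.img_proj = nn.Linear(512, embed)
        self.embed = nn.Embedding(vocab_size, embed)
        self.decoder = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, images, captions):
        feats = self.img_proj(self.encoder(images).flatten(1)).unsqueeze(1)  # (B, 1, embed)
        tokens = self.embed(captions)                                        # (B, L, embed)
        hidden, _ = self.decoder(torch.cat([feats, tokens], dim=1))
        return self.out(hidden)  # next-token logits over the caption vocabulary

logits = ToyCaptioner()(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 1000])
```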

The methodology described in the article, where two specialized models trained extensively on large datasets are combined, offers a practical framework for achieving high accuracy in automated caption generation. The Top 5 and Top 1 accuracy rates reported (67% and 53%, respectively) are impressive and suggest that the system is relatively robust. This provides a strong foundation for further research, as these methods can be refined and potentially applied in other contexts, such as real-time interpretation for live events or enhancing user interactions with digital platforms through more immersive and contextually aware media.

Moreover, the article's exploration into the use of in-the-wild sound data to train models introduces an innovative approach to dealing with the scarcity of labeled sound datasets, which are often expensive and difficult to produce. By utilizing unlabeled video data, the researchers demonstrate a cost-effective method of training that could democratize sound recognition technology, making it more accessible to researchers with limited resources.

Lastly, the article is well-positioned within the broader research landscape, providing a thorough review of related works while also identifying the gaps that their model addresses. This not only situates the paper well academically but also offers readers a comprehensive overview of the field's current state and its future directions.

Studying this article allows learners and researchers to engage deeply with cutting-edge techniques in AI, appreciate the complexities of integrating different sensory inputs in machine learning, and understand the practical challenges and potential societal benefits of such technologies. As such, it serves as both an educational resource and a source of inspiration for innovative future work in multimedia systems.

XueweiLi1027 commented 2 months ago

For this week's reading, I would recommend Towards recognizing facial expressions at deeper level: Discriminating genuine and fake smiles from a sequence of images

Reflection

The article addresses the challenge of discerning genuine emotions from posed ones, particularly focusing on smiles. It underscores the importance of facial expression recognition (FER) in understanding human emotions, with applications across various fields. The authors highlight that while significant progress has been made in FER, differentiating genuine emotions from posed ones, such as a spontaneous smile versus a frustrated one, remains complex. The paper presents a deep learning approach using a bidirectional LSTM with an attention mechanism to analyze facial expressions in video sequences. The proposed model achieved 98% accuracy on the MAHNOB database and was tested on the SPOS and MMI databases, yielding 87% and 97% accuracy, respectively.
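
To make the architecture concrete, here is a minimal sketch of a bidirectional LSTM with attention pooling over per-frame facial features; the feature dimensions and layer sizes are illustrative, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SmileBiLSTM(nn.Module):
    """Sketch of a bidirectional LSTM with attention pooling over per-frame
    facial features, classifying a smile sequence as genuine vs. posed."""
    def __init__(self, feat_dim=136, hidden=64):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # scores each frame's contribution
        self.cls = nn.Linear(2 * hidden, 2)    # genuine vs. posed

    def forward(self, frames):
        h, _ = self.bilstm(frames)              # (B, T, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        pooled = (w * h).sum(dim=1)             # weighted summary of the sequence
        return self.cls(pooled)

# Placeholder input: 30 frames of 68 facial landmarks (x, y) per frame.
seq = torch.randn(4, 30, 136)
print(SmileBiLSTM()(seq).shape)  # torch.Size([4, 2])
```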

The method presented in the article could extend social science analysis by providing a tool to study non-verbal communication and emotional expressions in social interactions more accurately. For instance, it could be used to analyze the authenticity of emotions expressed in political debates, interviews, or public speeches. Additionally, understanding genuine versus fake smiles can offer insights into areas such as marketing, where customer satisfaction is often assessed through feedback, including facial expressions. This method could also enhance the study of mental health by helping to detect subtle cues in patients' expressions that may indicate their emotional state.

To pilot the use of the proposed method, researchers could focus on a specific domain where emotional expression analysis is critical. For example, in political science, analyzing the authenticity of candidates' expressions during debates could provide insights into their emotional communication skills. The social data for this pilot would include video recordings of political debates, interviews, and public speeches of candidates.

Question

How did the authors ensure that the deep learning models were effectively capturing the nuances of genuine versus fake smiles, given the subtlety of the differences and the potential for overfitting on the training data? Moreover, how do the findings contribute to our understanding of emotional expression authenticity? What are the potential implications for applications in fields such as psychology, human-computer interaction, and digital media analysis?

risakogit commented 2 months ago

In "Machine Learning Approaches to Facial and Text Analysis: Discovering CEO Oral Communication Styles," the researchers acknowledge that one of the limitations of their research is its generalizability. Could conducting longitudinal studies that track CEOs' communication styles improve generalizability? Or is it the case that as long as you are analyzing the same individuals, significant differences in communication styles should not be expected?

uc-diamon commented 2 months ago

Regarding "Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States": what are some potential dangers of shifting from the ACS methodology to the Google Street View methodology?

anzhichen1999 commented 2 months ago

The study by Naik et al. applies a new computer vision method to measure physical changes in urban landscapes using street-level imagery. It involves computing a metric known as 'Streetscore', which quantifies the perceived safety of streetscapes based on visual features derived from images captured through Google Street View. The algorithm processes images by segmenting them into categories (ground, buildings, trees, sky) and calculating texture and structure features to evaluate the aesthetics and safety of the streetscapes. The researchers applied this methodology to a longitudinal dataset of images from five major U.S. cities, captured in 2007 and 2014. They calculated 'Streetchange' as the difference in Streetscore between these two years, indicating improvements or declines in neighborhood appearance. The study linked these visual assessments to demographic and economic data, exploring factors that predict physical improvements in urban neighborhoods.
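
A minimal illustration of the Streetchange computation described above, with made-up Streetscore values standing in for the computer-vision model's outputs:

```python
import pandas as pd

# Placeholder Streetscores; in the study these come from a model trained on
# texture/structure features of segmented Street View images (ground, buildings, trees, sky).
scores = pd.DataFrame({
    "block_id": [101, 101, 102, 102],
    "year": [2007, 2014, 2007, 2014],
    "streetscore": [12.4, 15.1, 18.9, 17.2],
})

wide = scores.pivot(index="block_id", columns="year", values="streetscore")
wide["streetchange"] = wide[2014] - wide[2007]   # positive = perceived improvement
print(wide)
```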

The results reveal three main predictors of neighborhood improvement:

1. Higher density of college-educated residents, supporting theories that relate human capital to urban success.
2. Better initial appearance, aligning with tipping models that suggest neighborhoods on a positive trajectory tend to continue improving.
3. Proximity to the city center and other attractive neighborhoods, corroborating invasion theories from urban sociology that suggest improvements spread from central and appealing areas outward.

These findings underscore the utility of integrating computer vision with traditional urban studies methods to analyze the dynamics of urban change, providing empirical support for classic theories of urban development.

With the use of computer vision to analyze urban transformation, how might reinforcement learning algorithms be applied to optimize the planning and development of urban spaces based on predictive analytics derived from visual data?

CYL24 commented 2 months ago

300-400 word reflection:

I would recommend the article: "Rethinking and Improving the Robustness of Image Style Transfer." After reading Chapter 10, I would like to further explore image style transfer. This research article sheds light on the architectural factors affecting neural style transfer performance using different convolutional network architectures. By uncovering the impact of residual connections on stylization quality and proposing a practical solution (Stylization With Activation smoothinG (SWAG)) to mitigate their effects, this work contributes to improving the robustness and effectiveness of style transfer algorithms across diverse network architectures.

Pre-trained VGG networks have proved capable of capturing the visual style of images. However, stylization quality often degrades significantly when the same algorithms are applied to features from more advanced networks like ResNet. The article finds that residual connections, a key architectural difference between VGG and ResNet, produce feature maps with low entropy, which are not ideal for style transfer. To address this, the authors propose SWAG, a simple yet effective softmax transformation of the feature activations that raises their entropy. This small adjustment greatly improves stylization results, even for networks with random weights, indicating that the architecture used for feature extraction matters more than the learned weights for style transfer tasks.
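
As a rough sketch of the core idea (not the authors' code), one can smooth feature activations with a softmax before computing the Gram matrices used in the style loss; the exact axis and temperature of the softmax follow the paper and are simplified here.

```python
import torch
import torch.nn.functional as F

def gram(feats):
    """Gram matrix of a feature map (B, C, H, W), the usual style statistic."""
    b, c, h, w = feats.shape
    f = feats.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def swag_gram(feats, temperature=1.0):
    """SWAG-style variant: smooth activations with a softmax (here across channels)
    to raise their entropy before taking the Gram matrix."""
    smoothed = F.softmax(feats / temperature, dim=1)
    return gram(smoothed)

# Placeholder activations standing in for a ResNet block's low-entropy feature maps.
content_feats = torch.randn(1, 256, 32, 32)
style_feats = torch.randn(1, 256, 32, 32)
style_loss = F.mse_loss(swag_gram(content_feats), swag_gram(style_feats))
print(style_loss.item())
```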

SWAG aims to enhance style transfer quality by adding activation smoothing to the loss functions used for stylization, thereby eliminating the dependency on the VGG architecture for robust performance. The researchers focus on assessing SWAG's effectiveness in enhancing style transfer performance across various network architectures. Compared with standard stylization for non-VGG architectures like Inception-v3 and Wide ResNet (WRN), SWAG shows improvements in transferring high-level style features. SWAG implementations of different stylization algorithms also show significant performance boosts, especially for randomly weighted models like ResNet, which previously struggled with stylization.

As examples, the article reports a user study on Amazon Mechanical Turk, which confirms SWAG's superiority over the standard VGG implementation, indicating its ability to eliminate the dependency on the VGG architecture for stylization. Ablation studies further demonstrate SWAG's effectiveness in improving image reconstruction and texture synthesis, highlighting its ability to match styles at deeper layers and enhance reconstruction and texture synthesis quality.

Overall, this article and the solution it proposes (SWAG) present a simple yet effective remedy for the lack of robustness in stylization algorithms on non-VGG architectures, making lightweight models viable alternatives to VGG for future stylization work. It not only helps researchers and learners better understand the architectural factors influencing neural style transfer, but also paves the way for broader adoption of lightweight models in stylization tasks, reducing reliance on the VGG architecture.

Marugannwg commented 2 months ago

I'd like to share some thriving use cases of the GAN model. Many people in the open-source community are trying to clone the artistic style of visual artists and singers. One task among them is Singing Voice Conversion (SVC): given a song and some samples of another target singer's voice, the goal is to convert the singer's voice in the song to the target singer's voice. Early development started with creating singer embeddings and using GANs.

Link1 and Link2 to papers introducing the recent SVC competition

Link to a workable so-vits-svc

Xtzj2333 commented 2 months ago

“Machine Learning as a Tool for Hypothesis Generation”

This paper uses generative models on defendants' mug shot data to generate hypotheses about judges' jailing decisions. Could this methodology be adapted to analyze and generate hypotheses from judicial audio recordings? Can it be adapted to other forms of high-dimensional data?

HamsterradYC commented 2 months ago

I would like to recommend the article "Multi-Level Neural Scene Graphs for Dynamic Urban Environments" by Tobias Fischer et al., which presents a novel approach for modeling and rendering dynamic urban scenes using neural radiance fields. The authors introduce a multi-level neural scene graph representation, allowing them to capture large-scale environments with varying dynamic objects and environmental conditions. The approach efficiently handles sequences of images captured from moving vehicles and develops a fast composite ray sampling and rendering scheme for training and inference. The paper also introduces a benchmark for evaluating novel view synthesis in dynamic urban settings and demonstrates that the method outperforms previous works in both training speed and view synthesis quality.

The method presented in this paper could extend social science analysis, particularly in areas such as urban studies, environmental sociology, and public health. By modeling urban environments in 3D with dynamic elements, researchers could analyze how environmental factors, such as lighting, traffic, and pollution, affect human behavior and health. This neural radiance field approach can help create realistic simulations for studying the effects of various urban conditions on different populations.

To pilot this approach, we could use geotagged social media data combined with municipal environmental data. Social media platforms such as Flickr and Weibo provide rich datasets of geotagged posts that can be analyzed to understand human activities and sentiments in different urban environments. Municipalities often have detailed environmental data, such as pollution levels, noise, and traffic patterns, which can be overlaid with social media data.

For example, to study how pollution affects urban life, we can collect tweets from various city areas and match them with pollution data from environmental sensors. This data can then be fed into the multi-level neural scene graph to visualize how pollution levels correlate with human activities and sentiments. The resulting visualization can help researchers understand the spatial and temporal patterns of urban life and inform policy decisions to improve urban living conditions.
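
As a rough illustration of that matching step, here is a small pandas sketch; the grid cells, readings, and column names are invented placeholders rather than real data.

```python
import pandas as pd

# Hypothetical geotagged posts and pollution readings, aggregated to grid cell and hour.
posts = pd.DataFrame({
    "post_id": [1, 2, 3],
    "grid_cell": ["A1", "A1", "B2"],
    "hour": pd.to_datetime(["2024-05-01 09:00", "2024-05-01 18:00", "2024-05-01 18:00"]),
    "sentiment": [0.6, -0.2, 0.1],
})
sensors = pd.DataFrame({
    "grid_cell": ["A1", "A1", "B2"],
    "hour": pd.to_datetime(["2024-05-01 09:00", "2024-05-01 18:00", "2024-05-01 18:00"]),
    "pm25": [12.0, 48.0, 35.0],
})

# Match each post to the pollution reading for its grid cell and hour,
# then check how sentiment varies with pollution levels.
merged = posts.merge(sensors, on=["grid_cell", "hour"], how="left")
print(merged.groupby(pd.cut(merged["pm25"], bins=[0, 25, 50]), observed=True)["sentiment"].mean())
```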

Brian-W00 commented 2 months ago

How does the methodology used in the paper "Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States" ensure the accuracy and reliability of demographic predictions based on street view images, and what are the ethical considerations associated with this approach?

kangyic commented 2 months ago

“Machine Learning as a Tool for Hypothesis Generation”

What are some key challenges or considerations in using machine learning for hypothesis generation, particularly when applied to complex datasets like those mentioned in the article (e.g., cell phone data, satellite imagery, online behavior)?

hantaoxiao commented 1 month ago

In the context of the discussions around the innovative use of deep learning in various fields, particularly the papers discussing sound and image learning, an intriguing question arises:

How might deep learning techniques that focus on sound and image transformations be applied to enhance real-time translation systems for both verbal and non-verbal communication in international diplomatic engagements?

beilrz commented 1 month ago

Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale

This paper discusses bias in popular text-to-image models, such as DALL-E or Stable Diffusion. The authors found that these biases are profound and difficult to mitigate despite various kinds of effort. I wonder what future research should be done to address racial or gender bias in these models. Furthermore, how should we develop an empirical framework to measure the bias embedded in these models?
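
One possible starting point for such an empirical measure, sketched with entirely invented numbers and a placeholder demographic classifier, is to compare the demographic distribution of images generated from a neutral prompt against a reference distribution:

```python
from collections import Counter

def stereotype_skew(predicted_groups, reference_share):
    """Compare the demographic distribution of generated images for a neutral prompt
    (e.g. "a photo of a software developer") against a reference distribution such as
    labor-force statistics. Positive values mean the group is over-represented."""
    counts = Counter(predicted_groups)
    total = sum(counts.values())
    return {g: counts.get(g, 0) / total - share for g, share in reference_share.items()}

# Placeholder labels from a demographic classifier run on 100 generated images.
# Note: automated demographic labels are themselves error-prone and ethically fraught,
# so audits typically pair them with human validation.
preds = ["man"] * 83 + ["woman"] * 17
print(stereotype_skew(preds, {"man": 0.78, "woman": 0.22}))  # reference shares are invented
```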

00ikaros commented 1 month ago

What is the proposed systematic procedure for generating novel hypotheses about human behavior using machine learning algorithms, and how does it differ from traditional, informal hypothesis generation? In the context of judge decisions about whom to jail, how does the procedure leverage the capacity of machine learning to identify patterns from high-dimensional data, such as a defendant’s mug shot, to generate hypotheses that are both interpretable and novel? Additionally, how can this approach be generalized to produce hypotheses from other high-dimensional data sets, and what implications does this have for advancing the “prescientific” stage of science?

Carolineyx commented 1 month ago

For this week, I would like to recommend: "Deep Neural Network Decodes Aspects of Stimulus-Intrinsic Memorability Inaccessible to Humans"

Summary:

The article investigates the capability of a deep neural network, ResMem, to predict the memorability of visual stimuli based on intrinsic properties. The study demonstrates that while humans have partial access to the properties that make stimuli memorable, ResMem can decode aspects of memorability that are inaccessible to humans. The research involved three experiments that confirmed ResMem's ability to predict memorability independent of extrinsic factors such as interstimulus similarity. The findings highlight the multifaceted nature of memorability and the potential of using deep neural networks to uncover elements that influence memory encoding success.

Extending Social Science Analysis:

The methodology described in the article can significantly extend social science analysis, particularly in understanding the cognitive and perceptual factors influencing memory. By utilizing a deep neural network like ResMem, researchers can delve into the intrinsic properties of stimuli that affect memory retention and recall. This approach can be applied to various domains within social science, such as educational psychology, where understanding the memorability of educational materials can enhance learning strategies. Additionally, ResMem can be employed to study the impact of visual media on public memory and recall, providing insights into how visual content can shape collective memory and public opinion.

Pilot Use of Social Data:

To pilot the use of ResMem in extending social science analysis, I propose a study focusing on the memorability of public health campaign visuals. The social data required would include:

- Visual content: images and videos used in public health campaigns.
- Demographic information: age, gender, education level, and occupation of the target audience.
- Engagement data: measures of engagement with the campaign materials, such as likes, shares, comments, and view duration.
- Recall data: surveys and tests assessing the recall and retention of campaign messages among the audience.

By inputting this data into ResMem, we can generate predictions on which visual elements of the campaign materials are most memorable. These predictions can help public health officials design more effective campaigns by emphasizing elements that enhance memorability and, consequently, the retention of critical health information.
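
A rough sketch of how the pilot's scoring step might look, with a placeholder standing in for the actual ResMem model and entirely fabricated example data:

```python
import numpy as np
import pandas as pd

def score_memorability(image_path):
    """Placeholder for a ResMem-style memorability predictor. A real pilot would load
    the published ResMem model and return its predicted memorability score (roughly the
    probability the image is remembered in a repeat-detection task)."""
    rng = np.random.default_rng(abs(hash(image_path)) % 2**32)
    return float(rng.uniform(0.4, 0.95))  # stand-in score, not a real prediction

# Fabricated campaign materials with engagement and survey-based recall data.
campaign = pd.DataFrame({
    "image": ["handwashing.png", "mask_usage.png", "vaccine_info.png"],
    "shares": [1200, 540, 3100],
    "recall_rate": [0.42, 0.31, 0.66],
})
campaign["memorability"] = campaign["image"].map(score_memorability)

# Correlate predicted memorability with observed recall to see whether intrinsic
# visual memorability explains retention beyond engagement alone.
print(campaign)
print(campaign[["memorability", "recall_rate", "shares"]].corr())
```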

erikaz1 commented 1 month ago

New possibility reading: "Embodied Voice and AI: a Techno-Social System in Miniature" by Serbanescu et al. (2024) explores the integration of embodied knowledge with AI. The project investigates the augmentation of performative practice through an AI wearable device, which is designed to map bodily movements to synthesized sounds. This multimodal methodological framework involves several major parts: contemporary performance (the application context in this case), wearable technology design, Human Computer Interaction (HCI), and pose-to-sound models. The research team chose a special “post-Grotowskian” method for actor training, which helps participants connect their inner impulses and outer reactions seamlessly.

The AI wearable device is a collar with sensors. It captures body movements and maps them to sound cues using machine learning. This device, designed by Satomi, enables real-time motion capture and sound synthesis and requires continuous recalibration due to changes in the collar sensors. The project focuses on specific physical actions, such as "push," "pull," "caress," and "bounce," training AI models to recognize and react to these movements. While these physical actions are still limited in scope and thus may not be very useful outside the lab or performance-art context, I wonder whether such technology could someday help people increase their sensitivity or awareness of various environmental stimuli, physical actions, or other experiential states. This generalization is inspired by the claim that “The intention is to observe how this engagement with the AI-based system could become conducive to moments of flow, or how it stimulates in-the-moment awareness in the performer” (88), and by a guess at the ways we might learn from these exciting but highly limited and controlled initial use cases.
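
As a thought experiment on how the pose-to-sound mapping might be prototyped, here is a toy sketch; the sensor channels, window size, and action-to-sound mapping are all invented rather than taken from the paper.

```python
import torch
import torch.nn as nn

ACTIONS = ["push", "pull", "caress", "bounce"]

class PoseToSound(nn.Module):
    """Toy pose-to-sound mapper: classify a short window of wearable sensor readings
    into one of the trained actions, which is then mapped to synthesis parameters."""
    def __init__(self, channels=9, hidden=32):
        super().__init__()
        self.encoder = nn.GRU(channels, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, len(ACTIONS))

    def forward(self, window):
        _, h = self.encoder(window)
        return self.classifier(h[-1])

# Hypothetical mapping from recognized action to sound-cue parameters.
SOUND_CUES = {"push": {"pitch": 220, "decay": 0.2}, "pull": {"pitch": 180, "decay": 0.5},
              "caress": {"pitch": 440, "decay": 1.5}, "bounce": {"pitch": 330, "decay": 0.1}}

window = torch.randn(1, 50, 9)   # 50 samples of 9 illustrative IMU channels
action = ACTIONS[PoseToSound()(window).argmax(dim=-1).item()]
print(action, SOUND_CUES[action])
```
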

This interdisciplinary project investigates many concepts that would interest social scientists, such as embodied knowledge integration, materiality, power structures, and interactive storytelling. It can extend social science analysis by providing a framework for studying human-AI interactions in novel ways. The authors’ emphasis on critical self-reflexivity and the "ethics of care" can help social scientists explore the socio-technical dynamics of AI in society by focusing on new insights on communication and the process of co-constructing knowledge. This can lead to a deeper understanding of how AI systems affect and are affected by social contexts and enhance our understanding of techno-social systems.