ClaireLu0608 / Analysis-of-Spotify-Audio-Features-and-Popularity


Unraveling the impact of Audio Features on Spotify Popularity Score

Goals:

A.Data Collection

Source: We used the Spotify for Developers API to gather information on Spotify music. Here is the link to the documentation of Spotify for Developers: https://developer.spotify.com/documentation/web-api.

Process:

Execution method: To execute the code and get the tracks' information and audio features (the data you can use for further analysis), you should first obtain a Spotify API Client ID and Client Secret. Then set them as environment variables, and our code in get_spotify_data.py will fetch your access token. The steps are as follows:

Results you will get: a CSV file named spotify_data.csv.
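The token exchange performed by get_spotify_data.py follows Spotify's Client Credentials flow: the Client ID and Client Secret, read from environment variables, are POSTed to the token endpoint in exchange for a short-lived access token. A minimal stdlib-only sketch (function names and environment-variable names are illustrative, not necessarily those used in the script):

```python
import json
import os
import urllib.parse
import urllib.request

TOKEN_URL = "https://accounts.spotify.com/api/token"

def build_token_request(client_id: str, client_secret: str) -> bytes:
    """Form-encode the Client Credentials grant for Spotify's token endpoint."""
    return urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()

def get_access_token() -> str:
    """Exchange the credentials stored in SPOTIFY_CLIENT_ID /
    SPOTIFY_CLIENT_SECRET for a bearer token."""
    body = build_token_request(os.environ["SPOTIFY_CLIENT_ID"],
                               os.environ["SPOTIFY_CLIENT_SECRET"])
    req = urllib.request.Request(TOKEN_URL, data=body)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]
```

The returned token is then sent as an `Authorization: Bearer <token>` header on every Web API request.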

B.Data Overview

documentation:

  1. id: The Spotify ID for the album
  2. name: The name of the album
  3. release date: The date the album was first released
  4. artists: The artists of the album
  5. duration (ms): The track length in milliseconds
  6. popularity: The popularity of the track. The value will be between 0 and 100.
  7. danceability: How suitable a track is for dancing based on a combination of musical elements.
  8. energy: A measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity.
  9. key: The key the track is in.
  10. loudness: The overall loudness of a track in decibels (dB).
  11. mode: Mode indicates the modality of a track, the type of scale from which its melodic content is derived.
  12. speechiness: Speechiness detects the presence of spoken words in a track.
  13. acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic.
  14. instrumentalness: Predicts whether a track contains no vocals.
  15. liveness: Detects the presence of an audience in the recording.
  16. valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.
  17. tempo: The overall estimated tempo of a track in beats per minute (BPM).

distribution:

python3 distribution.py


  1. The distributions of duration (ms), danceability, energy, loudness, liveness, valence, and tempo most closely follow a normal distribution.
  2. Duration: The distribution of duration shows a somewhat unimodal pattern with most songs having durations between 200,000 and 400,000 milliseconds (or roughly 3 to 6 minutes). The peak occurs around 300,000 ms (5 minutes).
  3. Danceability: Danceability appears to have a normal distribution, centered around 0.6. This suggests that most songs have moderate danceability, with fewer songs on the extreme ends (low or high).
  4. Energy: The energy feature follows a left-skewed distribution, with a large number of songs having energy levels between 0.6 and 0.8 and a longer tail toward lower values. This suggests that many songs have relatively high energy.
  5. Key: The distribution of key shows a somewhat uniform pattern, meaning songs are evenly spread across different keys, though certain keys appear more frequently than others.
  6. Loudness: Loudness has a left-skewed distribution, with most songs clustered between -10 dB and 0 dB, indicating relatively high loudness for the majority of the tracks.
  7. Mode: More songs are in mode 1 (major) than in mode 0 (minor).
  8. Speechiness: Speechiness is heavily right-skewed, meaning most songs have very low speech content. Only a few songs show higher speechiness values.
  9. Acousticness: Most songs have very low acousticness. This indicates that most tracks are electronically produced.
  10. Instrumentalness: Most songs include vocals and are not purely instrumental.
  11. Liveness: Most songs have a low live performance aspect, though a small number of songs feature higher liveness values, possibly indicating live recordings.
  12. Valence: Valence is fairly evenly distributed, with songs spread across the full range of valence from 0 to 1, though there is a slight peak around 0.5, indicating that many songs have a neutral emotional tone.
  13. Tempo: The tempo distribution is multimodal, with several peaks indicating that songs tend to cluster around certain common tempo ranges (such as 60-75 bpm, 120-140 bpm).
  14. Year: The distribution of the release year shows a steep rise from the 1960s onwards, with a noticeable peak around 2020. This suggests that the dataset contains a larger number of recent tracks.
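The skewness claims above can be checked numerically rather than by eye. A small sketch (assuming spotify_data.csv contains the columns documented above; the helper name is ours):

```python
import pandas as pd

def skew_summary(df: pd.DataFrame) -> pd.Series:
    """Sample skewness of each numeric column, sorted ascending.
    Positive values indicate a long right tail (e.g. speechiness),
    negative values a long left tail (e.g. loudness),
    values near zero a roughly symmetric distribution."""
    return df.select_dtypes("number").skew().sort_values()

# Usage (path is illustrative):
# print(skew_summary(pd.read_csv("spotify_data.csv")))
```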

C.Correlation Analysis

Correlation:

  1. loudness and energy: strong positive correlation (0.70)
  2. acousticness and energy: strong negative correlation (-0.56)
  3. valence and danceability: moderate positive correlation (0.43)
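A sketch of how such pairs can be ranked with pandas (the helper name is ours; the coefficients quoted above come from our dataset):

```python
import numpy as np
import pandas as pd

def top_correlations(df: pd.DataFrame, n: int = 3) -> pd.Series:
    """Pearson correlation for every feature pair, ranked by absolute value."""
    corr = df.select_dtypes("number").corr()
    # Keep each pair once: upper triangle, excluding the diagonal.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    flat = corr.where(mask).stack()
    return flat.loc[flat.abs().sort_values(ascending=False).index].head(n)
```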

| Feature | VIF |
| --- | --- |
| duration (ms) | 13.100347 |
| danceability | 19.692102 |
| energy | 22.023305 |
| key | 3.173974 |
| loudness | 9.762100 |
| mode | 2.279409 |
| speechiness | 2.218682 |
| acousticness | 2.702970 |
| instrumentalness | 1.095804 |
| liveness | 2.869287 |
| valence | 7.627239 |
| tempo | 17.631970 |

D.Models and Result

Variables: From the previous correlation results, we are left with duration_ms, speechiness, acousticness, instrumentalness, danceability, liveness, loudness, tempo, key, and mode as features, and popularity as our y label. "Mode" is the only binary variable; all other variables are continuous.

Data Cleaning: For model interpretability and to avoid overfitting, we excluded data points with a popularity score below 5. We applied three different data transformation methods: log transformation for skewed features, standard scaling for features that followed a normal distribution, and min-max scaling for features without an obvious distribution pattern. The cleaned version is written to a CSV file in the artifacts folder by running the following command:

python3 code/cleaning/data_cleaning.py
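The three transformations can be sketched as follows. The column-to-transformation assignment below is illustrative; the authoritative grouping lives in code/cleaning/data_cleaning.py:

```python
import numpy as np
import pandas as pd

# Illustrative grouping, not the script's exact assignment.
LOG_COLS = ["speechiness", "acousticness", "liveness"]   # right-skewed
STD_COLS = ["danceability", "loudness", "tempo"]         # roughly normal
MINMAX_COLS = ["duration (ms)", "valence", "key"]        # no clear shape

def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df[df["popularity"] >= 5].copy()       # drop near-zero popularity
    for c in LOG_COLS:
        out[c] = np.log1p(out[c])                # log(1 + x) handles zeros
    for c in STD_COLS:
        out[c] = (out[c] - out[c].mean()) / out[c].std()
    for c in MINMAX_COLS:
        out[c] = (out[c] - out[c].min()) / (out[c].max() - out[c].min())
    return out
```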

Models: To explore which features are more relevant and influential to the popularity score, we used three models for feature selection.

Results: We concluded that mode, tempo, instrumentalness, and acousticness are the least influential features for the popularity score, since at least two of the three models excluded them from the subset of important features.

Execution method:

python3 code/models/random_forest_feature_importance.py
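One of the three models is a random forest; its impurity-based importances can be extracted as sketched below (assuming scikit-learn; the helper name and hyperparameters are ours, not necessarily the script's):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rank_features(X: pd.DataFrame, y: pd.Series, seed: int = 0) -> pd.Series:
    """Fit a random forest regressor on popularity and rank the audio
    features by impurity-based importance, highest first."""
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(X, y)
    return pd.Series(model.feature_importances_,
                     index=X.columns).sort_values(ascending=False)
```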

E.Case Study

Top Songs vs. Regular Songs feature analysis

Objective: To examine how the audio features of the top 10% most popular songs (ranked by Spotify popularity score) differ from those of other tracks.
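A sketch of this comparison, splitting at the 90th popularity percentile (helper name and exact cutoff handling are ours):

```python
import pandas as pd

def top_vs_rest(df: pd.DataFrame, frac: float = 0.10) -> pd.DataFrame:
    """Split tracks at the (1 - frac) popularity quantile and compare
    mean audio-feature values between the two groups."""
    cutoff = df["popularity"].quantile(1 - frac)
    top = df[df["popularity"] >= cutoff]
    rest = df[df["popularity"] < cutoff]
    means = pd.DataFrame({"top": top.mean(numeric_only=True),
                          "rest": rest.mean(numeric_only=True)})
    means["diff"] = means["top"] - means["rest"]
    return means.drop(index="popularity")
```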

Key Insights:

Result:

Trends in Audio Features Over Time (1980-2024)

Objective: To identify patterns and trends in the evolution of audio features in songs over time, starting from 1980.
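A sketch of the underlying aggregation, assuming the "release date" column documented above (helper name is ours):

```python
import pandas as pd

def yearly_feature_means(df: pd.DataFrame, features) -> pd.DataFrame:
    """Average each audio feature per release year, restricted to 1980-2024.
    Rows whose release date cannot be parsed are dropped."""
    years = pd.to_datetime(df["release date"], errors="coerce").dt.year
    out = df.assign(year=years).query("1980 <= year <= 2024")
    return out.groupby("year")[list(features)].mean()
```

Plotting each column of the returned frame against its index then gives the per-feature trend lines.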

Key Insights:

Result:

F.Reproducibility

The "popularity" feature in our dataset is calculated by Spotify using their own algorithm. This algorithm factors in the total number of plays a track has received and how recent those plays are. However, we don't know the exact details of the algorithm, and we cannot access the actual play count or the number of people who have saved a track through the API. Additionally, some audio features, like danceability and energy, are scores that Spotify assigns to each track. We cannot determine the exact criteria for these scores or the detailed steps used to calculate them.

Model Limitations:

Even with feature selection, models are unable to perfectly predict popularity scores based solely on audio features. One of the main reasons is that Spotify calculates these scores using signals that are not publicly available, making it challenging to build accurate machine learning models. This limitation also introduces bias into feature selection and leads to some variability in the results, as the models may not capture enough patterns from audio features alone.

Feature Selection Bias:

The analysis focuses on a predefined set of audio features (e.g., danceability, tempo, loudness), which are curated by Spotify's algorithms. The accuracy and interpretation of these features might not be consistent across all music genres or eras. Some nuances, such as musical innovations or genre crossovers, might not be captured by these specific audio features.

Noise and Variability Over Time:

The temporal analysis shows a lot of variance in certain features like tempo and liveness. This noise could be due to random fluctuations in the dataset or genre shifts that were not fully addressed in the analysis. A more refined analysis could control for genre or other contextual factors to reduce noise and improve the reliability of the findings.

G.Further Improvements

Data Collection:

Some methods can be developed to obtain the actual play counts for each track and how many people have saved them. Additionally, we can retrieve the genre tags for each track and analyze which features influence the popularity of tracks within specific genres, making the analysis more targeted.

Genre-Specific Analysis:

A genre-based comparison could shed light on whether the trends observed in the top 10% songs are genre-specific or universal. Some audio features, like instrumentalness or tempo, may have different influences depending on whether the song is pop, rock, hip-hop, electronic, etc. Splitting the data by genre and conducting genre-specific case studies could lead to more accurate insights.

Prediction and Coefficient Estimation:

If additional features, such as the total number of plays per track and recent plays, can be obtained through the Developer API or web scraping, a prediction task can be performed to develop machine learning models for predicting popularity scores. Alternatively, if all confounding variables (i.e., all covariates that influence the popularity score) are presented, unbiased estimation of the coefficients using causal inference methods can be applied to assess the actual impact of each feature on the popularity score.