Goals:
Source: We used the Spotify for Developers API to gather information on Spotify music. Here is the link to the Spotify for Developers documentation: https://developer.spotify.com/documentation/web-api.
Process:
Search for Playlists:
We utilized Spotify's Search API (https://developer.spotify.com/documentation/web-api/reference/search)
to search for playlists containing English songs. By setting the query to "English songs,"
we ensured that the data would be diverse and representative of different music genres.
We retrieved 50 playlists in total.
This step is handled in code/data/get_playlists.py.
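The request pattern for this step looks roughly like the sketch below (a hypothetical helper using the requests library, not necessarily the exact code in get_playlists.py; a valid access token is assumed):

```python
import requests

SEARCH_URL = "https://api.spotify.com/v1/search"

def search_playlists(access_token, query="English songs", limit=50):
    """Search the Spotify catalog and return up to `limit` matching playlists."""
    headers = {"Authorization": f"Bearer {access_token}"}
    params = {"q": query, "type": "playlist", "limit": limit}
    resp = requests.get(SEARCH_URL, headers=headers, params=params)
    resp.raise_for_status()
    return resp.json()["playlists"]["items"]
```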
Retrieve Playlist Items:
Using the Get Playlist Items API (https://developer.spotify.com/documentation/web-api/reference/get-playlists-tracks),
we extracted all the songs from these 50 playlists. This allowed us to gather basic information for each song,
such as ID, name, release date, artists, popularity, and more. This process resulted in an initial dataset containing 5,668 tracks.
This step is handled in code/data/get_tracks.py.
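A minimal sketch of how the playlist items can be paged through (hypothetical helper; get_tracks.py may organize this differently):

```python
import requests

def get_playlist_tracks(access_token, playlist_id):
    """Collect basic track info (id, name, release date, artists, popularity)."""
    headers = {"Authorization": f"Bearer {access_token}"}
    url = f"https://api.spotify.com/v1/playlists/{playlist_id}/tracks"
    params = {"limit": 100}
    tracks = []
    while url:
        resp = requests.get(url, headers=headers, params=params)
        resp.raise_for_status()
        data = resp.json()
        for item in data["items"]:
            track = item.get("track")
            if track is None:  # skip unavailable or local tracks
                continue
            tracks.append({
                "id": track["id"],
                "name": track["name"],
                "release_date": track["album"]["release_date"],
                "artists": [a["name"] for a in track["artists"]],
                "popularity": track["popularity"],
            })
        url = data.get("next")  # the "next" URL already embeds paging parameters
        params = None
    return tracks
```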
Extract Audio Features:
With the track IDs obtained from the previous step, we then used the Get Track's Audio Features API (https://developer.spotify.com/documentation/web-api/reference/get-audio-features)
to retrieve detailed audio features for each track. These features included attributes such as loudness, energy, danceability, and more,
providing a deeper understanding of the songs' characteristics. After dropping duplicates and NA values,
the dataset contains 4,929 tracks. We saved the data into a CSV file named spotify_data.csv.
This step is handled in code/data/get_spotify_data.py.
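The audio-features endpoint accepts up to 100 track IDs per call, so the requests are batched; a minimal sketch (hypothetical helper, see get_spotify_data.py for the actual implementation):

```python
import requests

def get_audio_features(access_token, track_ids):
    """Fetch loudness, energy, danceability, etc. for a list of track IDs."""
    headers = {"Authorization": f"Bearer {access_token}"}
    features = []
    for i in range(0, len(track_ids), 100):  # batch in groups of 100
        batch = track_ids[i:i + 100]
        resp = requests.get(
            "https://api.spotify.com/v1/audio-features",
            headers=headers,
            params={"ids": ",".join(batch)},
        )
        resp.raise_for_status()
        # Entries can be None for tracks without features; these are dropped
        # later together with duplicates and NA rows before saving the CSV.
        features.extend(resp.json()["audio_features"])
    return features
```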
Execution method: To execute the code and obtain the tracks' information and audio features (the data you can use for further analysis), you first need a Spotify API Client ID and Client Secret. Follow the steps below to obtain them and set them as environment variables; our code in get_spotify_data.py will then fetch your access token. The steps are as follows:
git clone git@github.com:ClaireLu0608/eco395m_midterm_project.git
cd eco395m_midterm_project
pip install -r requirements.txt
cd code
cd data
Create a .env file that stores your Client ID and Client Secret as environment variables. Then you can run get_spotify_data.py to produce your own data:
python3 get_spotify_data.py
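For reference, the token retrieval likely follows the standard Client Credentials flow; a minimal sketch is below (the environment variable names are assumptions, so check get_spotify_data.py for the exact names it expects):

```python
import os
import requests
from dotenv import load_dotenv  # assumes python-dotenv is listed in requirements.txt

load_dotenv()  # read the .env file in the current directory

def get_access_token():
    """Exchange the Client ID/Secret for a short-lived access token."""
    resp = requests.post(
        "https://accounts.spotify.com/api/token",
        data={
            "grant_type": "client_credentials",
            "client_id": os.environ["SPOTIFY_CLIENT_ID"],          # assumed variable name
            "client_secret": os.environ["SPOTIFY_CLIENT_SECRET"],  # assumed variable name
        },
    )
    resp.raise_for_status()
    return resp.json()["access_token"]
```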
Results you will get:
A CSV file named spotify_data.csv, available here.
Documentation:
Distribution:
python3 distribution.py
Correlation:
Determine the correlation coefficients between each pair of music features.
Since some of the features do not follow a normal distribution, Spearman correlation is more suitable for the correlation analysis.
python3 spearman_correlation.py
loudness and energy: correlation coefficient is high and positive (0.7)
acousticness and energy: correlation coefficient is high and negative (-0.56)
valence and danceability: correlation coefficient is relatively high (0.43)
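A minimal sketch of how such a Spearman correlation matrix can be computed with pandas (column names assumed to match spotify_data.csv; spearman_correlation.py is the project's version):

```python
import pandas as pd

df = pd.read_csv("spotify_data.csv")
features = ["loudness", "energy", "acousticness", "valence", "danceability"]
corr = df[features].corr(method="spearman")  # rank-based, robust to non-normal features
print(corr.round(2))
```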
Check for Multicollinearity
Variance Inflation Factor (VIF): A common measure used to quantify how much the variance of the estimated regression coefficients is increased due to multicollinearity.
| Feature | VIF |
|---|---|
| duration_ms | 13.100347 |
| danceability | 19.692102 |
| energy | 22.023305 |
| key | 3.173974 |
| loudness | 9.762100 |
| mode | 2.279409 |
| speechiness | 2.218682 |
| acousticness | 2.702970 |
| instrumentalness | 1.095804 |
| liveness | 2.869287 |
| valence | 7.627239 |
| tempo | 17.631970 |
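A sketch of how such VIF values can be computed with statsmodels (column names assumed to match spotify_data.csv; whether an intercept was included in the project's calculation may differ):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.read_csv("spotify_data.csv")
features = ["duration_ms", "danceability", "energy", "key", "loudness", "mode",
            "speechiness", "acousticness", "instrumentalness", "liveness",
            "valence", "tempo"]
X = add_constant(df[features].dropna())
vif = pd.DataFrame({
    "Feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif[vif["Feature"] != "const"])  # drop the intercept row before reporting
```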
Variables: From the previous correlation results, we are left with duration_ms, speechiness, acousticness, instrumentalness, danceability, liveness, loudness, tempo, key, and mode as features, and popularity as our y label. "Mode" is the only binary variable; all other variables are continuous.
Data Cleaning: For model interpretation and to avoid overfitting, we excluded data points with a popularity score below 5. We applied three different data transformation methods: Log Transformation for skewed features, Standard Scaling for features that followed a normal distribution, and Min-Max Scaling for features without an obvious distribution pattern. The cleaned version is saved as a CSV file here in the artifacts folder, produced by running the following command:
python3 code/cleaning/data_cleaning.py
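For illustration, the three transformation types could look like the sketch below; which feature receives which transform is an assumption here, and data_cleaning.py holds the actual assignment:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.read_csv("spotify_data.csv")
df = df[df["popularity"] >= 5].copy()  # drop points with popularity below 5

df["speechiness"] = np.log1p(df["speechiness"])                      # log transform: skewed feature
df[["loudness"]] = StandardScaler().fit_transform(df[["loudness"]])  # standard scaling: ~normal feature
df[["tempo"]] = MinMaxScaler().fit_transform(df[["tempo"]])          # min-max scaling: no clear pattern
```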
Models: To explore which features are more relevant and influential to the popularity score, we used three models for feature selection.
Results: We concluded that Mode, Tempo, instrumentalness, and acousticness are the least influential features for the popularity score, since at least two of the three models excluded them from the subset of important features.
Execution method:
python3 code/models/random_forest_feature_importance.py
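A minimal sketch of the random-forest importance step, one of the three selection approaches (the cleaned-file path and hyperparameters are assumptions; random_forest_feature_importance.py is the authoritative version):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("artifacts/cleaned_data.csv")  # hypothetical path to the cleaned CSV
features = ["duration_ms", "speechiness", "acousticness", "instrumentalness",
            "danceability", "liveness", "loudness", "tempo", "key", "mode"]
X, y = df[features], df["popularity"]

model = RandomForestRegressor(n_estimators=300, random_state=42).fit(X, y)
importance = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
print(importance)  # higher values = more influential for predicting popularity
```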
Top Songs vs. Regular Songs feature analysis
Objective: To examine how the audio features of the top 10% most popular songs (ranked by popularity score) differ from those of the remaining tracks.
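One way this comparison can be sketched, assuming the top 10% is defined by a popularity-score cutoff and the column names match spotify_data.csv:

```python
import pandas as pd

df = pd.read_csv("spotify_data.csv")
cutoff = df["popularity"].quantile(0.90)  # threshold for the top 10% most popular tracks
df["group"] = (df["popularity"] >= cutoff).map({True: "top 10%", False: "other"})

features = ["danceability", "energy", "loudness", "acousticness", "valence", "tempo"]
print(df.groupby("group")[features].mean().round(3))  # compare average feature values
```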
Key Insights:
Result:
Trends in Audio Features Over Time (1980-2024)
Objective: To identify patterns and trends in the evolution of audio features in songs over time, starting from 1980.
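A minimal sketch of the yearly aggregation behind this analysis (assumes release_date strings begin with the year and the column names match spotify_data.csv):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("spotify_data.csv")
df["year"] = pd.to_numeric(df["release_date"].str[:4], errors="coerce")
df = df[df["year"].between(1980, 2024)]

features = ["energy", "danceability", "loudness", "tempo", "liveness"]
yearly = df.groupby("year")[features].mean()
yearly.plot(subplots=True, figsize=(8, 10), title="Average audio features by release year")
plt.tight_layout()
plt.show()
```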
Key Insights:
Result:
Data Limitations:
The "popularity" feature in our dataset is calculated by Spotify using their own algorithm. This algorithm have factors in the total number of plays a track has received and how recent those plays are. However, we don't know the exact details of the algorithm, and we cannot access the actual play count or the number of people who have saved a track through the API. Additionally, some Audio Features, like Danceability and Energy, are actually scores that Spotify assigns to each track. We cannot determine the exact criteria for these scores or the detailed steps used to calculate them.
Model Limitations:
Beyond feature selection, the models are unable to accurately predict popularity scores based solely on audio features. One of the main reasons is that Spotify calculates these scores using factors that are not publicly available, making it challenging to build accurate machine learning models. This limitation also introduces bias into the feature selection and leads to some variability in the results, as the models may not capture enough patterns from audio features alone.
Feature Selection Bias:
The analysis focuses on a predefined set of audio features (e.g., danceability, tempo, loudness), which are curated by Spotify's algorithms. The accuracy and interpretation of these features might not be consistent across all music genres or eras. Some nuances, such as musical innovations or genre crossovers, might not be captured by these specific audio features.
Noise and Variability Over Time:
The temporal analysis shows a lot of variance in certain features like tempo and liveness. This noise could be due to random fluctuations in the dataset or genre shifts that were not fully addressed in the analysis. A more refined analysis could control for genre or other contextual factors to reduce noise and improve the reliability of the findings.
Data Collection:
Some methods can be developed to obtain the actual play counts for each track and how many people have saved them. Additionally, we can retrieve the genre tags for each track and analyze which features influence the popularity of tracks within specific genres, making the analysis more targeted.
Genre-Specific Analysis:
A genre-based comparison could shed light on whether the trends observed in the top 10% songs are genre-specific or universal. Some audio features, like instrumentalness or tempo, may have different influences depending on whether the song is pop, rock, hip-hop, electronic, etc. Splitting the data by genre and conducting genre-specific case studies could lead to more accurate insights.
Prediction and Coefficient Estimation:
If additional features, such as the total number of plays per track and recent plays, can be obtained through the Developer API or web scraping, a prediction task can be performed to develop machine learning models that predict popularity scores. Alternatively, if all confounding variables (i.e., all covariates that influence the popularity score) are observed, causal inference methods can be applied to obtain unbiased coefficient estimates and assess the actual impact of each feature on the popularity score.