alan-turing-institute / TuringDataStories

TuringDataStories: An open community creating “Data Stories”: A mix of open data, code, narrative 💬, visuals 📊📈 and knowledge 🧠 to help understand the world around us.

[Mini Turing Data Story] Spotify - from microgenre to metagenre #161

Open billfinnegan opened 3 years ago

billfinnegan commented 3 years ago

Story description

Please provide a high-level description of the Turing Data Story: a clear and concise description of what the data story is going to be about.

We are using the Spotify API for the Desert Island Discs (DID) story, and one thing that came up was how to best handle music genres. Artists can have a list of genres associated with them, some of which are really obscure (see http://everynoise.com/). When analysing the DID data based on genre, we only want to use a small number of commonly understood genres, but we also want to make sure the micro-genres are accurately rolled into meta-genres. Given a dataset of music with associated genres, how can we best create (and assign) the meta-genres?
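For illustration, one naive baseline is to match a short list of broad genre names as substrings of the Spotify micro-genre labels, so that "baroque pop" and "dance pop" both roll into "pop". The sketch below uses made-up artists and genre lists rather than the DID data, and all the names are placeholders; substring matching also misfires on cases like the K-pop example discussed further down.

# Naive baseline (illustrative data, not the DID dataset): roll micro-genres
# into metagenres by substring matching against a fixed list of broad genres.
artist_genres <- list(
  "Artist A" = c("baroque pop", "chamber pop"),
  "Artist B" = c("german romanticism", "classical era"),
  "Artist C" = c("album rock", "classic rock", "mellow gold")
)

metagenres <- c("pop", "rock", "classical", "jazz", "folk", "rap")

assign_metagenres <- function(genres, metas = metagenres) {
  # a micro-genre rolls into every metagenre whose name appears inside it
  hits <- metas[sapply(metas, function(m) any(grepl(m, genres, fixed = TRUE)))]
  if (length(hits) == 0) "other" else hits
}

lapply(artist_genres, assign_metagenres)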

This is potentially a mini story that could be connected to the DID story, and that we could tackle in some form at the SeptembRSE workshop.

Which datasets will you be using in this Turing Data Story? Cite the datasets that are going to be used in this story (these can of course change whilst developing the story).

The starting point for this story is the CSV of tracks that appear in the 3000+ episodes of Desert Island Discs.

Additional context: Add any other context or screenshots about the story here.

I reached out to the person behind http://everynoise.com/ and he offered the following guidance:

Yes, I think this case is a good example of why I don't publish a metagenre hierarchy. In any particular dataset, how things "map" has as much to do with the particular context as their abstract characteristics. E.g., K-pop "is pop", but if in your dataset nobody ever mentioned a Korean band until two years ago, and now every other list has a BTS album in it, it would probably be more interesting to call out K-pop separately.

Here's the technique I usually use in cases like this for clustering artists without giving up genre specificity:

This usually does pretty well, because the more-general a genre (like "pop" or "rock"), the more artists it will have, so you end up with a list that starts with broad genres and moves to more-specific ones.

The advanced-mode addition to this algorithm is, in each loop, once you've got the top genre, go through and find any other genres for which at least N% of their remaining artists are also in the top genre, and roll those genres and artists into this cluster. So, e.g., if the top genre for a loop is "pop", with 100 artists, and you also have 60 "dance pop" artists of which 40 are in both "pop" and "dance pop" and 20 are unique to "dance pop", then the "pop" cluster becomes a "pop"/"dance pop" cluster with all 120, and you drop everyone with "pop" or "dance pop" for the next loop. You can play with the N to get more-aggressive or less-aggressive consolidation.

And of course you can supplement this approach by adding individual manual remappings if you want. But you may not need to. You don't need to force every album into a more-popular category in order to see trends, after all...
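Reading that description, a rough sketch of the loop might look like the following in R (this is an interpretation, not the code used at everynoise.com). It assumes a named list mapping each artist to a character vector of their Spotify genres; cluster_genres, n_pct and max_clusters are illustrative names, with N expressed as a fraction.

# Sketch of the greedy clustering described above: repeatedly take the genre
# with the most remaining artists, fold in any other genre whose remaining
# artists overlap with it by at least n_pct, then drop those artists and repeat.
# artist_genres: named list mapping each artist to a character vector of genres.
cluster_genres <- function(artist_genres, n_pct = 0.5, max_clusters = 20) {
  remaining <- artist_genres
  clusters <- list()

  while (length(remaining) > 0 && length(clusters) < max_clusters) {
    # count remaining artists per genre and pick the most common genre
    genre_counts <- table(unlist(remaining))
    if (length(genre_counts) == 0) break
    top_genre <- names(which.max(genre_counts))
    in_top <- names(remaining)[sapply(remaining, function(gs) top_genre %in% gs)]

    # roll in genres whose remaining artists mostly also carry the top genre
    genres_in_cluster <- top_genre
    for (g in setdiff(names(genre_counts), top_genre)) {
      in_g <- names(remaining)[sapply(remaining, function(gs) g %in% gs)]
      if (length(intersect(in_g, in_top)) / length(in_g) >= n_pct) {
        genres_in_cluster <- c(genres_in_cluster, g)
      }
    }

    # the cluster takes every remaining artist with any of those genres,
    # e.g. the 100 "pop" plus 20 dance-pop-only artists in the example above
    members <- names(remaining)[sapply(remaining, function(gs)
      any(genres_in_cluster %in% gs))]
    clusters[[top_genre]] <- list(genres = genres_in_cluster, artists = members)

    # drop clustered artists before the next loop
    remaining <- remaining[setdiff(names(remaining), members)]
  }

  clusters
}

Lowering n_pct consolidates more aggressively; whatever is left after the broad clusters form can stay as-is or be remapped by hand, as suggested above.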

Ethical guideline

Ideally a Turing Data Story has these properties and follows the Five Safes framework.

Current status

Updates

drhowey commented 3 years ago

Here are a few plots I have done: [plots attached]

drhowey commented 3 years ago

# read the Desert Island Discs Spotify tracks data and sort by episode date
tracks0<-read.csv("spotify_tracks_data.csv")
tracks<-tracks0[order(tracks0$date),]

# number of days between a given date and the date of the first episode
getDay<-function(aDate) {
  day0<-as.numeric(as.POSIXct(tracks$date[1], format="%Y-%m-%d"))
  dayX<-as.numeric(as.POSIXct(aDate, format="%Y-%m-%d"))
  dayX-day0
}

# parse a date string into a POSIXct value used later for the axis labels
getDisplayDay<-function(aDate) {
  as.POSIXct(aDate, format="%Y-%m-%d")
}

# plot the fraction of tracks matching `genre`, averaged over consecutive
# blocks of `totalEpisodes` episodes, with a fitted linear trend line
doAnal<-function(genre, totalEpisodes=5, doArtist=FALSE) {

  # 1 if the track matches the genre string (in the artist name or in the
  # artist's genre list), 0 otherwise
  if(doArtist) {
    hasGenre<-grepl(genre, tracks$artist)*1
  } else {
    hasGenre<-grepl(genre, tracks$genres_artist)*1
  }

  newData<-c()
  days<-c() #days past first episode
  displayDates<-c()
  episodeNo<-1
  aSum<-0
  prevEpisodeRef<-tracks$episode_ref[1]
  totalTracksInEpisodes<-0

  for(i in 1:(length(hasGenre)+1)) {
    # a new episode starts (or we have run past the last track)
    if(i == (length(hasGenre)+1) || tracks$episode_ref[i] != prevEpisodeRef) {
      episodeNo<-episodeNo+1
      prevEpisodeRef<-tracks$episode_ref[i]

      if(i == (length(hasGenre)+1) || (episodeNo-1) == totalEpisodes) {
        #record new data
        newData<-append(newData, aSum/totalTracksInEpisodes)
        days<-append(days, getDay(tracks$date[i-1]))
        displayDates<-append(displayDates, getDisplayDay(tracks$date[i-1]))

        #reset variables
        totalTracksInEpisodes<-0
        aSum<-0
        episodeNo<-1
      }
    }

    totalTracksInEpisodes<-totalTracksInEpisodes + 1
    aSum<-aSum + hasGenre[i]
  }

  # fit a linear trend and plot the per-block genre fractions over time
  res<-summary(lm(newData~days))
  pval<-res$coefficients[2,4]
  plot(days, newData,
       ylab=paste0("% of tracks (in ",totalEpisodes," episodes)"),
       xlab="date",
       main=paste0(genre, " (p value = ",signif(pval,3),")"),
       xaxt = "n", ylim=c(0,1))
  axis(1, labels=format(displayDates, "%m/%Y"), at=days)
  abline(a=res$coefficients[1,1], b=res$coefficients[2,1], col="red")
}

totalEpisodes<-10

# fraction of "classical" tracks over time, in blocks of 10 episodes
doAnal("classical", totalEpisodes)

# effect of the block size on the classical trend
par(mfrow=c(3,3))
for(i in 1:9) doAnal("classical", i)

# trends for a handful of broad genres
par(mfrow=c(2,3))
doAnal("classical", totalEpisodes)
doAnal("rock", totalEpisodes)
doAnal("jazz", totalEpisodes)
doAnal("disco", totalEpisodes)
doAnal("folk", totalEpisodes)
doAnal("rap", totalEpisodes)