alan-turing-institute / TuringDataStories

TuringDataStories: An open community creating “Data Stories”: A mix of open data, code, narrative 💬, visuals 📊📈 and knowledge 🧠 to help understand the world around us.

[Mini Turing Data Story] Spotify - from microgenre to metagenre #161

Open billfinnegan opened 3 years ago

billfinnegan commented 3 years ago

Story description

Please provide a high-level description of the Turing Data Story: a clear and concise description of what the data story is going to be about.

We are using the Spotify API for the Desert Island Discs (DID) story, and one thing that came up was how to best handle music genres. Artists can have a list of genres associated with them, some of which are really obscure (see http://everynoise.com/). When analysing the DID data based on genre, we only want to use a small number of commonly understood genres, but we also want to make sure the micro-genres are accurately rolled into meta-genres. Given a dataset of music with associated genres, how can we best create (and assign) the meta-genres?
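For illustration, one naive baseline is to match a short list of broad genre names as substrings of the Spotify micro-genre labels, so that "baroque pop" and "dance pop" both roll into "pop". The sketch below uses made-up artists and genre lists rather than the DID data, and all the names are placeholders; substring matching also misfires on cases like the K-pop example discussed further down.

# Naive baseline (illustrative data, not the DID dataset): roll micro-genres
# into metagenres by substring matching against a fixed list of broad genres.
artist_genres <- list(
  "Artist A" = c("baroque pop", "chamber pop"),
  "Artist B" = c("german romanticism", "classical era"),
  "Artist C" = c("album rock", "classic rock", "mellow gold")
)

metagenres <- c("pop", "rock", "classical", "jazz", "folk", "rap")

assign_metagenres <- function(genres, metas = metagenres) {
  # a micro-genre rolls into every metagenre whose name appears inside it
  hits <- metas[sapply(metas, function(m) any(grepl(m, genres, fixed = TRUE)))]
  if (length(hits) == 0) "other" else hits
}

lapply(artist_genres, assign_metagenres)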

This is potentially a mini story that could be connected to the DID story, and that we could tackle in some form at the SeptembRSE workshop.

Which datasets will you be using in this Turing Data Story? Cite the datasets that are going to be used in this story (these can of course change whilst developing the story).

The starting point for this story is the CSV of tracks that appear in the 3000+ episodes of Desert Island Discs.

Additional context: Add any other context or screenshots about the story here.

I reached out to the person behind http://everynoise.com/ and he offered the following guidance:

Yes, I think this case is a good example of why I don't publish a metagenre hierarchy. In any particular dataset, how things "map" has as much to do with the particular context as their abstract characteristics. E.g., K-pop "is pop", but if in your dataset nobody ever mentioned a Korean band until two years ago, and now every other list has a BTS album in it, it would probably be more interesting to call out K-pop separately.

Here's the technique I usually use in cases like this for clustering artists without giving up genre specificity:

This usually does pretty well, because the more-general a genre (like "pop" or "rock"), the more artists it will have, so you end up with a list that starts with broad genres and moves to more-specific ones.

The advanced-mode addition to this algorithm is, in each loop, once you've got the top genre, go through and find any other genres for which at least N% of their remaining artists are also in the top genre, and roll those genres and artists into this cluster. So, e.g., if the top genre for a loop is "pop", with 100 artists, and you also have 60 "dance pop" artists of which 40 are in both "pop" and "dance pop" and 20 are unique to "dance pop", then the "pop" cluster becomes a "pop"/"dance pop" cluster with all 120, and you drop everyone with "pop" or "dance pop" for the next loop. You can play with the N to get more-aggressive or less-aggressive consolidation.

And of course you can supplement this approach by adding individual manual remappings if you want. But you may not need to. You don't need to force every album into a more-popular category in order to see trends, after all...
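Reading that description, a rough sketch of the loop might look like the following in R (this is an interpretation, not the code used at everynoise.com). It assumes a named list mapping each artist to a character vector of their Spotify genres; cluster_genres, n_pct and max_clusters are illustrative names, with N expressed as a fraction.

# Sketch of the greedy clustering described above: repeatedly take the genre
# with the most remaining artists, fold in any other genre whose remaining
# artists overlap with it by at least n_pct, then drop those artists and repeat.
# artist_genres: named list mapping each artist to a character vector of genres.
cluster_genres <- function(artist_genres, n_pct = 0.5, max_clusters = 20) {
  remaining <- artist_genres
  clusters <- list()

  while (length(remaining) > 0 && length(clusters) < max_clusters) {
    # count remaining artists per genre and pick the most common genre
    genre_counts <- table(unlist(remaining))
    if (length(genre_counts) == 0) break
    top_genre <- names(which.max(genre_counts))
    in_top <- names(remaining)[sapply(remaining, function(gs) top_genre %in% gs)]

    # roll in genres whose remaining artists mostly also carry the top genre
    genres_in_cluster <- top_genre
    for (g in setdiff(names(genre_counts), top_genre)) {
      in_g <- names(remaining)[sapply(remaining, function(gs) g %in% gs)]
      if (length(intersect(in_g, in_top)) / length(in_g) >= n_pct) {
        genres_in_cluster <- c(genres_in_cluster, g)
      }
    }

    # the cluster takes every remaining artist with any of those genres,
    # e.g. the 100 "pop" plus 20 dance-pop-only artists in the example above
    members <- names(remaining)[sapply(remaining, function(gs)
      any(genres_in_cluster %in% gs))]
    clusters[[top_genre]] <- list(genres = genres_in_cluster, artists = members)

    # drop clustered artists before the next loop
    remaining <- remaining[setdiff(names(remaining), members)]
  }

  clusters
}

Lowering n_pct consolidates more aggressively; whatever is left after the broad clusters form can stay as-is or be remapped by hand, as suggested above.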

Ethical guideline

Ideally a Turing Data Story has these properties and follows the Five Safes framework.

Current status

Updates

drhowey commented 3 years ago

Here are a few plots I have done: [plots attached]

drhowey commented 3 years ago

# read the Desert Island Discs Spotify tracks data and sort by episode date
tracks0<-read.csv("spotify_tracks_data.csv")
tracks<-tracks0[order(tracks0$date),]

# number of days between a given date and the date of the first episode
getDay<-function(aDate) {
  day0<-as.numeric(as.POSIXct(tracks$date[1], format="%Y-%m-%d"))
  dayX<-as.numeric(as.POSIXct(aDate, format="%Y-%m-%d"))
  dayX-day0
}

# parse a date string into a POSIXct value used later for the axis labels
getDisplayDay<-function(aDate) {
  as.POSIXct(aDate, format="%Y-%m-%d")
}

# plot the fraction of tracks matching `genre`, averaged over consecutive
# blocks of `totalEpisodes` episodes, with a fitted linear trend line
doAnal<-function(genre, totalEpisodes=5, doArtist=FALSE) {

  # 1 if the track matches the genre string (in the artist name or in the
  # artist's genre list), 0 otherwise
  if(doArtist) {
    hasGenre<-grepl(genre, tracks$artist)*1
  } else {
    hasGenre<-grepl(genre, tracks$genres_artist)*1
  }

  newData<-c()
  days<-c() #days past first episode
  displayDates<-c()
  episodeNo<-1
  aSum<-0
  prevEpisodeRef<-tracks$episode_ref[1]
  totalTracksInEpisodes<-0

  for(i in 1:(length(hasGenre)+1)) {
    # a new episode starts (or we have run past the last track)
    if(i == (length(hasGenre)+1) || tracks$episode_ref[i] != prevEpisodeRef) {
      episodeNo<-episodeNo+1
      prevEpisodeRef<-tracks$episode_ref[i]

      if(i == (length(hasGenre)+1) || (episodeNo-1) == totalEpisodes) {
        #record new data
        newData<-append(newData, aSum/totalTracksInEpisodes)
        days<-append(days, getDay(tracks$date[i-1]))
        displayDates<-append(displayDates, getDisplayDay(tracks$date[i-1]))

        #reset variables
        totalTracksInEpisodes<-0
        aSum<-0
        episodeNo<-1
      }
    }

    totalTracksInEpisodes<-totalTracksInEpisodes + 1
    aSum<-aSum + hasGenre[i]
  }

  # fit a linear trend and plot the per-block genre fractions over time
  res<-summary(lm(newData~days))
  pval<-res$coefficients[2,4]
  plot(days, newData,
       ylab=paste0("% of tracks (in ",totalEpisodes," episodes)"),
       xlab="date",
       main=paste0(genre, " (p value = ",signif(pval,3),")"),
       xaxt = "n", ylim=c(0,1))
  axis(1, labels=format(displayDates, "%m/%Y"), at=days)
  abline(a=res$coefficients[1,1], b=res$coefficients[2,1], col="red")
}

totalEpisodes<-10

# fraction of "classical" tracks over time, in blocks of 10 episodes
doAnal("classical", totalEpisodes)

# effect of the block size on the classical trend
par(mfrow=c(3,3))
for(i in 1:9) doAnal("classical", i)

# trends for a handful of broad genres
par(mfrow=c(2,3))
doAnal("classical", totalEpisodes)
doAnal("rock", totalEpisodes)
doAnal("jazz", totalEpisodes)
doAnal("disco", totalEpisodes)
doAnal("folk", totalEpisodes)
doAnal("rap", totalEpisodes)