alan-turing-institute / TuringDataStories

TuringDataStories: An open community creating “Data Stories”: A mix of open data, code, narrative 💬, visuals 📊📈 and knowledge 🧠 to help understand the world around us.

[Turing Data Story] Desert Island Discs - UK notables and their tunes #124

Closed billfinnegan closed 2 years ago

billfinnegan commented 3 years ago

Story description

Please provide a high level description of the Turing Data Story

Since 1942, notable people have gone on the BBC Radio 4 programme Desert Island Discs to share their eight favourite songs, a book, and a luxury item that they would want if stuck on a desert island. As a lighter data story, it could be fun to explore the themes and trends by combining data from open data sets for people and music, and perhaps even to predict what a future celebrity might pick.

Which datasets will you be using in this Turing Data Story?

The BBC website includes ~2300 episodes, and it looks like somebody took a stab at compiling them into a dataset a few years back. Information about songs (year, artist, genre) and people (birth date, profession) could come from several places, but MusicBrainz and Pantheon look promising.

Additional context

Again, I'm more of an idea person than a data scientist, and this is a long way from my research, but it is something I've been hoping someone would put together (and I think the programme has a big following that might be interested in this kind of thing).

Ethical guideline

Ideally a Turing Data Story has these properties and follows the 5 safes framework.

Current status

Updates

samvanstroud commented 3 years ago

Thanks a lot for this story suggestion @billfinnegan! I really like the opportunities for exploration that this would offer. Firstly some basic analysis of music taste as a function of time, profession, etc, finding out which "notables" have the most similar taste, would be fun.
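One simple way to make "most similar taste" concrete is set overlap between the artists each castaway picked. The sketch below uses Jaccard similarity; the castaway names, the artist sets, and the function names are all made up for illustration, and a real analysis might prefer a different measure.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_pairs(choices: dict, top_n: int = 3) -> list:
    """Score every pair of castaways by the overlap of their chosen artists."""
    scored = [
        (jaccard(choices[x], choices[y]), x, y)
        for x, y in combinations(sorted(choices), 2)
    ]
    # Highest similarity first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda t: (-t[0], t[1], t[2]))[:top_n]


# Hypothetical data: not taken from the real dataset.
example_choices = {
    "Castaway A": {"Bach", "The Beatles", "Mozart"},
    "Castaway B": {"Bach", "Mozart", "Elgar"},
    "Castaway C": {"David Bowie", "The Beatles"},
}

for score, x, y in most_similar_pairs(example_choices):
    print(f"{x} / {y}: {score:.2f}")
```

With only eight picks per guest the overlaps will be sparse, so grouping by genre or era before comparing may give more useful results.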

Additionally, quoting from the creator of the DID dataset you link:

It should be said that I see this as a dataset about the BBC program, and the choices its creators have made. It is certainly a better representation of the BBC and its choices, more than it is a complete representation of any of the guests themselves.

This also opens up a new angle: analysing the show itself, the representation of different groups on the show, etc.

The analysis would present opportunities for the reader to learn about simple statistical techniques, data visualisation, and RESTful APIs. Of course, we could take things further if we wanted to analyse information contained in song lyrics and/or spectrograms, as suggested by @DavidBeavan.

I'd definitely be interested in taking more of a look at this on the technical side. Cheers.

samvanstroud commented 3 years ago

Looking again at the linked dataset, it seems to contain only around 400 episodes of the show (mostly from around 2000-2010). It might also be interesting to write a simple web scraper as part of the story and extract info for all 2300 episodes on the BBC website.
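As a rough starting point for such a scraper, the sketch below pulls candidate episode links out of a listing page using only the standard library. The `/programmes/<id>` link pattern and the sample HTML are assumptions for illustration, not the confirmed structure of the BBC site, and the fetching step is left out on purpose.

```python
import re
from html.parser import HTMLParser

# Assumption: episode pages are linked as /programmes/<id>, where <id> is a
# short lowercase alphanumeric identifier. Check this against the real pages.
EPISODE_HREF = re.compile(r"^/programmes/[a-z0-9]{8,}$")


class EpisodeLinkParser(HTMLParser):
    """Collect unique hrefs that look like episode pages, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if EPISODE_HREF.match(href) and href not in self.links:
            self.links.append(href)


def extract_episode_links(html: str) -> list:
    parser = EpisodeLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical listing page, standing in for a downloaded HTML document.
SAMPLE_HTML = """
<ul>
  <li><a href="/programmes/x0000001">Castaway One</a></li>
  <li><a href="/programmes/x0000002">Castaway Two</a></li>
  <li><a href="/programmes/x0000001">Castaway One (repeat link)</a></li>
  <li><a href="/sounds/help">Help</a></li>
</ul>
"""

print(extract_episode_links(SAMPLE_HTML))
```

A real scraper would also need to page through the episode listing, pause between requests, and respect the site's terms of use.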

crangelsmith commented 3 years ago

Notes from the Tech Talk scoping session:

People interested in taking this story forward: @crangelsmith, @mishkanemes, @edwardchalstrey1, @Arielle-Bennett, @gabastil

edwardchalstrey1 commented 3 years ago

What I meant about the BBC and ggplot2 was that they have their own version, bbplot: https://bbc.github.io/rcookbook/#how_to_create_bbc_style_graphics <- might be fun to use since we're visualising BBC Radio 4 data!

samvanstroud commented 3 years ago

Hackmd for the meeting this afternoon: https://hackmd.io/HVrToWexR3yqtiF34LCo7w

edwardchalstrey1 commented 3 years ago

for the classical pieces, could use: https://openopus.org/

crangelsmith commented 3 years ago

@all-contributors please add @billfinnegan for ideas

allcontributors[bot] commented 3 years ago

@crangelsmith

I've put up a pull request to add @billfinnegan! :tada:

billfinnegan commented 3 years ago

Data clean up: from a scan of the data, some std_name values look a little funny (single names):

billfinnegan commented 3 years ago

Data clean up: from scan of data, some unusual episodes (multiple guests, retrospective, etc.):

billfinnegan commented 3 years ago

Data clean up: prefixes/suffixes that may need to be stripped. Note: stripping away lord (or earl, viscount, marquess) when followed by “of [a location]” probably isn’t going to work. In other cases, removing the title will leave a single word for the name, which might be an honorific rather than an actual surname, so it also won't match.

admiral and archbishop baroness bishop canon captain  chief colonel commander commissioner countess dame dfc dr dso duchess earl frs general group hah hon jar judge justice lady lieutenant lord  lt madame marchioness marquess most mp mrs obe om price princess professor qc ra rabbi rev reverend  rm rt  signalman sir sister the vc very viscount wing 
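A minimal sketch of how this stripping could work, with both caveats above handled by flagging the name for manual review instead of guessing. The function name, the review flag, and the assumption that names are compared in lower case are illustrative choices, not the project's actual cleaning code; the example names are invented.

```python
# Tokens copied from the list above.
TITLES = set(
    """admiral and archbishop baroness bishop canon captain chief colonel
    commander commissioner countess dame dfc dr dso duchess earl frs general
    group hah hon jar judge justice lady lieutenant lord lt madame marchioness
    marquess most mp mrs obe om price princess professor qc ra rabbi rev
    reverend rm rt signalman sir sister the vc very viscount wing""".split()
)

# Titles that cannot be stripped safely when followed by "of [a location]".
PEERAGE = {"lord", "earl", "viscount", "marquess"}


def strip_titles(name: str):
    """Return (cleaned_name, needs_review) for a castaway name."""
    tokens = name.lower().replace(".", "").replace(",", "").split()

    # Caveat 1: "<peerage title> ... of <place>" is left untouched and flagged.
    for i, tok in enumerate(tokens):
        if tok in PEERAGE and "of" in tokens[i + 1:]:
            return " ".join(tokens), True

    # Strip title tokens from the front and the back only, never the middle.
    start, end = 0, len(tokens)
    while start < end and tokens[start] in TITLES:
        start += 1
    while end > start and tokens[end - 1] in TITLES:
        end -= 1
    kept = tokens[start:end]

    # Caveat 2: a single remaining word is unlikely to match, so flag it.
    return " ".join(kept), len(kept) < 2


for raw in ["Sir Sam Example", "Rt Hon Jane Example MP", "Earl of Somewhere", "Lady Example"]:
    print(raw, "->", strip_titles(raw))
```

Some tokens in the list (for example "price") can also be ordinary surnames, which is another reason to check flagged names by hand.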

edwardchalstrey1 commented 3 years ago


As an Englishman I'd like to apologise for the length of this list 😆

billfinnegan commented 3 years ago

Draft outline at: https://hackmd.io/@B9ykkfK4TYCyUIbn6sgNdw/Hy3PXmlPu

billfinnegan commented 3 years ago

Based on the episode list on Wikipedia, I've created an alternate list of castaways which includes the link to their Wikipedia profile - see DID_castaways_wikipedia.csv in Google Drive. I also used this as a chance to sort out the multi-guest episodes indicated above (although I left the "Eight inhabitants from the Ascension Islands in the South Atlantic" as I couldn't find any more details about them), so there is only one person per row, with the episode date and link repeated. This goes all the way to the end of May and has a castaway count of 3294, compared to Andrew's episode count of 3225 up to Feb 2020.
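For anyone reproducing the one-person-per-row layout, a sketch along these lines might help. The column names (`date`, `castaways`, `episode_url`), the `;` separator, and the sample rows are hypothetical and do not reflect the actual contents of DID_castaways_wikipedia.csv.

```python
import csv
import io


def one_castaway_per_row(rows, sep=";"):
    """Expand multi-guest episodes into one row per castaway,
    repeating the episode date and link on each row."""
    out = []
    for row in rows:
        names = [n.strip() for n in row["castaways"].split(sep) if n.strip()]
        for name in names:
            out.append(
                {
                    "castaway": name,
                    "date": row["date"],
                    "episode_url": row["episode_url"],
                }
            )
    return out


# Hypothetical input: one single-guest episode and one two-guest episode.
SAMPLE_CSV = (
    "date,castaways,episode_url\n"
    "2000-01-02,Castaway A,https://example.org/ep1\n"
    "2000-01-09,Castaway B; Castaway C,https://example.org/ep2\n"
)

episodes = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
castaways = one_castaway_per_row(episodes)
print(len(episodes), "episodes ->", len(castaways), "castaway rows")
```

This also shows why a castaway count can be higher than an episode count once multi-guest episodes are split.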

samvanstroud commented 3 years ago

Wow, great, thanks @billfinnegan! I'll get on with updating the Wikidata entries next week!