UofTCoders / workshops-dc-python

Python data carpentry workshops https://uoftcoders.github.io/workshops-dc-python/

More engaging data set suggestions #1

Open joelostblom opened 6 years ago

joelostblom commented 6 years ago

It would be great if we had a data set that was more broadly interesting than the ecology-specific data we have now, something that more participants can relate to. Some ideas:

It does not have to be fancy; something simple and easy to understand is good, as long as it has a few categorical variables to group/facet by. Any ideas?

SaraMati commented 6 years ago

1) Allen Brain Institute Cell Type database: e.g. all the features you can see in this interactive page can be downloaded in csv format. I can easily select understandable features like age, sex, race, brain region, some numeric features about the electrophysiology (e.g. simple features like the amplitude of the spike) and some features about the cell shape etc. into a csv for the lesson.

2) From the UCI repository:
  2.1) Student Alcohol consumption (this is the name that it's famous for, but it has ~30 attributes including social, gender and study data from secondary school students)
  2.2) Bike sharing data set

3) this page has a comprehensive list: https://github.com/awesomedata/awesome-public-datasets

joelostblom commented 6 years ago

Thanks for posting those links @SaraMati! I guess I already mentioned some of my thoughts on the call, but I will include them here for completeness:

  1. Personally, I think the Allen Brain data set (and their page in general) looks very interesting! However, I am unsure whether it will be interesting enough for participants who are not biologists... (similar problem to our current data set)
  2. The UCI-ml datasets repo always looks really promising to me, but then I get overwhelmed trying to dig through it... I think both the data sets you mention are interesting and could be good matches.
  3. I have been through this list before without finding the perfect match, although I have a few promising hits from my searches here that I need to go through more.

This is the link to the AirBnB data sets that @linamnt mentioned: http://insideairbnb.com/get-the-data.html. I think it is great that there is a data set for each city; this increases relevance to participants at our workshops in Toronto, and keeps the material adaptable to other locations if that is needed. The listings.csv.gz file has many interesting features which could be explored through split-apply-combine and faceting; for quantitative variables there are at least price and the different review ratings.
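
As a rough sketch of what split-apply-combine on the listings data could look like (assuming pandas, and assuming columns named `room_type` and `price`; the rows below are made up and not taken from the real file):

```python
import pandas as pd

# Made-up rows standing in for listings.csv.gz. The column names
# `room_type` and `price` are assumptions about the real file.
listings = pd.DataFrame(
    {
        "room_type": [
            "Entire home/apt",
            "Private room",
            "Private room",
            "Entire home/apt",
        ],
        "price": [150.0, 60.0, 80.0, 250.0],
    }
)

# Split by room type, apply the mean to price, combine into one Series.
mean_price = listings.groupby("room_type")["price"].mean()
print(mean_price)
```

The same categorical column could then also serve as the faceting variable in a plot.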


Another interesting data set is the gapminder data. I like this because we can get a chance to teach participants (and ourselves) about the world and dispel common myths. Since this data set is global and many people might have heard some of these things before, I think the relevance is still there. The issue here would be that there is so much data available that it might be hard to choose what to look at. There is also a 5-10 min survey to see how ignorant one is about the world, so we could use the questions there as a starting point for what to explore and then dig into them deeper.


I really like the idea of being able to teach something important about the world while participants are learning data wrangling techniques. That said, I still think the main criterion to optimize for is that the data evokes as much curiosity as possible and makes participants the most likely to formulate their own questions and hypotheses about the data. Having students think about the problems makes them more likely to learn and remember how to explore the data, and they will also have the most fun while doing it!

Thoughts?

joelostblom commented 6 years ago

Search engine for data mentioned by Lina http://namara.io/

joelostblom commented 6 years ago

@SaraMati @mbonsma @linamnt Don't forget to add your thoughts about the data sets here and I can start looking into what we think is the most suitable one. I will likely not have time to do this until around the 3-4 or maybe even 10th of August, but let's hear everyone's opinions before then!

linamnt commented 6 years ago

Of the ones mentioned, I would choose the ones accessible (in terms of understanding and interest) to the most people, so while I like the AirBnB one, it could be that DC is taught in places that insideairbnb.com has not yet scraped. I guess they could always do the biggest city nearby? Either way, I think that gapminder is always a really good one! There's so much you can do with the data and everyone can be interested, not just public health people.

mbonsma commented 6 years ago

My vote is also for either AirBnB or gapminder. My only concern with gapminder is that it's already the dataset used for the Software Carpentry workshops, so are we really creating anything new? Here's the subset that SWC is using: https://github.com/swcarpentry/r-novice-gapminder/tree/gh-pages/data. Maybe we could just look at different parts of the data, since it's so huge.

The AirBnB one seems really neat, and I'd be curious to explore it myself anyway and see if there are interesting associations that predict rating, etc. Are there any privacy issues we should watch out for with that?

mbonsma commented 6 years ago

I just took the gapminder survey and got 23% correct...

joelostblom commented 6 years ago

=) And now you would reeeaaaly like to explore that dataset to find out more about your misconceptions, right???!!! At least that is how we want the students to feel!

I actually think it is a good thing that a subset of the gapminder data set is used in Software Carpentry. Then we know they like this data set! I looked through the R lesson, and there isn't much narrated exploratory data analysis, it certainly feels more like an R function showcase (and it is in software carpentry so maybe that makes sense).

Our gapminder subset will also be slightly different. Not sure exactly what it would be yet, but in general I would like to cover as many of the survey questions as possible, as long as that data lends itself to deeper exploration in accordance with the concepts we want to teach.

SaraMati commented 6 years ago

I love the idea behind the Gapminder program/website and I agree with you that it's nice to have cool global stats data. I guess it's just about finding a good data set that has attributes we can use to teach the data wrangling techniques we want to teach. I just opened the first spreadsheet, which is the prevalence of people with HIV, and it seems it's only yearly percent values. This doesn't give us enough attributes to play with, right? (like doing group by or transforming between tidy and wide etc.). Do we have to find a good dataset among their list, or do you already know one? Or maybe I'm getting it all wrong?
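
For illustration, here is a minimal sketch of how a one-indicator spreadsheet with one column per year could still be moved between wide and tidy form and then grouped. It assumes pandas; the country names, the values, and the column name `hiv_prevalence` are invented placeholders, not real Gapminder data:

```python
import pandas as pd

# Invented numbers in a one-row-per-country, one-column-per-year layout.
wide = pd.DataFrame(
    {
        "country": ["A", "B"],
        "1990": [1.0, 5.0],
        "2000": [2.0, 3.0],
    }
)

# Wide -> tidy: one row per country-year observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="hiv_prevalence")
tidy["year"] = tidy["year"].astype(int)

# Tidy -> wide again, to show the round trip.
wide_again = tidy.pivot(index="country", columns="year", values="hiv_prevalence")

# Once the table is tidy, a group by is possible, e.g. the mean per year.
mean_by_year = tidy.groupby("year")["hiv_prevalence"].mean()
print(mean_by_year)
```

Richer grouping would still need extra categorical columns (for example a region per country), which would have to come from another file.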

mbonsma commented 6 years ago

I think we should pick two or maybe 3 related files to work with so that we can look at correlations between stuff but so that it doesn't get too complicated. Two cool areas to focus on would be climate and education, I think. We can also combine some files to make our own larger file to work with.

Education-related

Climate-related
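
A minimal sketch of combining two indicator files into one larger table, assuming pandas and assuming both files have already been reshaped to one row per country and year. The column names and values here are placeholders, not actual Gapminder indicators:

```python
import pandas as pd

# Two invented indicator tables in tidy form.
schooling = pd.DataFrame(
    {
        "country": ["A", "A", "B"],
        "year": [2000, 2010, 2000],
        "years_in_school": [6.0, 7.5, 9.0],
    }
)
co2 = pd.DataFrame(
    {
        "country": ["A", "B", "B"],
        "year": [2000, 2000, 2010],
        "co2_per_capita": [1.2, 8.0, 7.5],
    }
)

# An outer join keeps country-years that appear in only one file;
# the indicator missing for those rows becomes NaN.
combined = schooling.merge(co2, on=["country", "year"], how="outer")

# Rows where both indicators are present, usable for correlations.
complete = combined.dropna()
print(combined)
```

Checking how many rows survive `dropna()` gives a quick sense of how well two files overlap before committing to them.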

mbonsma commented 6 years ago

Those education files really interest me, it would be cool to make a big dataset combining all the education data to work with.

SaraMati commented 6 years ago

I know it's not good to add to the options and make decision making more complicated, but with respect to World data, I just ran into a very cool website:

https://ourworldindata.org/

Go to the different tabs (population, health, food, ... education, media, culture). They have already visualised the data, which you can immediately download from the Data tab under each figure. It may be easier to browse through and gather the data we find useful.

SaraMati commented 6 years ago

On another note, it seems we're all away for the month of August and I'm not sure we have enough time to make a nice data set.

I should once again point out that it was my mistake to refer to that UCI "Student Performance" data set only as alcohol consumption (it's because I knew the data set from a fun example that only worked with this attribute). The data set has 33 attributes that pretty much cover those education-related indicators you mentioned @mbonsma, albeit not worldwide. The data is only from two schools and two courses (Math and Portuguese). If we find we don't have the time to put together data from gapminder (or ourworldindata.org?), this can come in handy.

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2 sex - student's sex (binary: 'F' - female or 'M' - male)
3 age - student's age (numeric: from 15 to 22)
4 address - student's home address type (binary: 'U' - urban or 'R' - rural)
5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12 guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)

These grades are related to the course subject, Math or Portuguese:
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
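
To give an idea of how these coded attributes could be used in a lesson, here is a small sketch (assuming pandas) that decodes `Medu` with the labels from the attribute list and then groups the final grade by it. The rows are invented for illustration and are not taken from student-mat.csv or student-por.csv:

```python
import pandas as pd

# Invented rows using column names from the attribute list.
students = pd.DataFrame(
    {
        "school": ["GP", "GP", "MS", "MS"],
        "Medu": [4, 1, 2, 4],
        "G3": [15, 10, 11, 13],
    }
)

# Labels copied from the description of Medu in the attribute list.
education_levels = {
    0: "none",
    1: "primary education (4th grade)",
    2: "5th to 9th grade",
    3: "secondary education",
    4: "higher education",
}
students["Medu_label"] = students["Medu"].map(education_levels)

# Mean final grade per level of mother's education.
mean_final_grade = students.groupby("Medu_label")["G3"].mean()
print(mean_final_grade)
```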