Select domain / dataset

kescobo commented 3 years ago

Though (in my opinion) less important than #1, we nevertheless need to identify what knowledge domain we will use as a backbone for the lesson. In PM, I wrote:

I think we should spend the bulk of our energy on solidifying what we want to teach; the actual domain is almost irrelevant. That said, I think features of the data should be:

1. Accessible, by which I mean almost everyone can understand what the data is with minimal context. Even if it’s domain-specific, some things are more understandable to outsiders than others.

2. Inclusive, which is related to (1), but distinct. E.g., everyone can understand wins/losses of sports teams, but not everyone is into sports. We should try to find something that has broad appeal.

3. Evergreen - one thought I had was that doing something with coronavirus data would be cool, but this may feel a bit less engaging 5 years from now.

Despite that last point, a coronavirus-based project could definitely fulfill the first 2, especially since it could include both epidemiological and biological (sequencing) data types. I'm a little biased here though, as this is my field and I'm already developing some other materials along this line that I could double-dip :-D

Other possibilities:

There are lots of potential public-health type datasets. I was recently engaging with police violence datasets compiled by various organizations, though this may be considered too political and/or triggering. There are also lead exposure / air quality datasets I'm aware of.
Climate / weather data is very relatable and evergreen.
Lots of potential ideas from https://www.wikidata.org

jd-foster commented 3 years ago

Thanks for these good, solid ideas. Identifying a domain and a dataset seems a bit chicken-and-egg. But as you say, issue #1 comes first: learning objectives are the main game, and then we apply backward design.

A central resource for us generally should be The Carpentries Curriculum Development Handbook, which has a particular section on Picking a dataset. There are ten (!) criteria to guide the process, so the more ideas the better in terms of serving the learning objectives and matching up to the criteria.

FYI, my field is optimisation of energy systems, including working with databases of generators, transmission networks and associated time-series of energy generation, utilisation and demand. There may be suitable datasets in there or not. I'm fine working with data outside my area.

kescobo commented 3 years ago

which has a particular section on Picking a dataset.

:sheepish: Probably a good idea to RTF(course development)M... thanks for pointing that out! I have a lot of reading to do :laughing:

jd-foster commented 3 years ago

Oh, so do I. I am only just familiarising myself with the "CDH" (as some abbreviate it).

jd-foster commented 3 years ago

Here is a potentially interesting and neat data set: Genomic Outposts Serve the Phylogenomic Pioneers: Designing Novel Nuclear Markers for Genomic DNA Extractions of Lepidoptera

I know nothing about the field ... but everyone likes butterflies: Tree of Life: Nymphalidae. There are some great visuals and graphs, a fairly detailed appendix / supplementary data, with a mix of applicable data types (numbers, text, hierarchical types thereof). We would have to confirm that the dataset can be released under a CC0 license.

One last thing: the Julia butterfly is a species of Nymphalidae ;)

kescobo commented 3 years ago

Ha! Nice :-). Works for me, and I'd be happy to write up the biology sections. But the conclusions of that paper might be a bit advanced for a non-biologist. I'll dig in a bit more and see if I can come up with a simpler version.

carpentries-incubator / julia-data-workflow

Select domain / dataset #2