Reproducible-Science-Curriculum / RR-Jupyter-Hackathon-Jan-2017

Curriculum Development Hackathon on Reproducible Research using Jupyter Notebooks, to be held Jan 9-11 at BIDS in Berkeley, CA
Creative Commons Zero v1.0 Universal

what will be the motivating dataset(s)? #3

Closed raynamharris closed 7 years ago

raynamharris commented 7 years ago

I'm wondering what data will be the focal point of the curricula? Where should it fall on the spectrum from domain-specific to generally applicable to a wide audience?

Data Carpentry and Software Carpentry have a variety of domain-specific (ecology, genomics) and more general (patient inflammation, gapminder) datasets.

I also recently saw a nice reproducible IPython workflow around a Zika RNA-seq study that seems to span basic and clinical research. I know @olgabot has some nice published RNA-seq data that she has used for teaching purposes.

Other thoughts, suggestions, comments?

olgabot commented 7 years ago

Yes, my (single-cell) RNA-seq teaching data are here: https://github.com/YeoLab/single-cell-bioinformatics. One dataset has 18 samples and ~6,000 features (shalek2013) and the other has ~300 samples and ~20,000 features (macaulay2016). I would teach a full example, e.g. PCA, with the small dataset and then they would have to do it on the large dataset.

In my experience, when teaching machine learning or programming to biologists, the datasets and questions must be biologically driven. Biologists get a bad rap for "not getting" math, and that's because the examples they're taught just aren't relevant to them. What was unintuitive to me was that when I used datasets from a paper, which I thought would be more work for them to understand since papers are hard, they totally got it. But when I used toy datasets like the MNIST handwritten digits dataset, they could understand the point of k-means clustering on THIS dataset, but it was too much of a jump for them to relate it back to their own research. Biologists just don't have the time to make that jump -- if they eventually want to start doing informatics full-time then yes they'll get obsessive about alpha values but for now they just need what works.

All that leads to my question: Is the goal of the workshop to develop a single curriculum or make a "curriculum platform" for teaching reproducible research? My impression was towards the platform side so we could remix for our own domains, because I definitely couldn't get biologists to sift through particle physics data.

raynamharris commented 7 years ago

Thanks Olga! These are all very good points. I didn't articulate it, but I do agree that being able to have a single platform that would work with multiple data sets would be awesome for many reasons. You mention being able to remix it for different domains, but you could even think about a single researcher or lab that has multiple projects or does integrative research with multiple data types.

burkesquires commented 7 years ago

I have had a very similar experience to Olga with regard to biologists and examples.

That being said, developing a "platform" makes the most sense to me, and perhaps a biological example could be used as a test case for the platform.

Burke

choldgraf commented 7 years ago

I think @olgabot has the right question at the end of her comment above. We should nail down the goals of this hackathon to make sure we're all on the same page. For me, I think it'd be most useful to try and nail down a framework that is somewhat domain-agnostic, and then come up with concrete examples of how our respective fields could utilize that framework for our own teaching use cases.

hlapp commented 7 years ago

FYI, the current Reproducible Research workshop curriculum based around RMarkdown uses a single domain-neutral data set (a simplified version of the gapminder dataset) throughout the lessons. I think this worked reasonably well, including for the biologists who were in the course (which was the vast majority).

That isn't to say the conclusion for the Jupyter-based curriculum can't be different.

hlapp commented 7 years ago

> We should nail down the goals of this hackathon to make sure we're all on the same page.

The goal of the hackathon is to create a first draft of a teachable 2-day workshop curriculum for practices that promote reproducible research using Jupyter Notebooks. I think what you mean to ask about, though, is the learning objectives of the curriculum. I believe @ErinBecker or @tracykteal wanted to think about those and post a draft to start discussion and refinement. Not sure where they're at with that.

choldgraf commented 7 years ago

Yep, sorry, I should have been clearer. I just like to hash out details sooner rather than later so that we don't end up talking past each other :)

It sounds like some other folks already have thoughts on this so looking forward to hearing from the group!

raynamharris commented 7 years ago

Thanks @hlapp. I'm a biologist, and I've been a big fan of the domain-neutral gapminder dataset, which @naupaka introduced me to a year or so ago.

raynamharris commented 7 years ago

@hlapp Is there a link to the Rmarkdown Reproducible Research workshop curriculum? I'd be very curious to see it, mostly because I love Rmarkdown but also because it would be useful to see what y'all have been producing.

ahofmann4 commented 7 years ago

@raynamharris I think this link is what you are looking for.

The links to the other workshop that was developed are in the call for participation.

elliewix commented 7 years ago

Designing a flexible framework where the right dataset can be slotted in without an egregious amount of fussing will make what we come up with much more powerful. For example, a group of us were able to pretty easily adapt the Data Carpentry ecology materials to a library/humanities crowd because they include sections on text, categorical, ordinal, and continuous data types. We had to come up with our own version of such a dataset that would interest our folk, but it slotted in nicely because all the functional pieces were there. We would not have been able to do that with the inflammation data because it is purely numerical.

So coming at this with the perspective of talking about handling specific types of data first then adding in example data (+1 gapminder) second should help us keep things flexible.

nerdcommander commented 7 years ago

@elliewix I like the way you put this. I'm teaching a class with Physics, Chemistry, Neuroscience, and Biology undergrads, and part of my job in the class is to try to relate things to all of those different student populations, so flexible is good. But... I'd love for these materials to have some data built in, at least for me if not also for the students.

elliewix commented 7 years ago

@nerdcommander Sorry, I should clarify: I wasn't suggesting taking the example data out, but shifting our design perspective to an activity/framework-first approach. So we construct examples and highlight the core activities that are central to working with each type of data value, and then plug in our example dataset. This means that an instructor (or a student working on their own!) should hopefully be able to follow the lessons and adjust them to their own data on the fly. This won't be perfect at every turn, so it is more of a perspective than a rule.
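One way to picture the activity-first idea is the rough sketch below. It is only an illustration under stated assumptions: the function names (`infer_kind`, `summarize`, `summarize_table`), the rule "a column is continuous if every value parses as a number", and the tiny table are all hypothetical and not part of any existing lesson.

```python
from collections import Counter
from statistics import mean


def infer_kind(values):
    """Classify a column as 'continuous' or 'categorical' (hypothetical rule)."""
    try:
        [float(v) for v in values]
        return "continuous"
    except ValueError:
        return "categorical"


def summarize(values):
    """Pick the summary that fits the column's kind, not the dataset."""
    if infer_kind(values) == "continuous":
        numbers = [float(v) for v in values]
        return {"kind": "continuous", "min": min(numbers),
                "mean": mean(numbers), "max": max(numbers)}
    return {"kind": "categorical", "counts": dict(Counter(values))}


def summarize_table(rows):
    """Summarize every column of a table given as a list of dicts."""
    columns = rows[0].keys()
    return {col: summarize([row[col] for row in rows]) for col in columns}


# Made-up placeholder rows; any table with the same mix of types would do.
rows = [
    {"site": "A", "count": "3"},
    {"site": "B", "count": "5"},
    {"site": "A", "count": "4"},
]
report = summarize_table(rows)
```

The activity (summarizing a categorical or a continuous column) is written once, and the example dataset is just whatever rows get passed in.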

nerdcommander commented 7 years ago

@elliewix that's what I heard, and I like the idea!

tracykteal commented 7 years ago

Since this is a hackathon, it is definitely helpful to have our dataset determined up front, so we have a dataset to work with throughout. We've had good luck with the Gapminder dataset in the R Reproducible Research curriculum as well as in the Software Carpentry r-novice materials (https://swcarpentry.github.io/r-novice-gapminder/) because, as @raynamharris and @naupaka mentioned, it's domain-neutral, interesting, and accessible.

Also, the data is publicly available and easy to access. Jenny Bryan has created a gapminder dataset (https://github.com/jennybc/gapminder) that's good for teaching. So, we're thinking the best plan is to use this dataset to develop the curriculum. Then if people want to modify lessons for more specific domains, they can swap in a more domain-specific dataset.
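As a rough sketch of what "swap in" could look like in practice, a lesson might name the columns it needs in one place so that only that description changes with the dataset. Everything here is an assumption for illustration: the `LessonDataset` class and `mean_by_group` function are hypothetical, and the sample values are placeholders, not real gapminder figures. Only the column names `continent` and `lifeExp` come from the gapminder table.

```python
import csv
import io
from dataclasses import dataclass
from statistics import mean


@dataclass
class LessonDataset:
    """Names the columns a lesson needs, so the data can be swapped."""
    group_col: str
    value_col: str


# Column names follow the gapminder table (continent, lifeExp).
GAPMINDER = LessonDataset(group_col="continent", value_col="lifeExp")


def mean_by_group(rows, dataset):
    """Average dataset.value_col within each value of dataset.group_col."""
    groups = {}
    for row in rows:
        key = row[dataset.group_col]
        groups.setdefault(key, []).append(float(row[dataset.value_col]))
    return {name: mean(values) for name, values in groups.items()}


# Placeholder values for illustration only, not real gapminder figures.
sample_csv = "continent,lifeExp\nAsia,60\nAsia,70\nEurope,75\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
result = mean_by_group(rows, GAPMINDER)
```

An instructor with domain data would define another `LessonDataset`, for example with hypothetical columns such as `cell_type` and `expression`, and leave the lesson code itself unchanged.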

fmichonneau commented 7 years ago

+1 for gapminder

tracykteal commented 7 years ago

These are links to the R Reproducible research lessons

R Reproducible research lessons

dsoto commented 7 years ago

I like the idea of having a default, domain-neutral data set as well as the flexibility to insert a custom dataset if the instructor would like.

In the absence of a custom dataset, we could create an exercise where participants talk in breakout groups about their own data and brainstorm ways to apply the workshop lessons to their specific research problems.

ErinBecker commented 7 years ago

From the current conversation, it seems like the consensus is that we'll be using gapminder for the core lessons? Is this accurate? If so, should we close this issue?