
Example dataset to play with #25

Closed · adswa closed 5 years ago

adswa commented 5 years ago

I want to make note of an idea @mih brought up and continue the discussion about it:

Supply a toy dataset that readers can install and learn with, together with book sections that follow a narrative based on this dataset.

There are some requirements:

Having such a dataset plus the narrative will make progress on command and workflow explanations much easier, I believe. One idea @loj and @mih proposed was a music library. This has the great advantage of easy, almost domain-agnostic narratives, and I think the requirements I came up with could be fulfilled with it. Does anyone have additional thoughts on this idea in general, other requirements for such a dataset, or different content/narrative ideas?

mih commented 5 years ago

[Hmm, my comment this morning wasn't posted ... so again]

Using a music dataset could raise an eyebrow ("Why? There is Spotify!"), but I personally think this is something that could fly.

The question is what the purpose of such a dataset would be beyond plain consumption, i.e. what datalad run would be used for.

Maybe the production of a "mix tape" (or some other creative process) from a set of independently curated (or owned) music collections could be a use case that would carry quite far. Maybe we should start filling a table with a mapping of datalad functionality onto this scenario.
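
For illustration, such a step could look roughly like this (a sketch only; the track names are hypothetical, and ffmpeg is just a stand-in for whatever tool does the mixing):

```bash
# Record the mix-tape production so it can be reproduced with 'datalad rerun'.
datalad run \
  -m "Compile mix tape from two collections" \
  --input "collection-a/track01.mp3" \
  --input "collection-b/track07.mp3" \
  --output "mixtape.mp3" \
  "ffmpeg -i 'concat:collection-a/track01.mp3|collection-b/track07.mp3' -acodec copy mixtape.mp3"
```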

adswa commented 5 years ago

I have another idea, and I appreciate feedback on it:

A dataset that evolves during a university/high-school/online course. It might not be the current situation for many readers, but certainly everyone has experience with collecting their own growing amounts of notes, slides, books, or audio recordings from lectures of some sort.

The content is more or less arbitrary; no one has to actually read what is inside the repository. But we could turn it into a more-or-less DataLad-101 course.

The first chapters can focus on the easy stuff: the narrative can start with creating a dataset and some subdatasets, e.g. for books, slides, and homework (create). We can populate this by installing subdatasets, e.g. the DataLad machine-learning books dataset, or talks/posters the DataLad team created about DataLad (install), and also by instructing readers to create and populate their own datasets, either with books they have on their hard drive, or by pointing them to generally useful, free books (e.g. introductions to Unix, Git, ... maybe as an inspiration to read those, if they want) to download and save (add, save). Any sort of text file could be used to have them take some notes ("Create a / add to the mynotes.txt file, and write a 1-2 sentence reminder for yourself about the last datalad commands" in every section) and save their changes regularly (status, diff); a sketch of these opening steps follows below.

The more advanced chapters can include simple Python scripts (to introduce readers to the Python API; those could be developed in the book and copied by readers, or they could live in a Git repository somewhere) and simple datasets (maybe we can turn one of the classical data science datasets into a DataLad dataset, e.g. the Iris dataset). This could be a "final project" in the narrative (without any reader having to invest time in coding). On those, a datalad run and rerun would be easy to demonstrate. Alternatively, datalad run can also be demonstrated by renaming files. Subdatasets (e.g. a final project dataset) could be published to GitHub or similar third-party infrastructure (if we keep results small, this should work very easily, and we could demonstrate data retrieval as we did with REMoDNaV).
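
To sketch these opening steps (the subdataset URL points to the published machine-learning books dataset; everything else is illustrative):

```bash
# Create the course dataset and change into it.
datalad create DataLad-101
cd DataLad-101

# Install a published dataset as a subdataset under books/.
datalad install -d . -s https://github.com/datalad-datasets/machine-learning-books.git books/machine-learning-books

# Take a note and record the change; afterwards the dataset is clean again.
echo "datalad create builds a new, empty dataset" >> mynotes.txt
datalad save -m "Add a note on datalad create" mynotes.txt
datalad status
```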

I'm thinking of a repo structure a bit like this:

```
Datalad-Basics-101
    |__ books/
    |    |__ machine-learning-books (from datalad)
    |    |__ more-books               # maybe useful Unix resources? Data Science w/ Unix, Pro Git, ...?
    |__ homework/
    |    |__ midterm/
    |    |__ final/
    |         |__ code/               # pre-written, simple code to be installed from somewhere
    |         |__ data/               # very simple data science dataset
    |         |__ results/            # to be populated with a datalad run; maybe .tsv, .png, .svg, ... (i.e. many different) file types as output
    |__ mynotes.txt
    |__ slides_and_presentations/
         |__ DataLad_poster_yoda
         |__ DataLad_slides
```

In the book, we can recreate this structure step by step together with readers, but maybe we can also host a final, picture-perfect dataset in our datalad-handbook organization for people who don't want to follow along, to explore.
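
For instance, the skeleton could be laid out roughly like this (a sketch, not a prescription; note that Git does not track empty directories, so they only show up once files are added):

```bash
datalad create DataLad-Basics-101
cd DataLad-Basics-101

# The final project gets its own subdataset, registered in the parent.
datalad create -d . homework/final

# Plain directories for the rest of the layout.
mkdir -p books homework/midterm slides_and_presentations
touch mynotes.txt
datalad save -m "Set up the course layout"
```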

mih commented 5 years ago

I like this idea! A student experience/workflow should be easy to relate to for any prospective DataLad user. At the same time it puts almost no constraints on the specific tasks/scenarios.

I guess we should try to draft a list of the functionality we want to demo, and figure out how to approach them best and in which order.

adswa commented 5 years ago

Glad to hear that you like it.

Here is a table that we can fill in:

| DataLad command | Demo on | Order |
| --------------- | ------- | ----- |
| create          |         | early |
| install         |         | early |
| get             |         |       |
| add (?)         |         |       |
| save            |         |       |
| status          |         |       |
| diff            |         |       |
| run             |         | late  |
| rerun           |         | late  |
| publish         |         | late  |
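
To illustrate the "late" rows, the final project's run/rerun step could look roughly like this (a sketch only; all paths and the script name are hypothetical):

```bash
# Capture the analysis with full provenance so it can be reproduced later.
datalad run \
  -m "Analyze the iris data" \
  --input "homework/final/data/iris.csv" \
  --output "homework/final/results/" \
  "python homework/final/code/analyze.py"

# Demonstrate reproducibility: re-execute the command recorded in the last commit.
datalad rerun HEAD
```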

My current unordered thoughts are:

adswa commented 5 years ago

Noting a few takeaways from a discussion with @loj:

General points:

Order of commands and content associated with it:

Misc:

mih commented 5 years ago

Sounds great. I particularly appreciate the decision to prioritize creation over consumption.

A few comments:

> From a local workflow, we want to go to publish and hence the start of collaborative workflows. In the narrative, we'll package that as a student wanting to share their notes.

While the associated commands work, considerable work is necessary before they are in the same shape as the core family. That isn't necessarily a showstopper, but it's worth mentioning.
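
For illustration, with the current command names that step might look like this (the repository name is hypothetical, and create-sibling-github requires GitHub credentials):

```bash
# Create a matching repository on GitHub, registered as sibling "github".
datalad create-sibling-github DataLad-101-notes

# Push the dataset's history (and annexed content, where possible) to it.
datalad publish --to github
```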

> Once we have a Python script, we use it to demonstrate datalad run and datalad rerun. We will show concepts of content being locked by git-annex.

The issue with locked files, symlinks, etc. is a critical usability aspect from my POV. Of all things, this is likely the one that is least familiar and intuitive for people who don't breathe UNIX, and a great source of confusion. I have no immediate idea on how to address that best, but I feel this should at least be superficially mentioned in the "dataset basics". Maybe even by just explicitly describing the default look-and-feel of a dataset with symlinks and locked files, while simultaneously acknowledging the implications (e.g. tools like AFNI go insane) and pointing to possible alternatives (annex v7 mode, adjusted/unlocked branches). I can help with that, once you have determined what examples you want to show.
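
To make that concrete, the default look-and-feel could be demonstrated roughly like this (the file path is illustrative, and the annex commands are one possible alternative, not a recommendation):

```bash
# In a default dataset, an annexed file is a read-only symlink into the annex:
ls -l books/TLCL.pdf
# lrwxrwxrwx ... books/TLCL.pdf -> ../.git/annex/objects/...

# Make a single file editable again:
datalad unlock books/TLCL.pdf

# Or switch the whole repository to annex v7 adjusted/unlocked mode:
git annex upgrade
git annex adjust --unlock
```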

adswa commented 5 years ago

> The issue with locked files, symlinks, etc. is a critical usability aspect from my POV. Of all things, this is likely the one that is least familiar and intuitive for people who don't breathe UNIX, and a great source of confusion. I have no immediate idea on how to address that best, but I feel this should at least be superficially mentioned in the "dataset basics". Maybe even by just explicitly describing the default look-and-feel of a dataset with symlinks and locked files, while simultaneously acknowledging the implications (e.g. tools like AFNI go insane) and pointing to possible alternatives (annex v7 mode, adjusted/unlocked branches). I can help with that, once you have determined what examples you want to show.

We agree that it's a great source of confusion, and we are also uncertain how best to address it. Thanks for the suggestion of describing the default look; I will try, and we can improve what I come up with together. I will certainly need help, but will speak up once it becomes necessary.

adswa commented 5 years ago

The DataLad-101 example dataset has proven well suited for what we have so far. This issue served as a good basis, and most of the initial brainstorming is either already implemented or outdated, so I think we can close this issue.