
Example dataset to play with #25

Closed · adswa closed 5 years ago

adswa commented 5 years ago

I want to make note of an idea @mih brought up and continue the discussion about it:

Supply a toy dataset that readers can install and learn with, together with book sections that follow a narrative based on this dataset.

There are some requirements:

Having such a dataset plus the narrative will make progress on command and workflow explanations much easier, I believe. One idea @loj and @mih proposed was a music library. This has the great advantage of easy, almost domain-agnostic narratives, and I think the requirements I came up with could be fulfilled with it. Does anyone have additional thoughts on this idea in general, other requirements for such a dataset, or different content/narrative ideas?

mih commented 5 years ago

[Hmm, my comment this morning wasn't posted ... so again]

Using a music dataset could raise an eyebrow ("Why? There is Spotify!"), but I personally think this is something that could fly.

The question is what the purpose of such a dataset would be beyond plain consumption, i.e. what datalad run would be used for.

Maybe the production of a "mix tape" (or some other creative process) from a set of independently curated (or owned) music collections could be a use case that would carry quite far. Maybe we should start filling a table with a mapping of datalad functionality onto this scenario.
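
For illustration, such a step could look roughly like this (a sketch only; the track names are hypothetical, and ffmpeg is just a stand-in for whatever tool does the mixing):

```bash
# Record the mix-tape production so it can be reproduced with 'datalad rerun'.
datalad run \
  -m "Compile mix tape from two collections" \
  --input "collection-a/track01.mp3" \
  --input "collection-b/track07.mp3" \
  --output "mixtape.mp3" \
  "ffmpeg -i 'concat:collection-a/track01.mp3|collection-b/track07.mp3' -acodec copy mixtape.mp3"
```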

adswa commented 5 years ago

I have another idea, and I appreciate feedback on it:

A dataset that evolves during a university/high-school/online course. It might not be the current situation for many readers, but certainly everyone has experience with collecting their own growing amounts of notes, slides, books, or audio recordings from lectures of some sort.

The content is more or less arbitrary; no one has to actually read what is inside the repository. But we could turn it into a more-or-less DataLad-101 course.

The first chapters can focus on the easy stuff: the narrative can start with creating a dataset and some subdatasets, e.g. for books, slides, and homework (create). We can populate this by installing subdatasets, e.g. the DataLad machine-learning books dataset, or talks/posters the DataLad team created about DataLad (install), and also by instructing readers to create and populate their own datasets, either with books they have on their hard drive, or by pointing them to generally useful, free books (e.g. introductions to Unix, Git, ... maybe as an inspiration to read those, if they want) to download and save (add, save). Any sort of text file could be used to have them take some notes ("Create a / add to the mynotes.txt file, and write a 1-2 sentence reminder for yourself about the last datalad commands" in every section) and save their changes regularly (status, diff); a sketch of these opening steps follows below.

The more advanced chapters can include simple Python scripts (to introduce readers to the Python API; those could be developed in the book and copied by readers, or they could live in a Git repository somewhere) and simple datasets (maybe we can turn one of the classical data science datasets into a DataLad dataset, e.g. the Iris dataset). This could be a "final project" in the narrative (without any reader having to invest time in coding). On those, a datalad run and rerun would be easy to demonstrate. Alternatively, datalad run can also be demonstrated by renaming files. Subdatasets (e.g. a final project dataset) could be published to GitHub or similar third-party infrastructure (if we keep results small, this should work very easily, and we could demonstrate data retrieval as we did with REMoDNaV).
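
To sketch these opening steps (the subdataset URL points to the published machine-learning books dataset; everything else is illustrative):

```bash
# Create the course dataset and change into it.
datalad create DataLad-101
cd DataLad-101

# Install a published dataset as a subdataset under books/.
datalad install -d . -s https://github.com/datalad-datasets/machine-learning-books.git books/machine-learning-books

# Take a note and record the change; afterwards the dataset is clean again.
echo "datalad create builds a new, empty dataset" >> mynotes.txt
datalad save -m "Add a note on datalad create" mynotes.txt
datalad status
```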

I'm thinking of a repo structure a bit like this:

```
Datalad-Basics-101
    |__ books/
    |    |__ machine-learning-books (from datalad)
    |    |__ more-books               # maybe useful Unix resources? Data Science w/ Unix, Pro Git, ...?
    |__ homework/
    |    |__ midterm/
    |    |__ final/
    |         |__ code/               # pre-written, simple code to be installed from somewhere
    |         |__ data/               # very simple data science dataset
    |         |__ results/            # to be populated with a datalad run; maybe .tsv, .png, .svg, ... (i.e. many different) file types as output
    |__ mynotes.txt
    |__ slides_and_presentations/
         |__ DataLad_poster_yoda
         |__ DataLad_slides
```

In the book, we can recreate this structure step by step together with readers, but maybe we can also host a final, picture-perfect dataset in our datalad-handbook organization for people who don't want to follow along, to explore.
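
For instance, the skeleton could be laid out roughly like this (a sketch, not a prescription; note that Git does not track empty directories, so they only show up once files are added):

```bash
datalad create DataLad-Basics-101
cd DataLad-Basics-101

# The final project gets its own subdataset, registered in the parent.
datalad create -d . homework/final

# Plain directories for the rest of the layout.
mkdir -p books homework/midterm slides_and_presentations
touch mynotes.txt
datalad save -m "Set up the course layout"
```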

mih commented 5 years ago

I like this idea! A student experience/workflow should be easy to relate to for any prospective DataLad user. At the same time it puts almost no constraints on the specific tasks/scenarios.

I guess we should try to draft a list of the functionality we want to demo, and figure out how to approach them best and in which order.

adswa commented 5 years ago

Glad to hear that you like it.

Here is a table that we can fill in:

| DataLad command | Demo on | Order |
| --------------- | ------- | ----- |
| create          |         | early |
| install         |         | early |
| get             |         |       |
| add (?)         |         |       |
| save            |         |       |
| status          |         |       |
| diff            |         |       |
| run             |         | late  |
| rerun           |         | late  |
| publish         |         | late  |
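
To illustrate the "late" rows, the final project's run/rerun step could look roughly like this (a sketch only; all paths and the script name are hypothetical):

```bash
# Capture the analysis with full provenance so it can be reproduced later.
datalad run \
  -m "Analyze the iris data" \
  --input "homework/final/data/iris.csv" \
  --output "homework/final/results/" \
  "python homework/final/code/analyze.py"

# Demonstrate reproducibility: re-execute the command recorded in the last commit.
datalad rerun HEAD
```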

My current unordered thoughts are:

adswa commented 5 years ago

Noting a few takeaways from a discussion with @loj:

General points:

Order of commands and content associated with it:

Misc:

mih commented 5 years ago

Sounds great. I particularly appreciate the decision to prioritize creation over consumption.

A few comments:

> From a local workflow, we want to go to publish and hence the start of collaborative workflows. In the narrative, we'll package that as a student wanting to share their notes.

While the associated commands work, considerable work is necessary before they are in the same shape as the core family. That isn't necessarily a showstopper, but it's worth mentioning.
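
For illustration, with the current command names that step might look like this (the repository name is hypothetical, and create-sibling-github requires GitHub credentials):

```bash
# Create a matching repository on GitHub, registered as sibling "github".
datalad create-sibling-github DataLad-101-notes

# Push the dataset's history (and annexed content, where possible) to it.
datalad publish --to github
```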

> Once we have a Python script, we use it to demonstrate datalad run and datalad rerun. We will show concepts of content being locked by git-annex.

The issue with locked files, symlinks, etc. is a critical usability aspect from my POV. Of all things, this is likely the one that is least familiar and intuitive for people who don't breathe UNIX, and a great source of confusion. I have no immediate idea on how to address that best, but I feel this should at least be superficially mentioned in the "dataset basics". Maybe even by just explicitly describing the default look-and-feel of a dataset with symlinks and locked files, while simultaneously acknowledging the implications (e.g. tools like AFNI go insane) and pointing to possible alternatives (annex v7 mode, adjusted/unlocked branches). I can help with that, once you have determined what examples you want to show.
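
To make that concrete, the default look-and-feel could be demonstrated roughly like this (the file path is illustrative, and the annex commands are one possible alternative, not a recommendation):

```bash
# In a default dataset, an annexed file is a read-only symlink into the annex:
ls -l books/TLCL.pdf
# lrwxrwxrwx ... books/TLCL.pdf -> ../.git/annex/objects/...

# Make a single file editable again:
datalad unlock books/TLCL.pdf

# Or switch the whole repository to annex v7 adjusted/unlocked mode:
git annex upgrade
git annex adjust --unlock
```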

adswa commented 5 years ago

> The issue with locked files, symlinks, etc. is a critical usability aspect from my POV. Of all things, this is likely the one that is least familiar and intuitive for people who don't breathe UNIX, and a great source of confusion. I have no immediate idea on how to address that best, but I feel this should at least be superficially mentioned in the "dataset basics". Maybe even by just explicitly describing the default look-and-feel of a dataset with symlinks and locked files, while simultaneously acknowledging the implications (e.g. tools like AFNI go insane) and pointing to possible alternatives (annex v7 mode, adjusted/unlocked branches). I can help with that, once you have determined what examples you want to show.

We agree that it's a great source of confusion, and we are also uncertain how best to address it. Thanks for the suggestion of describing the default look; I will try, and we can improve what I come up with together. I will certainly need help, but will speak up once it becomes necessary.

adswa commented 5 years ago

The DataLad-101 example dataset has proven well suited for what we have so far. This issue served as a good basis, and most of the initial brainstorming is either already implemented or outdated, so I think we can close this issue.