Closed: adswa closed this issue 5 years ago
[Hmm, my comment this morning wasn't posted ... so again]
Using a music dataset could raise an eyebrow ("Why? There is Spotify!"), but I personally think this is something that could fly.
The question is what the purpose of such a dataset would be beyond plain consumption, i.e., what `datalad run` would be used for.
Maybe the production of a "mix tape" (or some other creative process) from a set of independently curated (or owned) music collections could be a use case that would carry quite far. Maybe we should start filling a table with a mapping of datalad functionality onto this scenario.
I have another idea, and I appreciate feedback on it:
A dataset that evolves during a university/high-school/online course. It might not be the current situation for many readers, but certainly everyone has experience with collecting their own, growing amounts of notes, slides, books, or audio recordings from some sort of lecture.
The content is more or less arbitrary; no one has to actually read what is inside the repository. But we could turn it into a more-or-less DataLad-101 course. The first chapters can focus on easy stuff:
The narrative can start with creating a dataset and some subdatasets, e.g. for books, slides, and homework (`create`). We can populate these by installing subdatasets, e.g. the DataLad machine learning books dataset, or talks/posters the DataLad team created about DataLad (`install`), and also by instructing users to create and populate their own datasets, e.g. either with books they have on their hard drive, or by pointing them to generally useful, free books (e.g. introductions to Unix, Git, ... maybe as an inspiration to read those, if they want) to download and save (`add`, `save`). Any sort of text file could be used to have them take some notes ("Create a / Add to the `mynotes.txt` file, and write a 1-2 sentence reminder for yourself about the last datalad commands" in every section) and save their changes regularly (`status`, `diff`). The more advanced chapters can include simple Python scripts (to introduce readers to the Python API; those could be created by developing them in the book and having readers copy them, or they could be a Git repository somewhere) and simple datasets (maybe we can turn one of the classical data science datasets into a DataLad dataset, e.g. the Iris dataset). This could be a "final project" in the narrative (without any reader having to invest time to code). On those, a `datalad run` and `rerun` would be easy to demonstrate. Alternatively, a `datalad run` can also be demonstrated by renaming files. Subdatasets (e.g. a final project dataset) could be published to GitHub or similar third-party infrastructure (if we keep results small, this should work very easily, and we could demonstrate data retrieval as we did with REMoDNaV).
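As a rough sketch of the early chapters (dataset and file names here are made up for illustration; the `-c text2git` configuration is optional and just keeps small text files in Git rather than the annex):

```shell
# create a new dataset for the course notes
datalad create -c text2git DataLad-101
cd DataLad-101

# take a note and record the change in the dataset's history
echo "datalad create builds a new, empty dataset" > mynotes.txt
datalad save -m "add a first note to mynotes.txt"

# inspect the current state of the dataset
datalad status
datalad diff
```

This keeps every command in the context of the running narrative instead of demonstrating it in isolation.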
I'm thinking of a repo structure a bit like this:
```
DataLad-Basics-101
|__ books/
|   |__ machine-learning-books   (from datalad)
|   |__ more-books     # maybe useful Unix resources? Data Science w/ Unix, Pro Git, ...?
|__ homework/
|   |__ midterm
|   |__ final
|   |__ code/          # this could be pre-written, simple code to be installed from somewhere
|   |__ data/          # very simple data science dataset
|   |__ results/       # to be populated with a datalad run; maybe .tsv, .png, .svg, ... (i.e. many different) file types as output
|__ mynotes.txt
|__ slides_and_presentations/
    |__ DataLad_poster_yoda
    |__ DataLad_slides
```
In the book, we can recreate this structure step by step together with readers, but maybe we can also have a final, picture-perfect dataset in our datalad-handbook organization for people who don't want to follow along but would like to explore.
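A sketch of how such a structure could be bootstrapped (dataset names are taken from the tree above; `<ML-BOOKS-URL>` is a placeholder, not the real location of the machine learning books dataset):

```shell
# create the superdataset
datalad create DataLad-Basics-101
cd DataLad-Basics-101

# create subdatasets registered in the superdataset
datalad create -d . books
datalad create -d . homework

# install an existing dataset as a subdataset of books/
datalad install -d . -s <ML-BOOKS-URL> books/machine-learning-books
```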
I like this idea! A student experience/workflow should be easy to relate to for any prospective DataLad user. At the same time it puts almost no constraints on the specific tasks/scenarios.
I guess we should try to draft a list of the functionality we want to demo, and figure out how to approach them best and in which order.
Glad to hear that you like it.
Here is a table that we can fill in:
Datalad command | Demo on | Order |
---|---|---|
`create` | | early |
`install` | | early |
`get` | | |
`add` (?) | | |
`save` | | |
`status` | | |
`diff` | | |
`run` | | late |
`rerun` | | late |
`publish` | | late |
My current unordered thoughts are:

- `create` an empty (super)dataset.
- `install` one (my suggestion would be the machine learning dataset from the datalad dataset collection) and create subdatasets from scratch.
- `add`/`save` data to the repositories on different levels. This is twofold: `save` the changes (in the `books/` subdataset?) and in the superdataset (`datalad status`).
- `get` data (from installed datasets), and maybe talk about some of the underlying principles.
- `datalad run` (and potentially `rerun`). This can be the point in time to talk about content being locked. Could be incorporated as a "midterm project".
- `update` (?).
- Common problems, e.g. a changed `.gitmodule`, running two `datalad run`s on the same dataset simultaneously, or not being able to share small file content on GitHub because it got accidentally annexed. All of that is very specific and requires a lot of background knowledge, but if it happens to people, they should have a way to help themselves. Maybe there are also clear DON'Ts (I don't know about any apart from "don't run two datalad runs simultaneously on the same dataset", but maybe there is something a regular person would not think about, such as spaces in directory names...).

Noting a few take-aways of a discussion with @loj:
General points:

- Repetition of commands (e.g. `datalad status` usage everywhere possible) and conflation of content (e.g. having commands not as stand-alones, but always in conjunction with others for complete workflows).

Order of commands and content associated with it:

- People could just `datalad install` content, but we want to motivate them to actually use it. Therefore, we start with `datalad create`, adding and modifying content, and `datalad save`-ing such modifications. For this, in the narrative, we will have people add slides and notes to their superdataset.
- Then we introduce `datalad install` as a different way to end up with a dataset, and attach a section on `datalad get` and the necessary background to it. For this, we want to use the datalad machine learning books dataset.
- `datalad status`, `datalad diff`, best practices for commit messages, and so forth.
- Once we have a Python script, we use it to demonstrate `datalad run` and `datalad rerun`. We will show concepts of content being locked by git-annex.

Misc:
Sounds great. I particularly appreciate the decision to prioritize creation over consumption.
Few comments:
> From a local workflow, we want to go to publish and hence the start of collaborative workflows. In the narrative, we'll package that as a student we want to share our notes with.
While the associated commands work, considerable work is necessary before they are in the same shape as the core family of commands. That isn't necessarily a showstopper, but it is worth mentioning.
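For illustration, publishing a small (sub)dataset could look roughly like this (the repository name is made up, and `create-sibling-github` requires GitHub credentials to be configured):

```shell
# create an empty repository on GitHub, registered as a sibling named "github"
datalad create-sibling-github final-project

# push the dataset's history (and, where possible, annexed content) to it
datalad publish --to github
```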
> Once we have a python script, we use it to demonstrate datalad run and datalad rerun. We will show concepts of content being locked by git-annex.
The issue with locked files, symlinks, etc. is a critical usability aspect from my POV. Of all things, this is likely the one that is least familiar and intuitive for people who don't breathe UNIX, and a great source of confusion. I have no immediate idea on how to address that best, but I feel this should at least be superficially mentioned in the "dataset basics". Maybe even by just explicitly describing the default look-and-feel of a dataset with symlinks and locked files, while simultaneously acknowledging the implications (e.g. tools like AFNI go insane) and pointing to possible alternatives (annex v7 mode, adjusted/unlocked branches). I can help with that, once you have determined what examples you want to show.
> The issue with locked files, symlinks, etc. is a critical usability aspect from my POV. Of all things, this is likely the one that is least familiar and intuitive for people who don't breathe UNIX, and a great source of confusion. I have no immediate idea on how to address that best, but I feel this should at least be superficially mentioned in the "dataset basics". Maybe even by just explicitly describing the default look-and-feel of a dataset with symlinks and locked files, while simultaneously acknowledging the implications (e.g. tools like AFNI go insane) and pointing to possible alternatives (annex v7 mode, adjusted/unlocked branches). I can help with that, once you have determined what examples you want to show.
We agree that it's a great source of confusion, and we're also uncertain how to address it. Thanks for the suggestion of describing the default look - I will try, and we can enhance what I come up with together. I will certainly need help, but will speak up once it becomes necessary.
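One way of describing that default look-and-feel, sketched with a hypothetical annexed file (`books/TLCL.pdf` is made up for illustration):

```shell
# an annexed file is a symlink into the object tree under .git/annex/
ls -l books/TLCL.pdf
# books/TLCL.pdf -> ../.git/annex/objects/...

# the annexed content is write-protected ("locked");
# to modify it, unlock it first, then save the change
datalad unlock books/TLCL.pdf
# ...edit the file...
datalad save -m "modified the book"
```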
The DataLad-101 example dataset has shown that it's well-suited for what we have so far, and this issue served as a good basis. Most of the initial brainstorming is either already implemented or outdated - I think we can close this issue.
I want to make note of an idea @mih brought up and continue the discussion about it:
Supply a toy dataset that readers can install and learn with, together with book sections that follow a narrative based on this dataset.
There are some requirements:

- It needs to be small. Although we only `get` single files in tutorial snippets, I wouldn't want to pollute readers' file systems with GBs of data should they accidentally do a `datalad get .`.
- There needs to be a place to `install` it from (tbh, I personally don't know how to publish a dataset in such a way that the data is accessible to everyone, but I would like to know how. Once we have a narrative and content, maybe @mih could show us in person how to do that).
- We need to be able to demonstrate a `datalad run` on the dataset, also in a way that does not appear to be a completely random action. Content-wise it could be something simple to show principles of using and unlocking content, like maybe renaming files with a shell or Python script to showcase how one can change existing files (with `--input` and `--output` flags). We should also have a `datalad run` example that creates a completely new file.

Having such a dataset plus the narrative will make progress on command and workflow explanations much easier, I believe. One idea @loj and @mih proposed was a music library. This has the great advantage of easy, almost domain-agnostic narratives, and I think the requirements I came up with could be fulfilled with it. Does anyone have additional thoughts on this idea in general, other requirements for such a dataset, or different content/narrative ideas?
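The two kinds of `datalad run` examples mentioned above could be sketched like this (file names are illustrative):

```shell
# 1) modify an existing file: declared outputs get unlocked before execution
datalad run -m "rename the file" \
  --input "oldname.txt" --output "newname.txt" \
  "mv oldname.txt newname.txt"

# 2) create a completely new file from scratch
datalad run -m "build a file inventory" "ls > inventory.txt"

# re-execute the command recorded in the most recent run commit
datalad rerun
```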