ReproNim / OHBMEducation-2022

Repo for the OHBM Education 1/2 day course entitled: "How to Write a Re-Executable Publication"
http://www.repronim.org/OHBMEducation-2022/
Creative Commons Zero v1.0 Universal

Notes about DataLad usage details #1

Closed (jsheunis closed this issue 2 years ago)

jsheunis commented 2 years ago

@dnkennedy thanks for the summary of the course in the README. I have a couple of notes and questions about the various steps where DataLad will be applicable:

  1. The intro to DataLad:
    • I'll cover the basics of DataLad, including the topics you listed (installing data, running containers, publishing results)
    • should the link to the video be the same link that will be the one accessible via OHBM's virtual platform? (that's assuming we all have to upload our 15min talks to OHBM's platform?). Or should I just post the video on youtube and give that link? I'd prefer the latter, but happy with whatever you suggest.
  2. Installing data: the current standard practice with DataLad is to use clone, i.e. "cloning data". Where we previously used datalad install -r to install all subdatasets recursively, we now suggest first cloning the dataset and then running get with the -n (--no-data) and -r (--recursive) flags:
    datalad clone <dataset_url> my_dataset
    cd my_dataset
    datalad get -n -r .
  3. Publish data: since publish is deprecated, can I suggest that we use push together with the create-sibling[-*] functionality? (A rough sketch of that workflow follows below.) Have you already decided which service we'll use to push the output data/results to?
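
For concreteness, the push-based workflow could look something like this, using GitHub purely as an example target; the repository name and sibling name are made up here, and the actual service is still to be decided:

    # create an empty repository on the hosting service and register it as a sibling
    # (this needs credentials / a personal access token for the service)
    datalad create-sibling-github ohbm-results -s github
    # publish the dataset history, plus whatever annexed content the sibling can hold
    datalad push --to github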
dnkennedy commented 2 years ago

Hi, thanks for the questions/comments.

For now, the 'video' to which I refer in the main README is a placeholder, and ultimately that pointer can point wherever you prefer. Regarding the videos, each of us creates one, and we upload it to the OHBM site. But since they are each 'our' own creations, I think (though I should double-check the policy) we can also use those videos as we want, e.g. post them to our favorite YouTube locations. So there may indeed be (at least) two places that these videos get put, and where this README points, as I mentioned, is up to us. I agree that pointing to a more publicly accessible version, such as the one on YouTube, would be preferable in some ways for the long-term utility of this repo.

Yes, your assumption is correct that each of these videos also needs to go to the OHBM platform. Have you received instructions from OHBM and/or Fourwaves? I haven't, actually, but there may not have been a 'slot' for the organizers, just for the 'presenters'; I'm trying to resolve that at the moment. All presenters, let me know if you have OHBM upload instructions.

Re installing data: the evolving workflow in the Exercise README was guided by @yarikoptic and featured the 'datalad install' commands in a YODA-style design. The nuance of 'clone' versus 'install' is over my head, and I'll leave the decisions about what to do and how best to do it to your collective advice. Y'all know what we're trying to demonstrate; feel free to update my hacky way of doing this to something better.

I was using 'publish' more conceptually than as a specific command. I would value guidance on how to approach the details of this; I've just not yet personally gotten to that step to muddle through it... I was indeed expecting something 'create-sibling'-ish, and had not decided on a service (GIN appears in some other exercises I've been through). I just don't want the students to have to spend too much time struggling through authentication details, which, while important, would be a distraction from the mission...

jsheunis commented 2 years ago

Thanks for the info @dnkennedy !

> Re installing data: the evolving workflow in the Exercise README was guided by @yarikoptic and featured the 'datalad install' commands in a YODA-style design. The nuance of 'clone' versus 'install' is over my head, and I'll leave the decisions about what to do and how best to do it to your collective advice. Y'all know what we're trying to demonstrate; feel free to update my hacky way of doing this to something better.

Sounds good, thanks 👍

> I was using 'publish' more conceptually than as a specific command. I would value guidance on how to approach the details of this; I've just not yet personally gotten to that step to muddle through it... I was indeed expecting something 'create-sibling'-ish, and had not decided on a service (GIN appears in some other exercises I've been through). I just don't want the students to have to spend too much time struggling through authentication details, which, while important, would be a distraction from the mission...

OK. I think GIN is perhaps the lesser evil in this case because it's free and open, even if students will have to configure SSH keys. I'm guessing the output from the containerized workflow will be large in terms of file size? That would make publishing to a GitHub sibling difficult, although a GitHub sibling would probably be the easiest alternative if the workflow outputs were small enough and/or text-based.

For the datalad-based RDM training workshops that we've been running lately (also using JupyterHub), we decided on GIN for these (and probably other) reasons. Here's a detailed walk-through of that content (which we can repurpose for this educational session if needed): https://psychoinformatics-de.github.io/rdm-course/03-remote-collaboration/index.html#publishing-datasets-to-gin
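
The GIN portion of that walk-through boils down to roughly the following, assuming each student has created an empty repository on gin.g-node.org and uploaded an SSH key; the user, repository, and sibling names here are placeholders:

    # register the empty GIN repository as a sibling named 'gin'
    datalad siblings add --dataset . --name gin --url git@gin.g-node.org:/<user>/<repository>.git
    # publish both the dataset history and the annexed file content
    datalad push --to gin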

yarikoptic commented 2 years ago

In the hope of reducing confusion among students about "why do I fork/clone on/from GitHub but upload to GIN", you could avoid GIN, or add it only at the end as a "you can distribute storage to multiple locations" point, by using GitHub's LFS instead, per http://handbook.datalad.org/en/latest/basics/101-139-gitlfs.html. A figure showing the flow:

          master                     master+                   master++                          master++
          git-annex                  git-annex+               git-annex++                        git-annex++
GitHub: OpenNeuroDatasets --(fork)--> /ReproNim  --(fork)--> /{personal}   LFS: {annex objects}  Gin: /{personal} {annex objects}

with arrows below to a local clone, depicting the flow to/from the locations above, which could help establish a mental picture for students. The +'s mean that there are some extra commits on top of the previous state of the branch. The figure also illustrates the importance of having both branches (master and git-annex), depicting 1. the version of the data (master); 2. information about data availability (git-annex).
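
Following the handbook chapter linked above, the LFS setup would look roughly like this; the repository, user, and remote names are placeholders, and the exact options should be double-checked against the handbook:

    # sibling on GitHub that will hold the Git branches (master and git-annex)
    datalad create-sibling-github ohbm-results -s github
    # git-annex special remote that stores annexed objects as LFS objects in the same GitHub repo
    git annex initremote github-lfs type=git-lfs url=https://github.com/<user>/ohbm-results encryption=none embedcreds=no autoenable=true
    # make pushes to the GitHub sibling deposit annexed content in LFS first
    datalad siblings configure -s github --publish-depends github-lfs
    # publish branches and data in one go
    datalad push --to github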

jsheunis commented 2 years ago

I like the idea of git LFS, thanks @yarikoptic!

@dnkennedy do you know more or less what the size of the combined output from the run operation will be? AFAICT, LFS has a storage limit of 1 GB for the free tier. Hopefully the outputs are less than that?

dnkennedy commented 2 years ago

OK, I took a shot at the 'git lfs' version. The 'token' that is necessary is the same one needed for 'push' to work anyway, so we have to work through that regardless.

dnkennedy commented 2 years ago

I'm going to close this issue. We 'think' git lfs will work, but there is still some whining (by me) going on over at #10.