datalad-handbook / book

Sources for the DataLad handbook
http://handbook.datalad.org

Needed: A section on working in shared repositories/datasets #575

Open adswa opened 4 years ago

adswa commented 4 years ago

This topic may warrant a standalone section or chapter ("Collaboration - advanced"?). I'm collecting a few things I find to be noteworthy.

I've stumbled across the following use case: I took over a project started by someone else; we're in the same Unix group. I force-created a dataset and wanted to snapshot the state of the project the other person left behind. All of the files have the following permissions:

```
$ ls -l somedata
-rwxrwxrwx 1 <not-me> <group-I-am-in> 755 Jul 13  2018 somedata
```

Saving this data fails with permission-denied errors, because git-annex can't chmod files I don't own unless it's a shared repository:

```
[ERROR  ]   <somedata>: setFileMode: permission denied (Operation not permitted) [add(/somedata)]
add(error): somedata (file) [ somedata: setFileMode: permission denied (Operation not permitted)]
```

In this use case, a manual `git config --local core.sharedRepository group` setting is needed.
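The fix, sketched as commands to be run inside the dataset (the commit message is just an example, and the `datalad save` step assumes DataLad is installed):

```shell
# Mark the existing repository as group-shared so git-annex is allowed to
# operate on group-owned files that the current user does not own:
git config --local core.sharedRepository group

# Retry the snapshot (message is illustrative):
datalad save -m "snapshot project state as taken over"
```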

adswa commented 4 years ago

Noting a previous issue (#550): shared dataset clones are not yet documented in the handbook, but are useful. Quoting @mih:

> aqw alerted me to people working on shared dataset clones. I haven't checked that handbook, but I cannot remember having seen a section on this. The key is to signal this intent on create or on clone. I am sure we can do the former --shared group, not sure about the latter. Just registering this demand for info at this point.

I would suggest this as a section in the upcoming chapter on HPC computing (#547).
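Signaling the intent at create time, as suggested in the quote, could look roughly like this (directory names are hypothetical, and the exact option spelling should be checked against `datalad create --help`):

```shell
# Create a dataset meant for group collaboration; per the quote above,
# the shared intent is signaled at creation time:
datalad create --shared group shared-dataset

# The plain-Git mechanism this corresponds to:
git init --shared=group plain-shared-repo
```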

jcolomb commented 2 years ago

I would also vote for a "collaboration - newbie" section. It is quite difficult to get started when working with different people on a repository: some might use the browser version to add files, some may change only one submodule. It is also not straightforward when a single user works from different computers (collaborating with yourself)...

Here are some workflows I arrived at when trying it out. I am not sure this is the most straightforward way to work, and it would need some extra information here and there, especially about the use of -r, which I have not tested completely yet. It would also be nice to explain why the default actions often do not work (datalad update only fetches changes; it would be nice to explain what that means and why it is the default behaviour: I have been using GitHub for years and always pull, never fetch...).

I would be happy to help write this.

To install a local copy of a repo with submodules:

```
datalad clone XXX
datalad get -n -r .
```

Pushing changes when someone has made changes using a different tool/computer (no submodules involved, no conflicts involved):

```
datalad update --merge
datalad save
datalad push --to origin
# note: --to origin is only needed once; use 'datalad siblings' to get other server addresses
```

adswa commented 2 years ago

Cheers, and thanks for getting in touch.

> I would also vote for a "collaboration - newbie" section. It is quite difficult to get started when working with different people on a repository: some might use the browser version to add files, some may change only one submodule. It is also not straightforward when a single user works from different computers (collaborating with yourself)...

I can definitely see a need for what you are describing here. While there are some things that I would regard as being covered, some things certainly aren't yet.

What is currently covered are a few basic, solely DataLad-based collaboration scenarios. One is the entire Collaboration chapter (chapter 4). It steps through the clone, make-changes-and-save, update --merge ensemble within the narrative, plus a lot of extra background information, such as "how do subdatasets/submodules look when cloned", "how can I use someone else's run records", and "how does git-annex record file availability with two dataset clones?". The other is the "collaborative data management routine" use case. That one leaves out the background information and goes through a clone, make-changes-and-save, update --merge or push routine quite fast. Both were quite easy to write because they rely on the simplifying assumption that all collaborators have access and similar permissions, leave out specific services, and are only DataLad-based.

There is still a big remainder of collaboration use cases left untouched. I can see some of them reflected in your description. I'll brainstorm a bit on what would be useful, and I would appreciate your opinion on whether one or more of these match what you describe, or if there is other stuff you think would be helpful. Just to let you know upfront: I'll later need to split everything that comes out of this into individual issues, because it likely exceeds the scope of this one.

1) Collaboration via pull requests and repository-hosting services (GitHub, Gin, GitLab, ...).

We have so far only integrated this topic into interactive workshops, e.g. https://psychoinformatics-de.github.io/rdm-course/03-remote-collaboration/index.html and https://youtu.be/3ePgH-kK8h8?t=1599, but I can see it becoming a new, stand-alone chapter in the Basics part of the book, best placed between "Third party infrastructure" on publishing (currently chapter 8) and "Help yourself" (currently 9). This chapter would need a mix of general information and platform-specific walk-throughs.

The general part would need to include Git concepts such as branches, forks, and best practices for proposing changes, as well as repository-hosting concepts such as pull/merge requests (a Findoutmore explaining those names and how they differ between hosting services would be helpful for beginners), and a quick recap of DataLad-related concepts such as the difference between content in Git and in git-annex. Maybe also some general information on major platforms, as done in the Third party infrastructure chapter (http://handbook.datalad.org/en/latest/basics/101-139-hostingservices.html).

The platform-specific walk-throughs should show, ideally with screenshots, how to create, find, and act on pull/merge requests on different platforms. This can be simple (when there's only content in Git) or complex (when annexed data is modified but we're using a not-so-straightforward special remote for storing the data). This overlaps with #674. There are some pre-existing materials (e.g., a wrapper for OSF for template flow, and how to do a PR with annexed data on Gin), but there are certainly more use cases to cover than there is time to write; nevertheless, better to have something than nothing. Having material for GitHub and Gin would probably be a good start. The specific examples you gave (making a change in the web browser versus locally, how to update a submodule) can become Findoutmores or subsections, as applicable.
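One of the specifics mentioned above, updating a submodule/subdataset, could be sketched roughly like this (the path and commit message are hypothetical placeholders):

```shell
# Bring a single subdataset up to date with its own siblings:
datalad update --merge -d path/to/subdataset

# Record the new subdataset state in the superdataset:
datalad save -d . -m "Record updated subdataset state" path/to/subdataset
```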

2) Collaboration - Git and/or/with DataLad

This collaboration case wouldn't be basic, because it more or less requires pre-existing knowledge about Git, but for people who are used to working with Git I can see a section on "What's the difference between collaborating with DataLad versus with Git" / "How can I add DataLad to a Git-based collaboration workflow and vice versa" / "What's easier or different with DataLad" as being helpful. It would incorporate a lot of information from this FOSDEM talk (sources) on the differences between the tools, https://www.youtube.com/watch?v=Yrg6DgOcbPE, and would probably either become a use case (with a warning that one needs to know Git) OR part of a new chapter in the Advanced section of the book (working title "DataLad and Git", which could also include the sections on fixing mistakes with Git commands and going back in time from "Help yourself").

3) Advanced collaboration

This is a chapter that would address this current issue, too. It would need to be in the advanced part of the book and deal with less-known collaboration scenarios, such as shared permissions in large projects (i.e., collaborating in the same dataset), collaboration without a repository-hosting service (e.g., a private Git server or bare repository), and collaboration with datasets in RIA stores.
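The hosting-free scenario in particular can be sketched with a group-shared bare repository as the central meeting point (the paths and the sibling name `shared` are hypothetical):

```shell
# A group-shared bare Git repository on shared storage acts as the central sibling:
git init --bare --shared=group /shared/storage/project.git

# Register it as a sibling of the local dataset and publish to it:
datalad siblings add -d . -s shared --url /shared/storage/project.git
datalad push --to shared
```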

As for the workflow you sketched out, I have a few brief comments and will just put them in between your code:

> To install a local copy of a repo with submodules:
>
> ```
> datalad clone XXX
> datalad get -n -r .
> ```

Unless one wants to make a change in a submodule, it is not necessary to get subdatasets recursively, but it is of course possible (only in the case of superdatasets with a large number of subdatasets would it be discouraged).

> Pushing changes when someone has made changes using a different tool/computer (no submodules involved, no conflicts involved):
>
> ```
> datalad update --merge
> datalad save
> datalad push --to origin
> ```

This last one depends on your sibling origin. E.g., if it's your own repo on GitHub, this will work; if it's someone else's repo on GitHub and you don't have permissions, it won't. It could be useful to include the standard "new branch" approach from Git here, just to document it with best practices.
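The "new branch" approach mentioned here, sketched with hypothetical branch, sibling, and message names:

```shell
# Work on a dedicated branch instead of the default one:
git checkout -b my-change

# Record the modifications and push the branch to your sibling;
# a pull request is then opened on the hosting service:
datalad save -m "Describe my change"
datalad push --to origin
```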

> note: --to origin is only needed once; use datalad siblings to get other server addresses

correct

> I would be happy to help write this.

This would be great. If we can carve out one or more use cases in this issue, I can sketch a skeleton with a few action items, and you could work on those you feel comfortable with, or help shape the general approach.

jcolomb commented 2 years ago

Thanks for the extensive reply. I am quite new to DataLad, and I get a bit scared by its complexity. I am starting to get the feeling that the tool was developed for a different type of workflow (huge datasets in several levels of submodules, working mostly on the local version of the data, working with repositories where I only have read access) than what I am used to, and therefore the default behavior of the commands surprises me, and finding the right information in the handbook takes some time.

I like the idea of working with use cases, as one can enter the DataLad world from different angles. I will try to look for basic Git/GitHub tutorials and adapt them for use with DataLad. I will also continue to dig into the handbook a bit more, and come back to you.

adswa commented 2 years ago

Sounds good. Do let me know if you have problems or questions, I think I could locate the right resources a bit faster. :)

> I am starting to get the feeling that the tool was developed for a different type of workflow (huge datasets in several levels of submodules, working mostly on the local version of the data, working with repositories where I only have read access) than what I am used to, and therefore the default behavior of the commands surprises me, and finding the right information in the handbook takes some time.

It would be great to hear where the default behavior of commands is surprising; issues here or in the main DataLad repo are appreciated. I hope this doesn't come across as "mansplainy"; I'm just adding a few bits and pieces in the hope that they help with what you were worrying about: when DataLad commands are unsatisfactory, plain Git or git-annex commands work out of the box, and it is completely okay to use them where they are applicable. While DataLad makes collaboration and version control possible at the largest scale (e.g., https://github.com/psychoinformatics-de/fairly-big-processing-workflow), I think the more common use case is small repositories. Here's one of ours, with intermittent collaborative development over ~7 years so far, sometimes solely with Git, or git-annex, or DataLad: https://github.com/psychoinformatics-de/studyforrest-data-annotations - maybe helpful, although pretty organically grown.
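To illustrate the point about plain Git working out of the box: a DataLad dataset is a regular (annex-enabled) Git repository, so everyday Git commands can be run in it unchanged, for example:

```shell
# Inspect a dataset with plain Git commands:
git status              # what changed since the last save?
git log --oneline -5    # recent history, including commits made by datalad save
git diff                # uncommitted modifications
```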

jcolomb commented 2 years ago

Hey, I am back to this... I was pleased to see that a repo obtained via gin get accepts DataLad commands, the same as one obtained with datalad clone. In both cases, though, none of the DataLad-specific content created by datalad create is present. Would one need to set DataLad's default behavior by hand, or is there a command to set up DataLad in an existing repo?

PS: I am doing this at https://gindata.biologie.hu-berlin.de/tonictests/test_1_3.main (which will soon disappear.)

Also taking notes via https://rpubs.com/j_colomb/883834