datalad / datalad-ria

Adds functionality for RIA stores to DataLad
http://datalad.org

Documentation on RIA store integrity checking #36

Open mih opened 2 years ago

mih commented 2 years ago

It seems intuitive to run git annex fsck on the bare repo inside a RIA store. But this could/would (always @bpoldrack ?) corrupt the file availability records in the annex branch, because git-annex would treat the RIA git-repo location as a part of the git-annex clone network. However, the associated storage remote already points to this location.

It is unclear to me why the git-repo config is not set to annex.ignore in the RIA store (or is it?).

bpoldrack commented 2 years ago

> But this could/would (always @bpoldrack ?) corrupt the file availability records in the annex branch, because git-annex would treat the RIA git-repo location as a part of the git-annex clone network.

Yes, that's because the annex object tree sits in the same location where git-annex would expect it in a regular, bare git-annex repository (annex/objects). However, this object tree is unknown to the bare, plain-git repository and is accessed via a special remote instead. Obviously this is not an issue when there is no annexed content in the store (create-sibling-ria can set up a sibling without any special remote and object tree). So, no, not always, but almost always.

The trouble kicks in when running a git-annex command makes annex discover a trace of a seemingly broken annex repo. It will then run git-annex-init on what is supposed to be a plain git repo, indexing the object tree and hence recording availability under that repo's own UUID, when in fact the location is identical to that of the special remote (which annex cannot possibly realize).
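One way to notice that such an accidental init has happened is to look for an annex.uuid entry in the bare repo's config. The following is only a minimal sketch: the function name is hypothetical, and it assumes (based on the description above) that a dataset repo in the store is meant to stay a plain git repo and therefore normally has no annex.uuid set. It relies only on `git config --file <path> --get <key>`, which exits non-zero when the key is absent.

```python
import subprocess
from pathlib import Path


def annex_uuid_of_bare_repo(repo_path, run=subprocess.run):
    """Return the annex.uuid found in a bare repo's config, or None.

    Hypothetical helper, not part of DataLad or git-annex.
    `run` is injectable so the logic can be exercised without git.
    """
    # In a bare repository the config file sits directly in the repo directory.
    config_file = Path(repo_path) / "config"
    cmd = ["git", "config", "--file", str(config_file), "--get", "annex.uuid"]
    result = run(cmd, capture_output=True, text=True)
    # `git config --get` exits non-zero when the key is not set.
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

A non-None result for a dataset in the store would hint that git-annex has initialized the location at some point; this only reads the config and changes nothing.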

It is doable to let annex deal with this correctly via sameas, but this isn't universal either. One of the reasons the object tree is decoupled from the bare repo is that we allow either dirhashmixed or dirhashlower for its layout. Mixed is necessary for ephemeral clones that symlink into the store (a much-desired feature that led to the idea of RIA stores in the first place), and it is DataLad's default. But annex would expect a bare repository to always use dirhashlower. One can have that (layout version 1 for the datasets), but it breaks ephemeral clones (we should probably have a safeguard in the respective clone routine to not symlink into a dirhashlower layout).
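To make the layout difference more concrete, here is a rough sketch of the dirhashlower scheme as described in the git-annex internals documentation: the first six hex characters of the key's MD5 digest, split into two three-character directories. The function names and the example key are made up for illustration, and the `<key>/<key>` nesting under annex/objects is an assumption about how a bare git-annex repository arranges objects. dirhashmixed uses a different encoding and is not shown.

```python
import hashlib


def dirhash_lower(key: str) -> str:
    """Two lower-case directory levels derived from the MD5 of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # First three hex characters, then the next three.
    return f"{digest[:3]}/{digest[3:6]}"


def object_path_lower(key: str) -> str:
    """Assumed relative location of a key's content in a dirhashlower layout."""
    return f"annex/objects/{dirhash_lower(key)}/{key}/{key}"


if __name__ == "__main__":
    # Hypothetical key, used only to show the shape of the resulting path.
    print(object_path_lower("MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.txt"))
```

In practice, asking git-annex itself for the location of a key is safer than recomputing hash directories by hand; the sketch only illustrates why a tree laid out with dirhashmixed does not line up with what annex expects from a bare repository.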

So, to keep a git annex fsck run in-store from accidentally messing up the availability info, I'd suggest either:

> It is unclear to me why the git-repo config is not set to annex.ignore in the RIA store (or is it?).

I forgot. Will double-check.