datalad / datalad-ria

Adds functionality for RIA stores to DataLad
http://datalad.org

Switch/abandon ORA abstraction paradigm #30

Open mih opened 2 years ago

mih commented 2 years ago

The ORA remote uses an internal IO abstraction that aims to make handling uniform across protocols (file://, ssh://, http(s)://) while everything is going through a single special remote implementation.

This sounds nice on paper, but it creates the complex problem of supporting a uniform set of operations, and using these exact same operations, across all implementations. The present implementation fails to deliver on this promise.

I'd argue that a simpler system can be implemented that is more in line with the paradigm preferred by git-annex. Rather than having a single complex beast, let's have the individual pieces implemented properly (one protocol per implementation). Rather than supporting push/pull URL combinations in a single remote, let's use two remotes in such cases (with --sameas): one for pull, and possibly another one, or none at all, for push. Rather than fiddling with the internal parameterization of a single special remote type, let's switch externaltype= when a reconfig is required.
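For illustration, a rough sketch of what that could look like in plain git-annex terms. The remote names, externaltype values, and URLs below are made up; only the pattern matters (one special remote per protocol, tied together with --sameas so they share the same UUID and content tracking):

# Pull access over HTTP, via a hypothetical protocol-specific external remote type
git annex initremote store-http type=external externaltype=ora-http encryption=none url=https://store.example.org/mystore

# Push access over SSH, registered as the same logical remote via --sameas
git annex initremote store-ssh --sameas=store-http type=external externaltype=ora-ssh url=ssh://store.example.org/data/mystore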

This will make the code base simpler, easier to maintain, and most importantly enable 3rd-party extensions without having to touch -core code.

bpoldrack commented 2 years ago

I see two things to think about:

1.)

Rather than fiddling with the internal parameterization of a single special remote type, let's switch externaltype= when a reconfig is required.

This comes with the implication that we can't have a local reconfig (which we recently introduced), since externaltype needs to be committed as far as I'm aware. Protocol switching is an inherently local thing, though. Hence, if we switch to that approach, we either go back to committed reconfig ping-pong or drop local reconfiguration entirely, meaning one would end up going over HTTP even when on the same file system ...

2.) Just generally: Special remotes are committed, therefore we need a "backwards compatibility shim" of sorts. This would need to be a layer that is actually still a special remote (ORA) and then "redirects" to different, protocol-specific implementations (classes) based on its config. But then we have built the thing I wanted (that complex beast) anyway, and the question would be: what do we need the different special remotes for?

mih commented 2 years ago

This comes with the implication that we can't have a local reconfig (which we recently introduced), since externaltype needs to be committed as far as I'm aware. Protocol switching is an inherently local thing, though. Hence, if we switch to that approach, we either go back to committed reconfig ping-pong or drop local reconfiguration entirely, meaning one would end up going over HTTP even when on the same file system ...

I could not come up with a use case that would require a local reconfiguration. AFAIR all such scenarios lead to problems down the line. Most, if not all, consumption scenarios are fully addressed via https://github.com/datalad/datalad/issues/5835 (stale), in which committing a local reconfiguration is not an issue.

Just generally: Special remotes are committed, therefore we need a "backwards compatibility shim" of sorts. This would need to be a layer that is actually still a special remote (ORA) and then "redirects" to different, protocol-specific implementations (classes) based on its config. But then we have built the thing I wanted (that complex beast) anyway, and the question would be: what do we need the different special remotes for?

I don't understand what you are saying. The current system can stay in place forever. If it works for people with its limitations, nothing needs to be done on their end. And there are no redirections needed.

bpoldrack commented 2 years ago

I could not come up with a use case that would require a local reconfiguration. AFAIR all such scenarios lead to problems down the line.

I think we had a bunch of cases where one would want to have a local clone from a store that is also served over HTTP/SSH. Operations on such a local clone would ideally not go via HTTP/SSH. All the issues with that which I remember were that either we couldn't detect whether this was needed, or that the reconfiguration was committed. Both led to changes, and with local reconfiguration in particular a lot of the trouble in that regard should be addressed. In a scenario where the store is served via HTTP and some people also need local clones, I would assume that the latter are more likely not to be pure consumption scenarios.

I don't understand what you are saying. The current system can stay in place forever.

Yes, that's what I am saying. It needs to stay in some shape. But it would ideally try to share code with the new special remotes you aim for, rather than us having two implementations, I think. Hence, it seems to me that it would evolve into the very thing this approach tries to avoid.

Anyway, that's not a fundamental objection. Maybe it helps getting there. However, the more special remote types there are (and are part of existing datasets), the more we need to maintain.

If we figure out a way to have a proper RIA abstraction along the way that can be used with pretty much any (special) remote, that would be cool nevertheless.

mih commented 2 years ago

I think we had a bunch of cases where one would want to have a local clone from a store that is also served over HTTP/SSH. Operations on such a local clone would ideally not go via HTTP/SSH. All the issues with that which I remember were that either we couldn't detect whether this was needed, or that the reconfiguration was committed. Both led to changes, and with local reconfiguration in particular a lot of the trouble in that regard should be addressed. In a scenario where the store is served via HTTP and some people also need local clones, I would assume that the latter are more likely not to be pure consumption scenarios.

Can you describe a concrete case, where this is desired, and that is not a plain consumption (read-only) case?

bpoldrack commented 2 years ago

Can you describe a concrete case, where this is desired, and that is not a plain consumption (read-only) case?

RIA store, which for consumption is set up to be served over HTTP. A dataset maintainer/curator is making updates to datasets in that store. For large data additions a local clone is desirable, because an excellent network connection allows quickly downloading lots of content into a local clone, committing, and pushing to the store locally, rather than downloading elsewhere and pushing over SSH. However, that machine is not meant for computation, so other users need to push their results from other machines over SSH. Without reconfiguration this can only be captured by different datasets, I think. It might be advisable to separate such datasets anyway, but in the general case we can't really know to what extent, say, preprocessed data needs to be mixed in that sense (downloads + results). Furthermore, smaller fixes, deletions, etc. could be done by the curator from a different machine (doing the work offline), to later push over SSH. Arguably, the latter is "only" convenience, but depending on work/network conditions (think of the rise in remote work), that convenience may be quite desirable.
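To make that concrete, roughly this kind of workflow (paths, dataset ID, and the name of the storage sibling are placeholders; the actual sibling name depends on how the store siblings were created):

# Curator on the well-connected machine: clone from the store on the local
# file system, add large content, and push it back into the store locally
datalad clone 'ria+file:///data/store#<dataset-id>' ds
cd ds
# ... download / add large files here ...
datalad save -m "add raw data"
datalad push --to <storage-sibling>

# Other users on compute machines: same store, but accessed over SSH
datalad clone 'ria+ssh://store.example.org/data/store#<dataset-id>' ds
datalad push --to <storage-sibling>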

Am I making sense?

bpoldrack commented 2 years ago

However, this business might be addressable by having yet another sameas remote (+ proper costs, obviously) rather than by reconfiguration. Maybe that's the better way.
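A rough sketch of that idea, assuming an HTTP-based remote already exists; names, the type placeholder, and cost values are illustrative (a lower annex-cost makes git-annex prefer that remote):

# second remote for the same store, initialized only locally (--private),
# sharing the UUID of the existing HTTP remote via --sameas
git annex initremote store-local --private --sameas=store-http type=<local-type> url=file:///data/store

# prefer the local remote over the HTTP one
git config remote.store-local.annex-cost 100
git config remote.store-http.annex-cost 200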

mih commented 2 years ago

I think the scenario you describe should be covered by the normal "ephemeral clone" setup, which can directly interface any store on the local machine. All file content is directly available.
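For reference, that setup is roughly the following (path and dataset ID are placeholders; this assumes the store is reachable on the local file system):

# throwaway clone that interfaces the store directly, so file content is
# available without copying it over HTTP/SSH
datalad clone --reckless ephemeral 'ria+file:///data/store#<dataset-id>' ds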

The only case not covered is a store that hosts file content in 7z archives. So taken together, it would not cover the use case of having to modify existing file content in a dataset, kept in a store with archive.7z, and push the modified content back to that store.

That seems like a corner case. If there was an archive.7z before, there will likely have to be one after the update too. And if so, a push alone doesn't give that. Instead, the archive file needs to be updated by a manual process outside the special remote universe.

mih commented 2 years ago

I just came across the need to turn a provided special remote configuration into a working one (the configured URL was not accessible to me, but the location was accessible via another channel).

❯ git annex initremote mynewname --private --sameas=abda3a9a-8581-4c60-9f27-b6264fa8c0b1 type=<type> <whatever-is-different-too> [exporttree=yes]

Has worked great. It promises to create a new special remote that is not shared and points to the original source.

Worth noting that one can override type and all parameters that need changing, as expected. However, it does not inherit exporttree=yes -- which I found confusing initially, but ultimately it also made sense.