datalad / datalad-ria

Adds functionality for RIA stores to DataLad
http://datalad.org
Other
0 stars 1 forks source link

Verify that SSH-based STORE operations are done in an "atomic" fashion #102

Open mih opened 1 year ago

mih commented 1 year ago

The switch to uncurl-style URL handlers forces us to recheck everything. One critical piece is that uploads (also from multiple locations simultaenousely) do not got directly to a key location (to avoid the special remote reporting a key present, although it might take hours before the content is fully uploaded), and that temporary upload locations are connection-specific (in case the same key is uploaded from two or more sites at the same time).

christian-monch commented 1 year ago

For now, I see two principal options of implementing "atomic" STORE-operations (BTW: with "STORE-operations" we refer to UrlOperations.upload, don't we?):

  1. Add an atomic_upload operation to UrlOperations (or a flag to the existing upload arguments).
  2. Add synchronization classes that would allow to "lock" elements that are identified by path. An "empty" synchronization class could be used by default.

Both approaches have pros and cons, but I would prefer the route of option 1.

Pros and cons

####### Option 1:

Pros for option 1. are: completely encapsulated in the respective UrlOperations-class; can be totally class-specific; can be engineered to "cooperate" with other methods of the UrlOperations-class, e.g. with stat (e.g. by using a temporary file name, stat would not have to be changed to report only completely uploaded files); no further external configuration is required, e.g. no synchronization handler has to be picked.

Cons for option 1. are: the specific implementation might lead to duplicated implementation of "atomic"-operation functionality; the UrlOperations class might pick up a lot of code that is related to "atomic" implementations, and thus contain code for different aspects.

####### Option 2:

Pros for option 2. are: option 2 allows separation of aspects, here of synchronization, and data transfer. Synchronization could be implemented in base-classes, e.g. stat could check wether its url is locked. In principle it would be possible to mix and match different synchronization mechanisms with instances of an UrlOperations-class, although in practice that seems not necessary and there will usually be a single, most efficient way of performing synchronization.

Cons for option 2. are: synchronization is limited to using the primitives of the sychronization classes, i.e. it cannot take advantage of efficient and basically atomic operations like "moving a file from a temporary location to a final location". Instantiating an UrlOperation would require to select a proper synchronization class, which makes the use of UrlOperations more complex (one could provide a "default"-synchronization class, but the decision has to be made at some point). The main disadvantage might be that different ria-remote-configurations may use different synchronization endpoints and therefore have no synchronization at all.

######### "My" conclusion:

This is "my" conclusion in the sense that you may arrive at another conclusion: I think that the option of an atomic upload-operation on UrlOperations-classes is the better solution due to: a) the robustness of the solution (users cannot "kill" the synchronization), and b) the freedom of the implementation to use the most efficient mechanism.

It goes without saying, but we should try to separate the aspects of synchronization and data transport code-wise as much as possible (encapsulation, delegation, ..., and refactoring to keep code-parts that refer to different aspects separate).

######### Open question:

Does UrlOperations.download also need an "atomic" option?

mih commented 1 year ago

For now, I see two principal options of implementing "atomic" STORE-operations (BTW: with "STORE-operations" we refer to UrlOperations.upload, don't we?):

No, I mean the TRANSFER STORE special remote operation. This obviously involves an upload, but it the entire operation must be "atomic" in the sense that no other remote access (from a different machine), must consider an in-transit key as "present".

Does UrlOperations.download also need an "atomic" option?

I am not aware of such a need.

This is "my" conclusion in the sense that you may arrive at another conclusion: I think that the option of an atomic upload-operation on UrlOperations-classes is the better solution due to: a) the robustness of the solution (users cannot "kill" the synchronization), and b) the freedom of the implementation to use the most efficient mechanism.

This is also my conclusion, and also the behavior of the ora remote in datalad-core. The main advantage for "upload to TMP -> mv to TARGET) is that all "synchronization" cost is paid on STORE only, and RETRIEVE needs no additional test operations to be performed.

I am not certain whether UrlOperations.upload would benefit from an "atomic" mode, or whether it should always be in that mode.