datalad / datalad-ria

Adds functionality for RIA stores to DataLad
http://datalad.org
Other
0 stars 1 forks source link

Document the challenge to implement RIA store cleanups #104

Open mih opened 1 year ago

mih commented 1 year ago

It is not sufficient to just deal with the most basic key-operations in the ORA remote. At least when we (know the we) are operating with a filesystem on the remote.

We need to figure out, how we want to tackle this. Ideally without making unnecessary assumptions about the particular capabilties of effective URL handlers.

codecov-commenter commented 1 year ago

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (18e5e3a) 92.09% compared to head (bca2735) 91.97%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #104 +/- ## ========================================== - Coverage 92.09% 91.97% -0.13% ========================================== Files 17 17 Lines 405 411 +6 Branches 32 32 ========================================== + Hits 373 378 +5 - Misses 15 16 +1 Partials 17 17 ``` | [Files](https://app.codecov.io/gh/datalad/datalad-ria/pull/104?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datalad) | Coverage Δ | | |---|---|---| | [datalad\_ria/tests/test\_ora.py](https://app.codecov.io/gh/datalad/datalad-ria/pull/104?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datalad#diff-ZGF0YWxhZF9yaWEvdGVzdHMvdGVzdF9vcmEucHk=) | `100.00% <100.00%> (ø)` | | | [datalad\_ria/ora\_remote.py](https://app.codecov.io/gh/datalad/datalad-ria/pull/104?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datalad#diff-ZGF0YWxhZF9yaWEvb3JhX3JlbW90ZS5weQ==) | `73.52% <50.00%> (-1.48%)` | :arrow_down: |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

christian-monch commented 1 year ago

Thanks for the code, that is pretty much along the way what I did, but a lot more to the point. Wasn't sufficiently familiar with uncurl to do it as nicely.

As far as I can see, there is one thing missing. Unless I misinterpret the code in uncurl.py, there is no mechanism to ensure that checkpresent does not return True for a given key, until the key content is completely uploaded, i.e. transfer_store is done[^1]. That has to be ensured across different git-annex-remote-ora2 instances. In fact, it should be ensured in uncurl-instances as well. Currently checkpresent is based on URLOperations.stat. While some UrlOperation-instances might not report a file to be present during the upload of that file, others, e.g. FileUrlOperation-instances, might report it to be present during upload.

Synchronization has to be performed between different processes that are potentially executed on different hosts (although the latter is probably quite unlikely). A possible solution to this problem would be to add an upload-marker to a temporary location on the destination. The upload-marker could be the existence of a file like upload-<key>. There are a number of challenges with regard to the precise URL for such a mark, especially with the templates that uncurl supports. A possible, but ugly, solution might be to require the specification of a temporary location when initializing an uncurl-remote.

In the case of ORA2Remote, the problem is simpler. Since we control the URL-rewriting template, we could define the following location to hold upload-marker:

f'{base_url}/{dsid[:3]}/{dsid[3:]}/transfer/upload-{annex_key}'

Then we place such a marker into the location, until the upload has been performed. The usual problems of removing the marker, if the git annex remote crashes without removing the marker apply.

Another approach might be possible: if the stat-method of the UrlOperations-instance supports reliable reporting of the size of the remotely stored object. In this case would stat the remote object to check whether it is completely present on the remote. This requires that the size is always contained within the annex key.

[^1]: "CHECKPRESENT Key [...] It's important that, while a key is being transferred to a remote, CHECKPRESENT not indicate it's present in the remote until all the data has been sent." in https://git-annex.branchable.com/design/external_special_remote_protocol/

mih commented 1 year ago

Thanks @christian-monch ! I don't want to expand the discussion here much -- it is only somewhat related to this particualr PR. But here is a related issue: https://github.com/datalad/datalad-next/issues/454