cs3org / cs3apis

:arrows_clockwise: Connect Storage and Application Providers
https://buf.build/cs3org-buf/cs3apis
Apache License 2.0
53 stars 29 forks source link

Clarified scope of lock_id in InitiateFileUploadRequest #226

Closed glpatcern closed 8 months ago

glpatcern commented 8 months ago

The rationale of this PR is that an InitiateFileUpload request shall only provide an endpoint, for the client to perform the upload as a separate step. Therefore, any lock validation MUST be performed again at the actual upload time, when the lock_id is to be passed again, and as such there's no need to include it in the payload of InitiateFileUplod.

How is Reva/edge dealing with atomically ensuring the lock is valid currently? @wkloucek and/or @micbar can you shed some light? My only guess is that you'd validate the lock again during the TUS upload. In that case, we may just shortcut.

For Reva/master, this is in the makings (at relatively low priority), that's why I realized only now about this.

micbar commented 8 months ago

@butonic @dragotin fyi

glpatcern commented 8 months ago

From @micbar:

Regarding the lock checking: it should be done during initiateFileUpload and on any later step too.

glpatcern commented 8 months ago

Fair enough, after all lock_id is optional and can stay, but the issue remains: the ultimate place where to check is the actual upload.

So possibly a question to @butonic : how is Reva edge validating the lock context on PUT? Which headers do you use/enforce on the clients for them to pass the lock_id? Can we agree on them so to properly implement https://github.com/cs3org/wopiserver/pull/139 ?

glpatcern commented 8 months ago

I'm merging this as it's only a clarification in the comments, the semantic does not change.

The details how to enforce locking by the data transfer protocol returned in response are yet to be followed up in https://github.com/cs3org/wopiserver/pull/139

butonic commented 8 months ago

So possibly a question to @butonic : how is Reva edge validating the lock context on PUT? Which headers do you use/enforce on the clients for them to pass the lock_id? Can we agree on them so to properly implement cs3org/wopiserver#139 ?

There are no direct PUT uploads to the dataprovider. Uploads MUST start with a call to InitiateFIleUpload. That will start an UploadSession, which also carries the Lock. When the UploadSession is finished, and the postprocessing outcome allows the upload to be registered as a new file revision (or as a new file) the lock will be checked before aquiring the low level file lock on the file metadata.

Upload sessions can always be written to, regardless of the protocol used to transfer the bytes. edge uses asynchronous postprocessing, so we check the lock twice: once in ̀InitiateUpload and once before creating the revision after postprocessing. Since all upload protocols only write to a temporary location, they all check the lock twice, because the lock is not relevant for the byte transfer. Only at the beginning (to prevent unnecessary uploads) and at the end for the actual check.

glpatcern commented 8 months ago

Thanks @butonic for these insights. Granted that the dataprovider does not serve direct uploads, we do exactly the same (and we also have an atomic move after uploading to a temp location, though this is an implementation detail). Yet that does not change the question: if I understand correctly, your UploadSession must actually be persisted, along with the InitiateFileUpload.lock_id metadata, such that the lock check in your postprocessing hook can be done - which is the check that ultimately rules, the initial one being an optimization to avoid unnecessary uploads, as you said. Is that the case?

What I implied with my proposal was to "leave the burden" of keeping the state (the lock metadata) on the client side, and enforce that clients pass it on the actual upload (see how I implemented that in the cs3org/wopiserver#139 PR), for the dataprovider to dispatch it to e.g. the UploadSession. Does that make sense?

butonic commented 8 months ago

Yes that is the case. UploadSessions are persisted as a separate json file to keep track of the parent id, the name, the nodeid, the lock, any metadata assigned with the upload, the current offfset and full file size ... regarding the implementation detail we use googles renameio package for the atomicity.

I like the idea of using a JWT as the upload id! Unfortunately, the upload session also has to keep track of the postprocessing state. When all bytes have been transferred we start ASYNC postprocessing and may receive a virus detected event or a delete event. Also if postprocessing is interrupted we need to be able to pick up and restart unfinished uploads. We need to keep track of that upload state. Furthermore, I want to be able to show an incoming upload in the web ui. This requires keeping track of state on the server side.

glpatcern commented 8 months ago

Unfortunately, the upload session also has to keep track of the postprocessing state. When all bytes have been transferred we start ASYNC postprocessing and may receive a virus detected event or a delete event. Also if postprocessing is interrupted we need to be able to pick up and restart unfinished uploads. We need to keep track of that upload state. Furthermore, I want to be able to show an incoming upload in the web ui. This requires keeping track of state on the server side.

Understood, it appears you have much more state to be kept server-side than we do with Reva master, where actually some events (e.g. a concurrent delete) are handled by the storage, and we don't store anything else within Reva.

Anyway, decorating the data transfer with the lock context via dedicated HTTP headers would not cause any harm - you're relying on the URL returned by the dataprovider to be unguessable, like we do, and you're not going to read that additional payload for enforcing the lock at the end of the transfer, but for the protocol specification I still see it makes sense to not impose a stateful implementation.