cue-labs / oci

Go modules related to OCI (Open Container Initiative) registries
Apache License 2.0
22 stars 3 forks source link

direct download/upload #2

Open rogpeppe opened 1 year ago

rogpeppe commented 1 year ago

Currently all data read from and written to an ociregistry.Interface implementation flows directly through that implementation. This doesn't fit with the spirit of the OCI API which allows hand-off of reads and writes via the Location header and HTTP redirection to arbitrary other data providers.

Reads

The read-related entry points return BlobReader values which have a Descriptor method to return information about the blob. This is not directly reflected in the HTTP API, which only returns size, digest, and content-type information when reading. This means that the rest of the descriptor can be potentially be used by ociregistry.Interface implementations to return other metadata about the blob.

Specifically, the Descriptor.URLs field seems potentially appropriate for returning information about a location from which a blob can be directly downloaded.

Specifically the ociclient implementation could, when making a GetBlob request, detect that there's an HTTP redirect and, instead of just following the redirect, return a BlobReader that includes the redirect URL in its descriptor. When this happens, the client would not issue the actual GET request to the redirect location, but just make a HEAD request (following redirects for some steps if needed) to avoid initiating the data stream. The first Read call on the returned BlobReader would make the actual GET request to the final redirect location, thus fulfilling the GetBlob contract for callers that don't look at Descriptor.URLs. Alternatively the caller of GetBlob would be free to go direct to that URL, or (in the case of ociserver) to return a redirect status code itself.

ociregistry.Interface wrapper implementations would be responsible for clearing out the URLs field in descriptors returned from the wrapped registry implementation if not appropriate.

Writes

The PushBlobChunked entry point can be made to work with direct uploads by adding a Location (or similarly named) method to BlobWriter holding the location returned by the initial POST request. A caller would then be able to go directly to that location to exercise the PATCH/PUT part of the protocol instead of writing to the BlobWriter. The ociserver implementation (in handleBlobStartUpload specifically) would use BlobWriter.Location to populate the Location header if non-empty, falling back to using the local relative location otherwise.

The PushBlob entry point is currently problematic with respect to direct upload because the contract is that it will always push the blob using the data from the passed-in reader. That means that when we're talking to the HTTP client and the response is Accepted, (indicating a new location to upload to) we still end up pushing the data in this way. One way of working around this might be to specify that PushBlob can return some error indicating that PushBlobChunked must be used instead, mirroring the HTTP API.

A downside of that approach is that it implies an extra round-trip, because we'll end up doing the POST request twice: once for PushBlob to see the initial error and the second time to initiate the upload via PushBlobChunked. We could include the upload ID in the error returned by PushBlob, but we'd still have an extra round trip due to the GET request sent by PushBlobChunked to recover the offset, although it might be possible to mitigate that in the common case by avoiding that call and using zero offset when we know that the upload ID has just been returned from PushBlob.

Another downside of the approach is that it makes the caller API more complex to use: we can't just invoke PushBlob and expect that it will work. Instead, we'd probably implement a higher level function (maybe ociregistry.PushBlob) that knows how to use PushBlob and fall back to using PushBlobChunked on failure.

jonjohnsonjr commented 1 year ago

Specifically, the Descriptor.URLs field seems potentially appropriate for returning information about a location from which a blob can be directly downloaded.

I do like this a lot, I feel like it puns on the intended purpose and client behavior very well.

Another thing to consider is that descriptors have annotations, so we could use that for arbitrary side-channel information in other contexts where urls might not be a perfect fit.