gaul / s3proxy

Access other storage backends via the S3 API
Apache License 2.0

Azure multipart upload issue #708

jerbob92 opened this issue 1 month ago

When using the minio client (and thus also when using the Go SDK), I have some files that just can't be uploaded with the default settings. The only way I can upload them is by changing the --part-size: the default is 16MiB, but when I change it to 20 it works. I have disabled multipart completely for now to prevent any issues.

As far as I know there is nothing special about these files; most files like these upload just fine. It's a PDF file and it's 286697813 bytes. I use mc put to copy them from the local filesystem into Azure via s3proxy.

The error code that I get back from s3proxy is:

<?xml version='1.0' encoding='UTF-8'?><Error><Code>BadDigest</Code><Message>Bad Request</Message><RequestId>4442587FB7D0A2F9</RequestId></Error>

The error that is in the s3proxy logs is:

[s3proxy] E 10-24 17:56:15.230 S3Proxy-Jetty-4835 o.j.h.h.BackoffLimitedRetryHandler:88 |::] Cannot retry after server error, command is not replayable: [method=org.jclouds.azureblob.AzureBlobClient.public abstract void org.jclouds.azureblob.AzureBlobClient.putBlock(java.lang.String,java.lang.String,java.lang.String,org.jclouds.io.Payload)[test, test, AAKYEA==, [content=true, contentMetadata=[cacheControl=null, contentDisposition=null, contentEncoding=null, contentLanguage=null, contentLength=16777216, contentMD5=null, contentType=application/unknown, expires=null], written=false, isSensitive=false]], request=PUT https://url.blob.core.windows.net/test/test?comp=block&blockid=AAKYEA%3D%3D HTTP/1.1]
[s3proxy] E 10-24 17:56:15.288 S3Proxy-Jetty-4826 o.j.h.i.JavaUrlHttpCommandExecutorService:93 |::] error after writing 15597568/16777216 bytes to https://url.blob.core.windows.net/test/test?comp=block&blockid=AAJxAA%3D%3D
org.gaul.shaded.org.eclipse.jetty.io.EofException: Early EOF
    at org.gaul.shaded.org.eclipse.jetty.server.HttpChannelOverHttp.markEarlyEOF(HttpChannelOverHttp.java:287)
    at org.gaul.shaded.org.eclipse.jetty.server.HttpChannelOverHttp.earlyEOF(HttpChannelOverHttp.java:266)
    at org.gaul.shaded.org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1634)
    at org.gaul.shaded.org.eclipse.jetty.server.HttpConnection.parseRequestBuffer(HttpConnection.java:414)
    at org.gaul.shaded.org.eclipse.jetty.server.HttpConnection.parseAndFillForContent(HttpConnection.java:348)
    at org.gaul.shaded.org.eclipse.jetty.server.HttpChannelOverHttp.parseAndFillForContent(HttpChannelOverHttp.java:129)
    at org.gaul.shaded.org.eclipse.jetty.server.HttpChannelOverHttp.produceContent(HttpChannelOverHttp.java:115)
    at org.gaul.shaded.org.eclipse.jetty.server.AsyncContentProducer.produceRawContent(AsyncContentProducer.java:470)
    at org.gaul.shaded.org.eclipse.jetty.server.AsyncContentProducer.nextTransformedContent(AsyncContentProducer.java:357)
    at org.gaul.shaded.org.eclipse.jetty.server.AsyncContentProducer.isReady(AsyncContentProducer.java:277)
    at org.gaul.shaded.org.eclipse.jetty.server.BlockingContentProducer.nextContent(BlockingContentProducer.java:111)
    at org.gaul.shaded.org.eclipse.jetty.server.HttpInput.read(HttpInput.java:287)
    at org.gaul.shaded.org.eclipse.jetty.server.HttpInput.read(HttpInput.java:272)
    at com.google.common.io.ByteStreams.read(ByteStreams.java:929)
    at com.google.common.io.ByteStreams.readFully(ByteStreams.java:797)
    at com.google.common.io.ByteStreams.readFully(ByteStreams.java:781)
    at org.gaul.s3proxy.ChunkedInputStream.read(ChunkedInputStream.java:56)
    at org.gaul.s3proxy.ChunkedInputStream.read(ChunkedInputStream.java:70)
    at com.google.common.hash.HashingInputStream.read(HashingInputStream.java:68)
    at com.google.common.io.ByteStreams$LimitedInputStream.read(ByteStreams.java:742)
    at java.base/java.io.FilterInputStream.read(Unknown Source)
    at org.jclouds.io.ByteStreams2.copy(ByteStreams2.java:68)
    at org.jclouds.http.internal.JavaUrlHttpCommandExecutorService.writePayloadToConnection(JavaUrlHttpCommandExecutorService.java:302)
    at org.jclouds.http.internal.JavaUrlHttpCommandExecutorService.convert(JavaUrlHttpCommandExecutorService.java:175)
    at org.jclouds.http.internal.JavaUrlHttpCommandExecutorService.convert(JavaUrlHttpCommandExecutorService.java:66)
    at org.jclouds.http.internal.BaseHttpCommandExecutorService.invoke(BaseHttpCommandExecutorService.java:97)
    at org.jclouds.rest.internal.InvokeHttpMethod.invoke(InvokeHttpMethod.java:91)
    at org.jclouds.rest.internal.InvokeHttpMethod.apply(InvokeHttpMethod.java:74)
    at org.jclouds.rest.internal.InvokeHttpMethod.apply(InvokeHttpMethod.java:45)
    at org.jclouds.rest.internal.DelegatesToInvocationFunction.handle(DelegatesToInvocationFunction.java:156)
    at org.jclouds.rest.internal.DelegatesToInvocationFunction.invoke(DelegatesToInvocationFunction.java:123)
    at jdk.proxy2/jdk.proxy2.$Proxy72.putBlock(Unknown Source)
    at org.jclouds.azureblob.blobstore.AzureBlobStore.uploadMultipartPart(AzureBlobStore.java:450)
    at jdk.internal.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.base/java.lang.reflect.Method.invoke(Unknown Source)
    at com.google.inject.internal.DelegatingInvocationHandler.invoke(DelegatingInvocationHandler.java:50)
    at jdk.proxy2/jdk.proxy2.$Proxy73.uploadMultipartPart(Unknown Source)
    at org.gaul.s3proxy.S3ProxyHandler.handleUploadPart(S3ProxyHandler.java:2851)
    at org.gaul.s3proxy.S3ProxyHandler.doHandle(S3ProxyHandler.java:748)
    at org.gaul.s3proxy.S3ProxyHandlerJetty.handle(S3ProxyHandlerJetty.java:80)
    at org.gaul.shaded.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
    at org.gaul.shaded.org.eclipse.jetty.server.Server.handle(Server.java:563)
    at org.gaul.shaded.org.eclipse.jetty.server.HttpChannel$RequestDispatchable.dispatch(HttpChannel.java:1598)
    at org.gaul.shaded.org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:753)
    at org.gaul.shaded.org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:501)
    at org.gaul.shaded.org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:287)
    at org.gaul.shaded.org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
    at org.gaul.shaded.org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
    at org.gaul.shaded.org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
    at org.gaul.shaded.org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:421)
    at org.gaul.shaded.org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:390)
    at org.gaul.shaded.org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:277)
    at org.gaul.shaded.org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:199)
    at org.gaul.shaded.org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411)
    at org.gaul.shaded.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969)
    at org.gaul.shaded.org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194)
    at org.gaul.shaded.org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149)
    at java.base/java.lang.Thread.run(Unknown Source)

All the blocks that it sends are 16800342 bytes, except for the last one, which is 1487296 bytes. Every block uploads fine except that last one, which results in the error above.

Any idea what's going on here? I'm going to see if I can debug this some more tomorrow.

Other files that are failing are of sizes: 291995023, 286683989, 286904511, 287128205, 304781589, 293607881

I'm running the version from this commit: https://github.com/gaul/s3proxy/commit/356c12f83869f8285bd19a40af9f2d3e09f5cd07

gaul commented 1 month ago

Could you test with both the azureblob and azureblob-sdk providers? The latter is newer and will receive most of my development effort going forward. It is not yet in a release, so you will have to compile from master.
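
For reference, switching backends is a one-line change in the S3Proxy properties file. A sketch with placeholder endpoint, account, and key values (and assuming the new backend reuses the jclouds.* configuration keys under the provider name azureblob-sdk):

# s3proxy.conf sketch; all values below are placeholders
s3proxy.endpoint=http://127.0.0.1:8080
s3proxy.authorization=aws-v2-or-v4
s3proxy.identity=local-identity
s3proxy.credential=local-credential
# switch between the jclouds-based and SDK-based Azure backends here
jclouds.provider=azureblob-sdk
jclouds.identity=mystorageaccount
jclouds.credential=base64-account-key
jclouds.endpoint=https://mystorageaccount.blob.core.windows.net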

jerbob92 commented 1 month ago

@gaul Luckily every commit is also pushed as a docker image, so it's easy to test! :) I tested it with the docker image sha-6185f2b, which is this commit: https://github.com/gaul/s3proxy/commit/6185f2b46fa3421ded2b5e3c36d7a7f03f18a12d (the latest on master)

However, with that I get the following error:

[s3proxy] W 10-25 07:03:05.892 reactor-http-epoll-33 r.n.h.client.HttpClientConnect:304 |::] [0190fddc-1, L:/172.17.0.2:41086 - R:url.blob.core.windows.net/ip:443] The connection observed an error
java.io.UncheckedIOException: java.io.IOException: mark/reset not supported
    at com.azure.storage.common.Utility.lambda$convertStreamToByteBuffer$2(Utility.java:261)
    at reactor.core.publisher.FluxDefer.subscribe(FluxDefer.java:46)
    at reactor.core.publisher.FluxSubscribeOn$SubscribeOnSubscriber.run(FluxSubscribeOn.java:194)
    at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84)
    at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: mark/reset not supported
    at java.base/java.io.InputStream.reset(Unknown Source)
    at java.base/java.io.FilterInputStream.reset(Unknown Source)
    at com.azure.storage.common.Utility.lambda$convertStreamToByteBuffer$2(Utility.java:259)
    ... 9 common frames omitted
[s3proxy] E 10-25 07:03:05.896 boundedElastic-4 c.azure.storage.common.Utility:531 |::] java.io.IOException: mark/reset not supported

So should I rather try to make azureblob-sdk work instead of trying to fix azureblob?

gaul commented 1 month ago

@jerbob92 please test the latest master. azureblob is the jclouds-based provider that S3Proxy has used for 10 years; it has some shortcomings in terms of authentication and I plan to abandon it. azureblob-sdk is newer and uses the Microsoft SDK. I am more eager to fix bugs in the latter, but I will happily accept PRs to fix the former.

gaul commented 4 weeks ago

Sigh, this fix is wrong. I have been improving the testing for azureblob-sdk and didn't test MPU end-to-end.

gaul commented 4 weeks ago

One workaround is to wrap the InputStream in a BufferedInputStream. However, this could use an unbounded amount of memory. I don't believe that the SDK provides the API S3Proxy needs to stream an uncommitted block. I filed a feature request with their team to see if they will add a new API or whether there is some other way to express this.
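
A minimal sketch of that workaround, with hypothetical names (partStream and partSize stand in for the request body and its length):

import java.io.BufferedInputStream;
import java.io.InputStream;

final class MarkableStreams {
    // Wrap a raw part stream so the SDK's mark()/reset() calls succeed.
    // The buffer must be able to hold the entire part so that reset()
    // can rewind to the start, which is why memory use grows with the
    // part size.
    static InputStream markable(InputStream partStream, int partSize) {
        BufferedInputStream buffered = new BufferedInputStream(partStream, partSize);
        buffered.mark(partSize);
        return buffered;
    }
}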

jerbob92 commented 3 weeks ago

Not sure if I understand this correctly. It's not really clear to me why the Azure SDK would require the stream to be seekable, but since multipart uploads are meant for smaller blocks, I would say wrapping it in a BufferedInputStream isn't that bad until a fix is available in the Azure SDK?

jerbob92 commented 3 weeks ago

@gaul I have found the bug. The error from Azure is:

code='InvalidQueryParameterValue', message='Value for one of the query parameters specified in the request URI is invalid.'
context='{QueryParameterValue=AAK_IA==, QueryParameterName=blockid, Reason=Not a valid base64 string.}'

Looking at the jclouds code, the makeBlockId function is:

BaseEncoding.base64Url().encode(Ints.toByteArray(partNumber));

It uses base64Url(), whose alphabet includes _, a character that Azure does not accept in block IDs. jclouds uses the azure-storage-blob package, and I imagine that package handles URL encoding of the block ID, so I don't know why base64Url was chosen. Just using base64 would be enough.
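
A quick demo of the difference (this class is only for illustration; 180000 is the integer that the rejected blockid AAK_IA== above decodes to):

import com.google.common.io.BaseEncoding;
import com.google.common.primitives.Ints;

public class BlockIdDemo {
    public static void main(String[] args) {
        // 180000 is the value encoded in the rejected blockid AAK_IA==
        byte[] bytes = Ints.toByteArray(180000);
        // base64Url() uses '_' for alphabet index 63, which Azure rejects
        System.out.println(BaseEncoding.base64Url().encode(bytes)); // AAK_IA==
        // plain base64() uses '/' for index 63, which Azure accepts
        System.out.println(BaseEncoding.base64().encode(bytes));    // AAK/IA==
    }
}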

Edit: it looks like this was already fixed: https://github.com/apache/jclouds/commit/6ef293dfd34f2af0ef45bacd04247c3e8afe0261 And you approved it: https://github.com/apache/jclouds/pull/208

gaul commented 3 weeks ago

For the original azureblob issue, please edit S3Proxy's pom.xml, set jclouds.version to 2.7.0-SNAPSHOT, and test this. jclouds is not in a healthy state and I am unsure whether I can release another version that includes this fix.

For the azureblob-sdk issue, I haven't looked at the SDK source code, but please test your suggested BufferedInputStream and report back? Originally Azure only supported 4 MB block sizes, so this might have been a reasonable workaround, but now part sizes can be as large as 4 GB: https://learn.microsoft.com/en-us/azure/storage/blobs/scalability-targets . S3Proxy disables all retries (6610b14ea51d8603e406dd02a64ce150afc4f819) to allow the client to retry, so requiring mark/reset support is not a good fit for this project. I am waiting for a response from the SDK team, but implementing the OutputStream API should be straightforward, so I don't want to look at workarounds yet.

jerbob92 commented 3 weeks ago

When I try that I get the following error:

The POM for org.apache.jclouds:jclouds-allblobstore:jar:2.7.0-SNAPSHOT is missing, no dependency information available

When I use 2.6.1-SNAPSHOT I can build it, and the issue is indeed resolved.

It shouldn't be too hard to roll a 2.6.1 release that just contains this fix, right?

First, change master to 2.7.0-SNAPSHOT.

Then create a new branch for 2.6 maintenance, reset it to the commit that the 2.6.0 tag refers to, and cherry-pick the base64 fix into it:

git checkout -b 2.6.x
git reset --hard 173e3a4a49d910ad46d77a508a2ba7b67abf31fa
git cherry-pick 6ef293dfd34f2af0ef45bacd04247c3e8afe0261

Then change the version on the 2.6.x branch to 2.6.1. I'm not sure how that works with distribution, but since there is also a 2.4.x branch I think this has been done like this before.

Regarding the part size limit: I had the default part size of 16MB in mind when thinking of the workaround; when you use a much larger part size, it can indeed become problematic.

Ideally azureblob will be fixed as well, since azureblob-sdk is quite new (and probably not fully ready).

gaul commented 3 weeks ago

It shouldn't be too hard to roll a 2.6.1 release that just contains this fix, right?

The Apache release process is more involved and I would prefer to spend the ~10 hours doing more useful things if possible. That project is dying, which was the motivation to write azureblob-sdk.

Ideally Microsoft will respond to my SDK issue. Otherwise I will send them a PR. There are alternatives, like using the existing sub-part chunking logic for earlier Azure APIs, that might be a good workaround if you want to investigate them. Let's leave this issue open until the path forward is clearer.

jerbob92 commented 3 weeks ago

The Apache release process is more involved and I would prefer to spend the ~10 hours doing more useful things if possible. That project is dying, which was the motivation to write azureblob-sdk.

Ah OK, that sounds like a pain if creating a release takes that much time.

There are alternatives like using the existing sub-part chunking logic for earlier Azure APIs that might be a good workaround if you want to investigate them.

What do you mean by this?

I'll fork and build my own docker image for now.

gaul commented 3 weeks ago

S3Proxy currently has logic to map S3 parts larger than 4 MB into a sequence of Azure parts of at most 4 MB:

https://github.com/gaul/s3proxy/blob/5d960b3eedf09bd9b05b8dc517256805e3c1476d/src/main/java/org/gaul/s3proxy/S3ProxyHandler.java#L2837

I believe that this should limit the memory usage of your BufferedInputStream. Actually, now that I look at it, I fixed the older azureblob implementation to support 100 MB parts (but not 5 GB, so subparts are still needed). But you could artificially limit this to 4 MB yourself by using a different value for azureMaximumMultipartPartSize.
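
As a sketch of the idea (not S3Proxy's actual code; the BlockSink callback and the method names are hypothetical), splitting one S3 part into bounded Azure blocks also caps how much data must be buffered at a time:

import java.io.IOException;
import java.io.InputStream;

final class SubPartChunker {
    interface BlockSink {
        void putBlock(int subPartNumber, byte[] data, int length) throws IOException;
    }

    // Split one S3 part into a sequence of Azure blocks of at most
    // maxBlockSize bytes. Each chunk is buffered in full before staging,
    // so memory use is bounded by maxBlockSize rather than the part size.
    static void stageAsBlocks(InputStream s3Part, long partLength,
            int maxBlockSize, BlockSink sink) throws IOException {
        byte[] buffer = new byte[maxBlockSize];
        long remaining = partLength;
        int subPart = 0;
        while (remaining > 0) {
            int toRead = (int) Math.min(maxBlockSize, remaining);
            int read = s3Part.readNBytes(buffer, 0, toRead);
            if (read <= 0) {
                break; // premature EOF; the caller should surface an error
            }
            sink.putBlock(subPart++, buffer, read);
            remaining -= read;
        }
    }
}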

I am past my limit on discussing hacks to make things work in the short term. I only have enthusiasm for working on the correct long-term solution, so please self-support and report back if you have something useful.

jerbob92 commented 3 weeks ago

I'm not planning on using a workaround for azureblob-sdk; even if I were, the memory usage of BufferedInputStream would not be an issue for me, since my parts are always at most 16MB. You brought up the potential memory issue.

I'm just trying to get multipart working for larger files right now, without making major changes or using something untested. Since the bug is already fixed in jclouds, I have made my own docker image with the latest version of s3proxy plus the jclouds snapshot. Since it might be useful for someone else, here it is: https://github.com/jerbob92/s3proxy/pkgs/container/s3proxy%2Fcontainer