jerbob92 opened this issue 1 month ago
Could you test with both the azureblob and azureblob-sdk providers? The latter is newer and will receive most of my development effort going forward. This is not yet in a release so you will have to compile from master.
@gaul Luckily every commit is also pushed as a Docker image, so it's easy to test! :)

I tested it with the Docker image sha-6185f2b, which is this commit: https://github.com/gaul/s3proxy/commit/6185f2b46fa3421ded2b5e3c36d7a7f03f18a12d (the latest on master).

However, with that I get the following error:
```
[s3proxy] W 10-25 07:03:05.892 reactor-http-epoll-33 r.n.h.client.HttpClientConnect:304 |::] [0190fddc-1, L:/172.17.0.2:41086 - R:url.blob.core.windows.net/ip:443] The connection observed an error
java.io.UncheckedIOException: java.io.IOException: mark/reset not supported
	at com.azure.storage.common.Utility.lambda$convertStreamToByteBuffer$2(Utility.java:261)
	at reactor.core.publisher.FluxDefer.subscribe(FluxDefer.java:46)
	at reactor.core.publisher.FluxSubscribeOn$SubscribeOnSubscriber.run(FluxSubscribeOn.java:194)
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84)
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: mark/reset not supported
	at java.base/java.io.InputStream.reset(Unknown Source)
	at java.base/java.io.FilterInputStream.reset(Unknown Source)
	at com.azure.storage.common.Utility.lambda$convertStreamToByteBuffer$2(Utility.java:259)
	... 9 common frames omitted
[s3proxy] E 10-25 07:03:05.896 boundedElastic-4 c.azure.storage.common.Utility:531 |::] java.io.IOException: mark/reset not supported
```
So should I rather try to make `azureblob-sdk` work instead of trying to fix `azureblob`?
@jerbob92 please test the latest master. `azureblob` is the jclouds-based provider that S3Proxy has used for 10 years; it has some shortcomings in terms of authentication and I plan to abandon it. `azureblob-sdk` is newer and uses the Microsoft SDK. I am more eager to fix any bugs in the latter, but I will happily accept PRs to fix the former.
Sigh, this fix is wrong. I am improving the testing for `azureblob-sdk` and didn't test MPU end-to-end.

One workaround is to wrap the `InputStream` in a `BufferedInputStream`. However, this could use an unbounded amount of memory. I don't believe that the SDK provides the API S3Proxy needs to stream an uncommitted block. I filed a feature request with their team to see if they will add a new API or whether there is some other way to express this.
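For reference, the wrapping workaround would look something like the minimal sketch below (a hypothetical helper, not S3Proxy's actual code). `BufferedInputStream` supports `mark()`/`reset()` by holding the read bytes on the heap, which is why the memory use is effectively unbounded for large parts:

```java
import java.io.BufferedInputStream;
import java.io.InputStream;

final class MarkResetWorkaround {
    // Hypothetical helper: make sure the stream handed to the Azure SDK supports
    // mark/reset. BufferedInputStream keeps the bytes read since mark() in memory,
    // so for a large part this can buffer the whole part on the heap.
    static InputStream ensureMarkSupported(InputStream payload, int bufferSize) {
        return payload.markSupported()
                ? payload
                : new BufferedInputStream(payload, bufferSize);
    }
}
```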
Not sure if I understand correctly. It's not really clear to me why the Azure SDK would require the stream to be seekable, but since multipart uploads are meant for smaller blocks, I would say wrapping it in a `BufferedInputStream` isn't that bad until a fix is available in the Azure SDK?
@gaul I have found the bug. The error from Azure is:

```
code='InvalidQueryParameterValue', message='Value for one of the query parameters specified in the request URI is invalid.'
context='{QueryParameterValue=AAK_IA==, QueryParameterName=blockid, Reason=Not a valid base64 string.}'
```
Looking at the jclouds code, the `makeBlockId` function does:

```java
BaseEncoding.base64Url().encode(Ints.toByteArray(partNumber));
```

It uses `base64Url()`, which uses `_` as a base64 character, something that Azure does not like. jclouds uses the azure-storage-blob package, and I imagine that it handles URL encoding of the block ID, so I don't know why they selected `base64Url()`. I would say that just using `base64()` would be enough.
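To illustrate the difference with the same Guava calls, here is a small standalone sketch (the part number is an arbitrary example where the two alphabets diverge, not a value taken from the failing upload):

```java
import com.google.common.io.BaseEncoding;
import com.google.common.primitives.Ints;

public class BlockIdAlphabets {
    public static void main(String[] args) {
        int partNumber = 252; // arbitrary value where the two alphabets differ

        // URL-safe alphabet uses '-' and '_' instead of '+' and '/'
        String urlSafe = BaseEncoding.base64Url().encode(Ints.toByteArray(partNumber));
        // Standard alphabet, which Azure's blockid validation expects
        String standard = BaseEncoding.base64().encode(Ints.toByteArray(partNumber));

        System.out.println(urlSafe);   // AAAA_A== -> contains '_', rejected as "Not a valid base64 string"
        System.out.println(standard);  // AAAA/A==
    }
}
```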
Edit: it looks like this was fixed already? https://github.com/apache/jclouds/commit/6ef293dfd34f2af0ef45bacd04247c3e8afe0261 And you approved it https://github.com/apache/jclouds/pull/208
For the original `azureblob` issue, please edit S3Proxy's pom.xml, set `jclouds.version` to 2.7.0-SNAPSHOT, and test this. jclouds is not in a healthy state and I am unsure if I can release another version that includes this fix.
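A rough sketch of that edit, assuming `jclouds.version` is declared as a property in S3Proxy's pom.xml (only the property itself is shown; the snapshot also has to be resolvable from a snapshot repository):

```xml
<properties>
  <!-- point the jclouds dependencies at the snapshot that contains the base64 block ID fix -->
  <jclouds.version>2.7.0-SNAPSHOT</jclouds.version>
</properties>
```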
For the `azureblob-sdk` issue, I haven't looked at the SDK source code, but could you please test your suggested `BufferedInputStream` and report back? Originally Azure only supported 4 MB block sizes, so this might have been a reasonable workaround, but now part sizes can be as large as 4 GB: https://learn.microsoft.com/en-us/azure/storage/blobs/scalability-targets. S3Proxy disables all retries (6610b14ea51d8603e406dd02a64ce150afc4f819) to allow the client to retry, so requiring mark/reset support is not a good fit for this project. I am waiting for a response from the SDK team, but implementing the `OutputStream` API should be straightforward, so I don't want to look at workarounds yet.
When I try that I get the following error:

```
The POM for org.apache.jclouds:jclouds-allblobstore:jar:2.7.0-SNAPSHOT is missing, no dependency information available
```
When I use 2.6.1-SNAPSHOT I can build it, and the issue is indeed resolved. It shouldn't be too hard to roll a 2.6.1 release that just contains this fix, right?
First change master to 2.7.0-SNAPSHOT. Then create a new branch for 2.6 maintenance, reset it to the commit that the 2.6.0 tag refers to, and cherry-pick the base64 fix into it:

```sh
git checkout -b 2.6.x
git reset --hard 173e3a4a49d910ad46d77a508a2ba7b67abf31fa
git cherry-pick 6ef293dfd34f2af0ef45bacd04247c3e8afe0261
```

Then change the version on the 2.6.x branch to 2.6.1. I'm not sure how that works with distribution, but since there is also a 2.4.x branch, I think this has been done like this before.
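As a concrete sketch of that last step, assuming the Maven Versions Plugin is used for the bump (the actual Apache release process involves more than this, as noted below):

```sh
# on the 2.6.x branch: set every module to 2.6.1, then remove the pom backup files
mvn versions:set -DnewVersion=2.6.1
mvn versions:commit
```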
Regarding the part size limit: I had the default part size of 16 MB in mind when thinking of the workaround; with a much larger part size it can indeed become problematic.

Ideally `azureblob` will be fixed as well, since `azureblob-sdk` is quite new (and probably not fully ready).
> It shouldn't be too hard to roll a 2.6.1 release that just contains this fix right?
The Apache release process is more involved and I would prefer to spend the ~10 hours doing more useful things if possible. That project is dying, which was the motivation to write `azureblob-sdk`.
Ideally Microsoft will respond to my SDK issue; otherwise I will send them a PR. There are alternatives, like using the existing sub-part chunking logic for earlier Azure APIs, that might be a good workaround if you want to investigate them. Let's leave this issue open until the path forward is clearer.
> The Apache release process is more involved and I would prefer to spend the ~10 hours doing more useful things if possible. That project is dying which was the motivation to write azureblob-sdk.
Ah ok, that sounds like a pain then if creating a release takes that much time.
> There are alternatives like using the existing sub-part chunking logic for earlier Azure APIs that might be a good workaround if you want to investigate them.
What do you mean by this?
I'll fork and build my own docker image for now.
S3Proxy currently has logic to map S3 > 4 MB parts into a sequence of Azure <= 4 MB parts:
I believe that this should limit the memory usage of your `BufferedInputStream`. Actually, now that I look at this, I fixed the older `azureblob` implementation to support 100 MB parts (but not 5 GB, so subparts are still needed). But you could artificially limit this yourself to 4 MB by using a different value for `azureMaximumMultipartPartSize`.
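To make the sub-part idea concrete, here is an illustrative sketch (hypothetical names, not the S3Proxy code referenced above): one S3 part larger than the Azure block limit is staged as several smaller Azure blocks, whose IDs are all included in the final commit list.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

final class SubPartSketch {
    // Stand-in for whatever call stages a single uncommitted block in Azure.
    interface BlockUploader {
        void stageBlock(String blockId, InputStream data, long length) throws IOException;
    }

    // Illustrative only: split one S3 part into blocks of at most maxBlockSize bytes.
    static List<String> uploadAsSubParts(InputStream part, long partSize, long maxBlockSize,
                                         BlockUploader uploader) throws IOException {
        List<String> blockIds = new ArrayList<>();
        long remaining = partSize;
        for (int subPart = 0; remaining > 0; subPart++) {
            long length = Math.min(remaining, maxBlockSize);
            String blockId = "subpart-" + subPart; // real code derives a base64-encoded block ID
            uploader.stageBlock(blockId, part, length); // reads exactly 'length' bytes from the part
            blockIds.add(blockId);
            remaining -= length;
        }
        return blockIds; // committed together with every other part's block IDs
    }
}
```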
I am past my limit on discussing hacks to make things work in the short-term. I only have enthusiasm to work on the correct long-term solution so please self-support and report back if you have something useful.
I'm not planning on using a workaround for `azureblob-sdk`; even if I were, the memory usage of `BufferedInputStream` would not be an issue for me, since my parts are always at most 16 MB. You brought up the potential memory issue.

I'm just trying to get multipart working for larger files right now without making major changes or using something untested. Since the bug is already fixed in jclouds, I have made my own Docker image with the latest version of s3proxy plus the jclouds snapshot. Since it might be useful for someone else, here it is: https://github.com/jerbob92/s3proxy/pkgs/container/s3proxy%2Fcontainer
When using the minio client (and thus also when using the Go SDK), I have some files that just can't be uploaded with the default settings. The only way I can upload them is by changing the `--part-size`. The default is 16MiB, but when I change it to 20 it works. I have disabled multipart completely for now to prevent any issues.

As far as I know there is nothing special about these files; most files like these upload just fine. It's a PDF file and it's 286697813 bytes. I use `mc put` from the local filesystem to copy them into Azure via s3proxy.

The error code that I get back from s3proxy is:

The error that is in the s3proxy logs is:

All the blocks that it sends are 16800342 bytes, except for the last one, which is 1487296 bytes. All other blocks are uploaded fine, except for the last one, which results in the error above. Any idea what's going on here? I'm going to see if I can debug this some more tomorrow.
Other files that are failing are of sizes: 291995023, 286683989, 286904511, 287128205, 304781589, 293607881
I'm running the version from this commit: https://github.com/gaul/s3proxy/commit/356c12f83869f8285bd19a40af9f2d3e09f5cd07
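For completeness, a hypothetical invocation of the part-size workaround described above (the alias and path are made up, and the exact `mc put` flag syntax and size suffix should be checked against `mc put --help`):

```sh
# upload through the s3proxy alias with 20 MiB parts instead of the 16 MiB default
mc put --part-size 20MiB ./file.pdf s3proxy/bucket/file.pdf
```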