Open jonathanswenson opened 1 year ago
@jonathanswenson thank you for the detailed explanation. I believe it's the same issue reported here: https://github.com/aws/aws-sdk-java-v2/issues/4083.
If so, we have a task to investigate it further.
I ran into this same issue recently.
And your deduction is correct that wrapping instead of copying is the problem. One cannot reasonably expect OutputStream#write() to return before it has either written the buffer or copied it.
As of now this renders the whole BlockingOutputStreamAsyncRequestBody pretty much pointless, as you either need to trust in pure luck by adding some sleeps and waits, or buffer your whole input into memory, and at that point you can use other, more reliable ways to upload.
@jonathanswenson an easy workaround is to use
final AsyncRequestBody body = AsyncRequestBody.fromPublisher(publisher);
final CompletableFuture<UploadPartResponse> resp = asyncClient.uploadPart(request, body);
And then something like this:
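import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import org.reactivestreams.Publisher;
import org.reactivestreams.Subscriber;
import org.reactivestreams.Subscription;

public class QueuePublisher extends OutputStream implements Publisher<ByteBuffer> {
    // Bounded queue: write() blocks once 10 chunks are waiting to be sent
    private final BlockingQueue<QueueBuffer> queue = new LinkedBlockingQueue<>(10);
    private final long contentLength;
    private volatile long pos = 0;

    record QueueBuffer(byte[] buffer, int length) {}

    // Set from subscribe(), read from the writer thread, hence volatile
    private volatile Subscriber<? super ByteBuffer> subscriber;

    public QueuePublisher(long contentLength) {
        this.contentLength = contentLength;
    }

    @Override
    public void write(int b) throws IOException {
        // Delegate instead of silently dropping the byte
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] buffer, int off, int len) throws IOException {
        // Copy before returning so the caller can safely reuse its buffer
        final byte[] internalBuffer = new byte[len];
        System.arraycopy(buffer, off, internalBuffer, 0, len);
        try {
            queue.put(new QueueBuffer(internalBuffer, len));
        } catch (InterruptedException e) {
            final Subscriber<? super ByteBuffer> s = subscriber;
            if (s != null) {
                s.onError(e);
            }
            throw new IOException(e);
        }
    }

    @Override
    public void subscribe(Subscriber<? super ByteBuffer> subscriber) {
        this.subscriber = subscriber;
        this.subscriber.onSubscribe(new QueueSubscriber());
    }

    class QueueSubscriber implements Subscription {
        private final AtomicBoolean done = new AtomicBoolean(false);

        @Override
        public void request(long n) {
            if (done.get()) return;
            // Send at most n chunks, stopping early once the content is complete
            for (long i = 0; i < n; i++) {
                if (done.get()) break;
                send();
            }
        }

        private synchronized void send() {
            try {
                // Blocks until the writer has queued another chunk
                QueueBuffer qb = queue.take();
                subscriber.onNext(ByteBuffer.wrap(qb.buffer()));
                pos += qb.length();
                if (pos == contentLength) {
                    done.set(true);
                    subscriber.onComplete();
                }
            } catch (InterruptedException e) {
                subscriber.onError(e);
                throw new RuntimeException(e);
            }
        }

        @Override
        public void cancel() {
            // TODO implement
        }
    }
}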
It will buffer a little bit more but in the end that should be no more than 10 x your write buffer size.
I'm moving potentially large files (several hundred MiB) before doing multipart, and even with multipart I have quite big parts so I cannot use anything other than streaming.
I haven't had the time to test this more but there is of course potential that this solution will send too much stuff downstream and I need to throttle it more according to the demand, but that remains to be seen.
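To make the "no more than 10 x your write buffer size" estimate above concrete, here is a minimal sketch of how a bounded LinkedBlockingQueue caps the number of queued chunks. The class and method names (BoundedQueueDemo, acceptedChunks, maxBufferedBytes) and the 8192-byte chunk size are made up for illustration and are not part of the workaround or the SDK.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class, not part of the workaround above.
class BoundedQueueDemo {
    // Counts how many chunks the queue accepts when nothing drains it.
    static int acceptedChunks(int capacity, int chunkSize, int attempts) {
        BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>(capacity);
        int accepted = 0;
        for (int i = 0; i < attempts; i++) {
            // offer() returns false when the queue is full;
            // put(), as used in the workaround, would block here instead.
            if (queue.offer(new byte[chunkSize])) {
                accepted++;
            }
        }
        return accepted;
    }

    // Upper bound on bytes held in the queue itself.
    static long maxBufferedBytes(int capacity, int chunkSize) {
        return (long) acceptedChunks(capacity, chunkSize, capacity + 5) * chunkSize;
    }

    public static void main(String[] args) {
        System.out.println(maxBufferedBytes(10, 8192)); // prints 81920
    }
}
```

Note that a writer blocked in put() has already made its copy, so the practical peak is roughly one or two chunks above the queue capacity.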
I also encountered the same corruption of uploaded files via BlockingOutputStreamAsyncRequestBody. No fix yet? This issue is almost 1 year old.
Describe the bug
Using BlockingOutputStreamAsyncRequestBody (via AsyncRequestBody.forBlockingOutputStream(...)) and sharing the byte array between subsequent writes to the output stream leads to data corruption when uploading a stream to S3 using the async Java SDK.
At a high level the write pattern is as follows (full code snippet below).
Expected Behavior
I expect that re-using a byte array between writes to an OutputStream does not lead to corrupt data.
Current Behavior
The data written to the output stream does not match the data that is written to S3.
Reproduction Steps
gradle imports:
Possible Solution
I believe this is happening due to wrapping, but not copying, the bytes passed to the output stream.
In https://github.com/aws/aws-sdk-java-v2/blob/master/utils/src/main/java/software/amazon/awssdk/utils/async/OutputStreamPublisher.java#L71 the byte buffer is wrapped for compatibility with async nio / publisher APIs. However, due to a lack of immutability and an expectation of blocking behavior from the OutputStream API, this leads to the wrapped data being mutated before it is successfully passed to the CRT library.
Likely what needs to happen here is the data needs to be copied before the write call returns.
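The difference between wrapping and copying can be shown without the SDK. The sketch below is a simulation under stated assumptions, not the SDK's actual code: DeferredStream is a hypothetical stand-in for a stream whose chunks are consumed only after write() has returned, and the writer reuses one byte array for every write.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a stream that hands chunks to a consumer later.
class DeferredStream extends OutputStream {
    private final boolean copy;
    private final List<ByteBuffer> pending = new ArrayList<>();

    DeferredStream(boolean copy) {
        this.copy = copy;
    }

    @Override
    public void write(int b) {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] b, int off, int len) {
        if (copy) {
            // Copy before returning: the caller may reuse its array afterwards.
            byte[] owned = new byte[len];
            System.arraycopy(b, off, owned, 0, len);
            pending.add(ByteBuffer.wrap(owned));
        } else {
            // Wrap only: the queued buffer still aliases the caller's array.
            pending.add(ByteBuffer.wrap(b, off, len));
        }
    }

    // Simulates the consumer reading the chunks after all writes returned.
    String drain() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (ByteBuffer buf : pending) {
            byte[] chunk = new byte[buf.remaining()];
            buf.get(chunk);
            out.write(chunk, 0, chunk.length);
        }
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }
}

class WrapVersusCopyDemo {
    // Writes "AAAA" then "BBBB" through one shared array.
    static String send(boolean copy) {
        DeferredStream stream = new DeferredStream(copy);
        byte[] shared = new byte[4];
        for (String part : new String[] {"AAAA", "BBBB"}) {
            System.arraycopy(part.getBytes(StandardCharsets.US_ASCII), 0, shared, 0, 4);
            stream.write(shared, 0, 4);
        }
        return stream.drain();
    }

    public static void main(String[] args) {
        System.out.println("wrap: " + send(false)); // prints wrap: BBBBBBBB
        System.out.println("copy: " + send(true));  // prints copy: AAAABBBB
    }
}
```

With wrapping, both queued buffers point at the same array, so the first chunk is overwritten by the second write before it is consumed. With copying, each chunk keeps the bytes it had when write() returned.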
Additional Information/Context
Originally I filed https://github.com/awslabs/aws-crt-java/issues/658 with the aws-crt-java library
However, I figured out that there is a similar but slightly different problem when using the CRT library -- the CRT library reports success when the corrupted data is uploaded, while the standard (non-CRT) SDK throws an error:
AWS Java SDK version used
2.20.118
JDK version used
openjdk version "17.0.2" 2022-01-18 LTS
OpenJDK Runtime Environment Zulu17.32+13-CA (build 17.0.2+8-LTS)
OpenJDK 64-Bit Server VM Zulu17.32+13-CA (build 17.0.2+8-LTS, mixed mode, sharing)
Operating System and version
Mac OSX 13.4.1 (M1)