aws / aws-sdk-java-v2

The official AWS SDK for Java - Version 2
Apache License 2.0

S3: getObject, ResponseInputStream close not working as expected. #2117

Open JstDust opened 3 years ago

JstDust commented 3 years ago

Describe the bug

When you invoke close() on the ResponseInputStream returned by getObject before reaching the end of the data, the stream does not close immediately (or within a reasonable time) on big files.

Expected Behavior

When invoking close() on a ResponseInputStream, you would expect it to close within a few seconds at most.

Current Behavior

The stream does not close until it has internally read the remaining data to completion (this seems related to internal connection recycling).

Steps to Reproduce

Possible Solution

Context

I was trying to create a test case for a wrapper I created for handling network-related issues (e.g. connection reset), where I simply track the position in the stream and issue a range request when the connection fails, to continue the stream transparently. When testing it by creating a wrapper around the ResponseInputStream that randomly throws an IOException, I noticed the extreme delay when closing the ResponseInputStream.
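The resuming wrapper described above can be sketched generically. This is a hypothetical, self-contained sketch: RangedOpener and ResumingInputStream are made-up names, and the opener stands in for a ranged S3 getObject call (e.g. range("bytes=" + offset + "-")); nothing here is actual SDK API.

```java
import java.io.IOException;
import java.io.InputStream;

// Stand-in for a ranged S3 getObject call, e.g.
// offset -> client.getObject(c -> c.bucket(b).key(k).range("bytes=" + offset + "-"))
// (hypothetical; not SDK API).
interface RangedOpener {
    InputStream open(long offset) throws IOException;
}

class ResumingInputStream extends InputStream {
    private final RangedOpener opener;
    private InputStream delegate;
    private long position; // bytes successfully delivered so far

    ResumingInputStream(RangedOpener opener) throws IOException {
        this.opener = opener;
        this.delegate = opener.open(0);
    }

    @Override
    public int read() throws IOException {
        try {
            int b = delegate.read();
            if (b >= 0) {
                position++;
            }
            return b;
        } catch (IOException e) {
            // Connection failed mid-stream: drop the broken stream and
            // continue transparently with a fresh range request. (With the
            // real SDK one would abort() here rather than close() -- which
            // is exactly the problem this issue describes.)
            try {
                delegate.close();
            } catch (IOException ignored) {
                // the broken stream may refuse to close cleanly
            }
            delegate = opener.open(position);
            return read();
        }
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}
```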

Workaround

var r = client.getObject(c -> c.bucket(bucket).key(key).versionId(versionId));
r.read(new byte[1024]); // just to make sure we actually opened the resource and did something
r.abort(); // <<< workaround, but abort() is not part of the InputStream contract
r.close(); // now the connection closes almost immediately

Your Environment

debora-ito commented 3 years ago

It looks like abort is what you are looking for, according to the ResponseInputStream documentation:

If it is not desired to read remaining data from the stream, you can explicitly abort the connection via abort(). Note that this will close the underlying connection and require establishing a new HTTP connection, which may outweigh the cost of reading the additional data.

I'm not sure what the expected behavior of close is after a read has started; I'll do some research.

JstDust commented 3 years ago

@debora-ito It's fine that there is an abort() call for this, but it simply does not work well when you use wrappers to actually read the contents of a stream, e.g. a simple InputStreamReader, GZIPInputStream, etc.

Those are often returned directly, like below; at that point abort() and the whole S3 backend are lost to the caller.

public Reader readText(String bucket, String key) {
  return new InputStreamReader(client.getObject(.....), StandardCharsets.UTF_8);
}

And the only thing that remains of the contract is the close() operation, which is very clearly documented in the Javadocs: https://docs.oracle.com/javase/8/docs/api/java/io/InputStream.html#close-- https://docs.oracle.com/javase/8/docs/api/java/io/InputStreamReader.html#close--

"Closes the stream and releases any system resources associated with it."

I do not consider it an issue that the connection is returned to the pool on EOF. But if for some reason the stream gets closed early, continuing to read inside the close() call is effectively not part of the contract, and it does cause long waits when something fails, e.g. when using a try-with-resources statement like below.

try (BufferedReader reader = (...s3call...)) {
  String line;
  while ((line = reader.readLine()) != null) {
    if (line.length() < 10) throw new IllegalStateException("Line too short");
    ....some code...
  }
} // now have fun waiting if that happens on the first line of a multi-gigabyte file

So I consider it a violation of the InputStream#close() contract that getObject returns an InputStream which continues reading on close. Also, the gains of this aggressive pool return seem minimal at best, especially on an early close. Most of the time it is probably more IO- and CPU-friendly to simply close the underlying connection when close() is invoked and the connection cannot be returned to the pool immediately.

shearn89 commented 2 years ago

This appears to happen with other streams as well. I'm reading from a Kinesis Video stream that returns a GetMediaResponse for the important bit (reading the data). When I'm done, for example because I've hit my stopFragmentNumber, I stop processing. When I then call close() on the ResponseInputStream<GetMediaResponse> object, it takes about 3 seconds to close.

shearn89 commented 2 years ago

I can use the same workaround by doing:

ResponseInputStream<GetMediaResponse> inputStream =
    (ResponseInputStream<GetMediaResponse>) kvsStreamTrackObject.getInputStream();
inputStream.abort();

...although I then also have to call .release(), or else it throws an error on close.

hohwille commented 6 months ago

I fully agree with @JstDust. As a temporary workaround, one has to create one's own S3InputStream class that delegates all method calls to a wrapped ResponseInputStream, but in its close() method first calls abort() and then close().
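The delegating wrapper described above can be sketched as follows. This is a hedged, self-contained sketch: the Abortable interface here is our own stand-in for the SDK's abort() capability (in real code the delegate would be the ResponseInputStream itself), and AbortOnCloseInputStream is a hypothetical name, not SDK API.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Stand-in for the SDK's abort() capability; with the real SDK the
// ResponseInputStream itself plays this role (hypothetical interface).
interface Abortable {
    void abort();
}

// Delegates all InputStream calls via FilterInputStream, but aborts before
// closing, so an early close() drops the connection instead of draining
// the remaining body.
class AbortOnCloseInputStream extends FilterInputStream {
    private final Abortable abortable;

    AbortOnCloseInputStream(InputStream delegate, Abortable abortable) {
        super(delegate);
        this.abortable = abortable;
    }

    @Override
    public void close() throws IOException {
        abortable.abort(); // drop the connection rather than reading to EOF
        super.close();
    }
}
```

Such a wrapper can then be handed to InputStreamReader, GZIPInputStream, etc., and a plain close() from the caller behaves as the InputStream contract suggests.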

@debora-ito IMHO there is not much to clarify, as this is just the contract every Java developer would expect. I can imagine that it is a huge effort to get the S3 clients for all the different programming languages perfect, and it may also be tricky to consider changing existing behavior, but IMHO this is the way to go. Java developers just want a seamless API experience for S3. After all, S3 stands for "Simple Storage Service", so developers would love to have it simple, instead of figuring all these things out themselves and adding workarounds for gaps in the SDK, such as the long-standing request to put via an OutputStream.

Could we expect things like this to be solved out of the box by the SDK of the #1 hyperscaler? From my PoV this would be lovely...