carlspring / s3fs-nio

A Java (NIO2) FileSystem Provider for Amazon AWS S3.
https://s3fs-nio.carlspring.org/

Analyze options to use SdkHttpClient implementations #101

Open ptirador opened 3 years ago

ptirador commented 3 years ago

Task Description

The S3Factory class builds new Amazon S3 client instances, and it currently uses the Apache HTTP client.

As specified in this Pull Request discussion, this locks customers in to the ApacheHttpClient, which adds a dependency they may not want. We need to provide an option for other SdkHttpClient implementations.

The UrlConnectionHttpClient is a fairly popular choice in Java-based Lambda functions, as it has a faster startup time and therefore less impact on cold starts.
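One way to offer such an option is to let a configuration property name the desired implementation and to check whether that implementation is actually on the classpath, since users may have excluded it. The sketch below is only an illustration under stated assumptions: the `HttpClientSelector` class, the `s3fs.http.client` property and the short type names are hypothetical, and only the fully qualified class names come from the AWS SDK for Java 2.x. It uses the JDK alone, so it runs without the SDK.

```java
import java.util.Locale;

// Hypothetical helper, not part of s3fs-nio or the AWS SDK.
final class HttpClientSelector {

    private HttpClientSelector() {
    }

    // Maps a short, user-facing type name (an assumption of this sketch)
    // to the SDK class that implements the HTTP client.
    static String resolveClassName(String type) {
        if (type == null) {
            throw new IllegalArgumentException("HTTP client type must not be null");
        }
        switch (type.trim().toLowerCase(Locale.ROOT)) {
            case "apache":
                return "software.amazon.awssdk.http.apache.ApacheHttpClient";
            case "url-connection":
                return "software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient";
            case "netty":
                return "software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient";
            default:
                throw new IllegalArgumentException("Unknown HTTP client type: " + type);
        }
    }

    // Returns true if the class can be located without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, HttpClientSelector.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "s3fs.http.client" is a made-up property name used only for this example.
        String type = System.getProperty("s3fs.http.client", "apache");
        String className = resolveClassName(type);
        System.out.println(type + " -> " + className
                + " (on classpath: " + isOnClasspath(className) + ")");
    }
}
```

With the SDK available, the resolved choice could then be handed to the client builder, for example through `S3Client.builder().httpClientBuilder(...)` for the synchronous clients. The Netty client is asynchronous only, so it would apply to `S3AsyncClient` instead.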

Tasks

The following tasks will need to be carried out:

Task Relationships

This task:

Useful Links

Help

ptirador commented 3 years ago

Pros

Use the built-in HttpUrlConnection client to reduce instantiation time

The AWS Java SDK 2.x includes a pluggable HTTP layer that allows customers to switch to different HTTP implementations. Three HTTP clients are supported out-of-the-box:

With the default configuration, Apache HTTP client and Netty HTTP client are used for synchronous clients and asynchronous clients respectively. They are powerful HTTP clients with more features. However, they come at the cost of higher instantiation time.

On the other hand, the JDK built-in HTTPUrlConnection library:

Hence, using UrlConnectionHttpClient is recommended when configuring the SDK client. Note that it only supports synchronous API calls. If you'd like to see support for asynchronous SDK clients with the JDK 11 built-in HTTP client, please upvote this GitHub issue.

Exclude unused SDK HTTP dependencies

The SDK includes the Apache HTTP client and Netty HTTP client dependencies by default. If startup time is important to your application and you do not need both implementations, it's recommended to exclude the unused SDK HTTP dependencies to minimize the deployment package size. Below is a sample Maven POM file for an application that only uses url-connection-client and excludes netty-nio-client and apache-client.

    <dependencies>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>s3</artifactId>
            <exclusions>
                <exclusion>
                    <groupId>software.amazon.awssdk</groupId>
                    <artifactId>netty-nio-client</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>software.amazon.awssdk</groupId>
                    <artifactId>apache-client</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>url-connection-client</artifactId>
        </dependency>
    </dependencies>

Cons

Inconveniences of using the built-in HttpUrlConnection client

As the JDK built-in HttpUrlConnection client is more lightweight, its configuration is simpler. Compared to the Apache HTTP client, for example, you cannot configure:

FYI @carlspring @steve-todorov

carlspring commented 3 years ago

Hi @ptirador ,

Thanks for your investigation!

What do you mean by "deployment package"?

In my opinion, we need to have support for both synchronous and asynchronous requests. If we need the Apache + Netty dependencies for this, then so be it. There are many other things that you can't do with the HttpUrlConnection client, like setting up connection pools and so on (if I recall correctly).

How much of a difference is there in terms of instantiation time?

And the other question -- are we using async requests for anything right now? What use cases would we have for this?

My only concern is that, at the moment, we claim to support JDK 11 (which is, of course, indeed the case), and whatever we decide must not break our JDK 11 support.

What is your advice and personal preference?

steve-todorov commented 3 years ago

Thanks @ptirador for raising this issue and making the initial research!

How did you come to the conclusion that using the built-in HttpUrlConnection client is faster? Did you run a JMH benchmark that backs this statement with data?

Honestly, if I had to pick one of the three options above, I'd go with netty-nio-client and async connections as the default option. In my experience, using Netty with a proper async implementation results in much better throughput and overall performance than a blocking / sync approach. Also, if you're already using Cassandra or something similar, chances are high that you are already using Netty.

If you are up for the task, we can create a JMH benchmark which tests the different implementations so we can make a decision based on the data.

ptirador commented 3 years ago

Hi @carlspring @steve-todorov,

The conclusions that I wrote are based on this article, which talks about these instantiation times but without providing any benchmark example. We can create this JMH benchmark to test them.
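Until a proper benchmark exists, a crude first impression of instantiation cost can be had by timing a single construction with `System.nanoTime()`. The sketch below is only a rough illustration and no substitute for a real benchmark harness: the `InstantiationTimer` class is hypothetical, and the `StringBuilder` workload is a placeholder for whichever client factory call is being compared.

```java
import java.util.function.Supplier;

// Hypothetical helper for a rough, single-shot timing; not a real benchmark.
final class InstantiationTimer {

    private InstantiationTimer() {
    }

    // Times one invocation of the factory and returns the elapsed nanoseconds.
    static <T> long timeOnce(Supplier<T> factory) {
        long start = System.nanoTime();
        T instance = factory.get();
        long elapsed = System.nanoTime() - start;
        // Use the result so the construction is not trivially dead code.
        if (instance == null) {
            throw new IllegalStateException("factory returned null");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Placeholder workload. With the SDK on the classpath, the supplier
        // could instead create one of the HTTP clients under comparison.
        long first = timeOnce(StringBuilder::new);
        long second = timeOnce(StringBuilder::new);
        System.out.println("first call:  " + first + " ns");
        System.out.println("second call: " + second + " ns");
    }
}
```

The first call in a fresh JVM includes class loading, which is the part that matters for cold starts, so each candidate would need to be measured in its own JVM run for the numbers to be comparable.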

I would also go with Netty and async connections, especially because of the overall performance boost they provide. Also, a few months ago we switched the NIO implementation to use AsynchronousFileChannel instead of FileChannel, so I think it could be the best way to go.

carlspring commented 3 years ago

Hi @ptirador ,

I believe you and @steve-todorov are right -- we should use Netty, since indeed we did switch to AsynchronousFileChannel, as you've just reminded me.

How much of an effort will this task be?