If you are looking for the best JDBC driver for Amazon Athena, have a look at Amazon Athena JDBC driver v3. The driver in this repo is no longer maintained, but one of the authors was the lead on the new official driver.
If you're reading this on GitHub, please note that this is the readme for the development version and that some features described here might not yet have been released. You can find the readme for a specific version via the release tags (here is an example).
This is a JDBC driver for AWS Athena.
Using Maven you can add this to your dependencies:
<dependency>
<groupId>io.burt</groupId>
<artifactId>athena-jdbc</artifactId>
<version>${athena-jdbc.version}</version>
</dependency>
Check the releases page for the value of ${athena-jdbc.version}
.
The driver requires Java 8 or later.
The only dependency is AWS SDK, specifically software.amazon.awssdk:athena
. See pom.xml
for more details.
The driver registers itself with java.sql.DriverManager
automatically, and accepts JDBC URLs with the subprotocol athena
. The subname is the default database name for the connection, and is optional. The simplest possible JDBC URL is "jdbc:athena"
, which is equivalent to "jdbc:athena:default"
. To use a database called "test" as the default database for the connection use the URL "jdbc:athena:test"
.
When using a JDBC URL to connect you shouldn't need the driver class name, but if you for some reason need them, driver class name is io.burt.athena.AthenaDriver
and the data source class name is io.burt.athena.AthenaDataSource
.
There are three connection properties:
region
: the AWS region to connecto to. The AWS SDK will automatically pick up the value of the AWS_REGION
environment variable if it is set.outputLocation
: the location in Amazon S3 where the query results will be stored. This property is required unless workGroup
is set to a work group that has a configured output location. See the API docs for more information.workGroup
: the name of the work group in which to run the query. See the API docs for more information.These properties are the same for both the java.sql.DriverManager
and javax.sql.DataSource
APIs.
DriverManager
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;
Properties properties = new Properties();
properties.setProperty("region", "us-east-1"); // or set AWS_REGION
properties.setProperty("outputLocation", "s3://some-bucket/and/a/prefix");
try (Connection connection = DriverManager.getConnection("jdbc:athena:default", properties)) {
// use connection
}
Currently the java.sql.Connection
returned by the driver is thread safe and doesn't need to be closed until the application stops, if at all.
DataSource
This is the least amount of code needed to obtain a connection, and you also get type safe configuration:
import io.burt.athena.AthenaDataSource;
import java.sql.Connection;
import javax.sql.DataSource;
AthenaDataSource dataSource = new AthenaDataSource();
dataSource.setRegion("us-east-1");
dataSource.setOutputLocation("s3://some-bucket/and/a/prefix")
try (Connection connection = dataSource.getConnection()) {
}
This example uses HikariCP, but of course other connection pools work too. The driver supports both JDBC URL and javax.sql.DataSource
based configurations, and both use the same property names.
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import javax.sql.DataSource;
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:athena:default"); // or setDataSourceClassName("io.burt.athena.AthenaDataSource")
config.addDataSourceProperty("region", "us-east-1"); // or set AWS_REGION
config.addDataSourceProperty("outputLocation", "s3://some-bucket/and/a/prefix");
DataSource dataSource = new HikariDataSource(config);
try (Connection connection = dataSource.getConnection()) {
// use connection
}
Once you have a java.sql.Connection
instance you can use it as you would one from any other JDBC driver.
import java.sql.ResultSet;
import java.sql.Statement;
try (
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery("SELECT 'Hello from Athena'")
) {
resultSet.next();
System.out.println(resultSet.getString(1));
}
ResultSet
import io.burt.athena.AthenaResultSet;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;
try (
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery("SELECT 'Hello from Athena'")
) {
ResultSetMetaData metaData = resultSet.getMetaData();
if (metaData.isWrapperFor(AthenaResultSetMetaData.class)) {
AthenaResultSetMetaData unwrappedMetaData = metaData.unwrap(AthenaResultSetMetaData.class);
String queryExecutionId = unwrappedMetaData.getQueryExecutionId();
System.out.println(queryExecutionId);
}
}
By setting a client request token on a query execution you can make Athena reuse a previous result set if the exact same query has already been run. If you run the same query multiple times this can save money and improve performance.
import io.burt.athena.AthenaStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.commons.codec.binary.Hex;
try (Statement statement = connection.createStatement()) {
if (statement.isWrapperFor(AthenaStatement.class)) {
AthenaStatement unwrappedStatement = statement.unwrap(AthenaStatement.class);
unwrappedStatement.setClientRequestTokenProvider(sql -> Optional.ofNullable(Hex.encodeHexString(sql)));
}
for (int i = 0; i < 10; i++) {
try (ResultSet resultSet = statement.executeQuery("SELECT 'Hello from Athena'")) {
resultSet.next();
System.err.println(resultSet.getString(1));
}
}
}
}
The client request token provider is a Function<String, Optional<String>>
, and receives the SQL that will be executed, and should return the token to use for the request, wrapped in an java.util.Optional
.
There is already an existing JDBC driver for Athena, so why another one?
There is nothing fundamentally wrong with the existing driver. It was developed by Simba, who have a lot of experience building JDBC drivers, but who do not have much experience with Athena. Just providing a JDBC layer for running queries in Athena is not enough.
These are some of the issues we have found with the existing driver:
GetQueryExecution
in a loop with a 5ms delay, which caused a lot of throttling errors if you had a few applications using the driver. In later versions it uses a backoff strategy of sorts; it starts at 5ms and increases the delay to a configurable max value. The increase is by a factor of 20, though, which means that with the defaults it will start at 5ms and then go to 100ms, the default max, which is more or less equivalent to a fixed delay of 100ms. If you give it a higher max, say 2000ms, there will only be one more call before the max delay is reached.SQLException
, of course, but the original exceptions are usually available as the exception cause. However, since the AWS SDK is shaded these exceptions have internal class names instead of the well-known AWS SDK names, making it hard to deal with unwrapping exceptions and deal with the underlying cause.UseAwsLogger
meant that it would use the AWS SDK logging facilities, but instead it means that it will log all the AWS SDK calls it makes to its own logger. Also, to add insult to injury, because it shades the AWS SDK configuring logging for that becomes much harder. The driver does not document how the AWS SDK has been shaded, so you'll have to figure out yourself what the class names are. You'll also get a shaded log4j bundled in the JAR, which means makes the situation even more of a mess since you can't override it with, for example, the SLF4J log4j bridge.This alternative Athena JDBC driver aims to fix these issues, and hopefully even more things that we haven't even thought of yet.
If you get a java.lang.UnsupportedOperationException
it means that you've found a feature that hasn't been implemented yet. Some of these will eventually get implemented, but some may not make sense for Athena. If you rely on one of these open an issue and describe your use case.
If you get a java.sql.SQLFeatureNotSupportedException
exception it means that the feature or operation either is not supported by Athena, or that it's unclear how it could be supported. Feel free to open pull requests that adds support for it if you have an idea for how it could be done.
Before you open an issue, check the API documentation and tests, there might be clues and hints there. Looking at the commit history of a method can also lead to insights.
These are some specific limitations and known issues that you might run into:
ResultSet#getArray
always returns string arrays, because Athena does not return any type information beyond "array"
. It also does it's best splitting the array, but there is no way to tell the arrays ["hello", "world"]
and ["hello, world"]
apart. We recommend always casting to JSON for complex types, extract them using ResultSet#getString
and parse them in your own code.ResultSet#getString
and parse them in your own code.Connection#prepareStatement
is not supported. The official Athena driver tries to support prepared statements and interpolation on the client side (it's unclear if it even works), but it's not the goal of this alternative driver to do that. Athena itself does not support prepared statements or interpolation, and there is no performance gain to be had from preparing statements.GetQueryResult
and undocumented GetQueryResultsStream
API calls. This is slower for small, but significantly faster for large result sets. In the future an optimized implementation, or an implementation that uses the fastest mechanism for a given result will be used to ensure good performance for all result set sizes.DatabaseMetadata
JDBC API, which is used by some UI tools to list databases, tables, etc. This API could be implemented, but represents a significant effort. If you want to contribute to the driver this may be a good place to start.See CONTRIBUTING.md.
© 2019 Theo Hultberg and contributors, see LICENSE.txt (BSD 3-Clause).