awslabs / aws-athena-query-federation

The Amazon Athena Query Federation SDK allows you to customize Amazon Athena with your own data sources and code.
Apache License 2.0

[FEATURE] Create a columnar variant of BlockWriter/BlockSpiller #1

Open avirtuos opened 4 years ago

avirtuos commented 4 years ago

Currently, BlockWriter and BlockSpiller encourage a row-wise approach to writing results. These interfaces are often viewed as simpler than their columnar equivalents would be. Even though many of the systems we've integrated with using this SDK do not themselves support columnar access patterns, there is value in offering a variant of these mechanisms that provides the skeleton for columnar writing of results.

The current SDK versions take the approach that experts can drop into 'native' Apache Arrow mode and simply not use these abstractions. We'd like to stick with this approach of making common things easy while still offering a 'power user' mode, but we'd also like to make it easier for customers that want a more columnar experience to get one.

A key goal of this new facility would be to eliminate the performance penalty of the per-cell field vector lookups and type-conversion object overhead that the current row-wise convenience facades introduce. Depending on the source system being integrated with, avoiding that overhead improved cells/second throughput by 20%-30% in our testing. The improvement is more dramatic when there is limited parallelism / pipelining available to hide the inefficiency.
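To make the gap concrete, here is a minimal sketch of what a columnar write path could look like against raw Arrow vectors. The ColumnWriter interface is hypothetical (not part of the SDK), and the data is stand-in; the point is that the vector is resolved once per column and written with primitive calls, rather than paying a lookup and a boxed-Object conversion per cell as the row-wise facade does today:

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class ColumnarWriteSketch {

    /** Hypothetical columnar-writer contract; not part of the SDK today. */
    interface ColumnWriter {
        void writeColumn(IntVector vector);
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 5};   // stand-in for a column read from the source

        // Resolve the vector once per column and write with primitive calls,
        // avoiding the per-cell lookup/boxing that a row-wise
        // BlockUtils.setValue(...)-style facade incurs.
        ColumnWriter writer = vector -> {
            vector.allocateNew(values.length);
            for (int i = 0; i < values.length; i++) {
                vector.set(i, values[i]);  // primitive write, no Object boxing
            }
            vector.setValueCount(values.length);
        };

        try (RootAllocator allocator = new RootAllocator();
             IntVector col = new IntVector("int_col", allocator)) {
            writer.writeColumn(col);
            System.out.println(col);
        }
    }
}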

avirtuos commented 4 years ago

This has been started by the UDF workstream using a projector and writer interface which can be found here: https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-federation-sdk/src/main/java/com/amazonaws/athena/connector/lambda/data

Whoever picks up this task should start by reviewing this work as well as the existing BlockUtils usage.
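For orientation, the extractor-based pattern from that package looks roughly like the sketch below. The SourceRow type and the "year" field are illustrative, and anyone picking this up should check the linked package for the current signatures rather than treating this as authoritative:

import org.apache.arrow.vector.holders.NullableIntHolder;

import com.amazonaws.athena.connector.lambda.data.writers.GeneratedRowWriter;
import com.amazonaws.athena.connector.lambda.data.writers.extractors.IntExtractor;
import com.amazonaws.athena.connector.lambda.domain.predicate.Constraints;

public class RowWriterSketch {

    /** Hypothetical source record; stands in for a row from the backend. */
    static class SourceRow {
        final int year;
        SourceRow(int year) { this.year = year; }
    }

    static GeneratedRowWriter buildWriter(Constraints constraints) {
        // The builder compiles per-field write logic once per request,
        // instead of paying per-cell type dispatch on every value.
        return GeneratedRowWriter.newBuilder(constraints)
                .withExtractor("year", (IntExtractor) (Object context, NullableIntHolder dst) -> {
                    dst.isSet = 1;
                    dst.value = ((SourceRow) context).year;
                })
                .build();
    }

    // In a record handler's read loop the writer is then driven row by row,
    // e.g. (sketch): spiller.writeRows((block, rowNum) ->
    //         rowWriter.writeRow(block, rowNum, sourceRow) ? 1 : 0);
}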

avirtuos commented 4 years ago

A good way to test / iterate on perf changes is to use the Example connector inside the SDK itself. This connector can generate a partitioned and split table with varying column types. You can then set your Lambda concurrency to 1 using reserved concurrency and get a pretty accurate measure of what a single Lambda invocation is capable of. Right now we hover around 150 Mbps (about 18 MB/s) with the below query and a Lambda concurrency of 1. This is an imperfect measure since it includes Presto time as well. You can refine how you measure, but this isn't a bad way to do it since it includes the client experience.

select count(bigint), count(smallint), count(col4), count(varchar), count(int), count(float4), count(decimalLong) from "lambda:generator".custom_source.fake_table
where year = 2019 and month = 1 and day = 2
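If it helps, pinning the function to a single concurrent execution programmatically looks roughly like this with the AWS SDK for Java v2 (the function name "generator" is assumed from the query above; the console or CLI works just as well):

import software.amazon.awssdk.services.lambda.LambdaClient;

public class PinConcurrency {
    public static void main(String[] args) {
        // Reserve exactly one concurrent execution so the benchmark
        // measures a single Lambda invocation's throughput.
        try (LambdaClient lambda = LambdaClient.create()) {
            lambda.putFunctionConcurrency(req -> req
                    .functionName("generator")   // assumed function name
                    .reservedConcurrentExecutions(1));
        }
    }
}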

I also tweaked the settings for this Lambda, as it supports some configs to help with perf testing. I set the NUM_ROWS_PER_SPLIT env variable to 8000000 so that each split is more meaningful in size and carries less general overhead. This more closely matches a source that supports no parallelism and runs a large query. Alternatively, you can have the connector generate just a single split by adding a config for that.
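A sketch of setting that env variable with the same AWS SDK for Java v2 client (again assuming the "generator" function name; note that this call replaces the function's whole environment, so merge in any existing variables first):

import java.util.Map;

import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.Environment;

public class SetSplitSize {
    public static void main(String[] args) {
        // Make each split 8M rows so per-split overhead is amortized,
        // approximating a source with no parallelism running a large scan.
        try (LambdaClient lambda = LambdaClient.create()) {
            lambda.updateFunctionConfiguration(req -> req
                    .functionName("generator")   // assumed function name
                    .environment(Environment.builder()
                            .variables(Map.of("NUM_ROWS_PER_SPLIT", "8000000"))
                            .build()));
        }
    }
}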

macohen commented 5 months ago

@atennak1 will you be completing this issue?

macohen commented 4 months ago

I'm interested to understand whether this issue is still needed. If I understand it correctly, at the time it was created, all data was written to the spill buckets row by row. That could be fine for a row-based data store or a row-based destination file format, but for columnar stores, less processing would be needed to go from, say, DynamoDB or Cassandra to Parquet with the proposed change. Is that right?

I'm also curious, 4.5 years later and regardless of whether I understand this correctly, whether we should pursue this further or close it. I'm also going to unassign @atennak1.

cc: @burhan94, @chngpe