dbt-labs / dbt-athena

The athena adapter plugin for dbt (https://getdbt.com)
https://dbt-athena.github.io
Apache License 2.0
228 stars 100 forks source link

[Feature] Support Query Result Cache #682

Closed sanga8 closed 4 months ago

sanga8 commented 4 months ago

Is this your first time submitting a feature request?

Describe the feature

Support the Athena query result cache feature: https://docs.aws.amazon.com/athena/latest/ug/reusing-query-results.html

When you re-run a query in Athena, you can optionally choose to reuse the last stored query result. This option can increase performance and reduce costs in terms of the number of bytes scanned. Reusing query results is useful if, for example, you know that the results will not change within a given time frame. You can specify a maximum age for reusing query results. Athena uses the stored result as long as it is not older than the age that you specify.

Feature could be implemented in the profile configuration:

result_reuse_enable=True,
result_reuse_minutes=60

In pyathena these params can be passed to the connection handler.

Describe alternatives you've considered

No response

Who will this benefit?

No response

Are you interested in contributing this feature?

No response

Anything else?

No response

iconara commented 4 months ago

We discussed this on Slack, but for visibility: query result reuse (QRR) is a feature that reuses the result set that is written to S3 by SELECT queries. CTAS and other data transformation statements are not supported by QRR since they don't produce result sets. Even the SELECT part of INSERT INTO SELECT … does not write a result set – it is fed directly into the INSERT operation. This means that there is nothing to reuse.

The Athena documentation about QRR doesn't explicitly say that only SELECT statements are supported, it's only implied. You have to know that only SELECT statements produce query results to understand that CTAS, UNLOAD, INSERT INTO, MERGE, etc. are not supported. I've proposed some new wording to make this more clear.

sanga8 commented 4 months ago

Thank you for clarification @iconara. This feature fit the dbt mindset of writing models as SELECT queries and should be useful.

iconara commented 4 months ago

These are the dangers of leaky abstractions. Behind the scenes there are very few SELECT queries, everything gets rewritten by dbt.

nicor88 commented 4 months ago

I agree with what it's being raised by @iconara , there are different edge cases to cover in order to implement "query" caching for CTAs/Insert/Merge statements that might lead to awful behaviours.

@sanga8 could you provide why would you need such feature on the writing/transformation layer?#

Query reuse make tons of sense for SELECT statements, for example to speed up serving query, but as you use dbt as a transformation layer, having queries running a bit longer in most of the case is just fine.

sanga8 commented 4 months ago

The benefits I saw would be to use the feature on intermediate models that are re-used across final models to improve performances.

nicor88 commented 4 months ago

The benefits I saw would be to use the feature on intermediate models that are re-used across final models to improve performances.

This is easily achievable persisting the model itself as a table (not as a view or ephemeral) and then referencing to it from the other models. It's something that the team that I'm working with does all the time. Anyhow such use cases doesn't motivate IMHO query caching, but using data persistence of then model is the way go to. This approach works the same way with other query engines/systems, as I've being using it for example in Redshift/Snowflake.

sanga8 commented 4 months ago

You suggest to persist the data which is perfectly fine and the team I'm working with is also doing it. I believed this feature from Athena would store the data in-memory for more efficiency.

nicor88 commented 4 months ago

I don't think that query result feature of Athena will help you here, as stated above query result doesn't work for CTAs/Insert/Merge statements - therefore I'm closing this issue.

Another approach that could help you is the usage of ephemeral models.

If you have the same ephemeral model that is reused multiple time in the same caller model, athena reuse the result of the ephemeral model (simply a CTE), and you benefit from query result re-usage, that works only in specific situations.