Closed sanga8 closed 4 months ago
We discussed this on Slack, but for visibility: query result reuse (QRR) is a feature that reuses the result set that is written to S3 by SELECT
queries. CTAS and other data transformation statements are not supported by QRR since they don't produce result sets. Even the SELECT
part of INSERT INTO SELECT …
does not write a result set – it is fed directly into the INSERT
operation. This means that there is nothing to reuse.
The Athena documentation about QRR doesn't explicitly say that only SELECT
statements are supported, it's only implied. You have to know that only SELECT
statements produce query results to understand that CTAS, UNLOAD
, INSERT INTO
, MERGE
, etc. are not supported. I've proposed some new wording to make this more clear.
Thank you for clarification @iconara. This feature fit the dbt mindset of writing models as SELECT
queries and should be useful.
These are the dangers of leaky abstractions. Behind the scenes there are very few SELECT
queries, everything gets rewritten by dbt.
I agree with what it's being raised by @iconara , there are different edge cases to cover in order to implement "query" caching for CTAs/Insert/Merge statements that might lead to awful behaviours.
@sanga8 could you provide why would you need such feature on the writing/transformation layer?#
Query reuse make tons of sense for SELECT statements, for example to speed up serving query, but as you use dbt as a transformation layer, having queries running a bit longer in most of the case is just fine.
The benefits I saw would be to use the feature on intermediate models that are re-used across final models to improve performances.
The benefits I saw would be to use the feature on intermediate models that are re-used across final models to improve performances.
This is easily achievable persisting the model itself as a table (not as a view or ephemeral) and then referencing to it from the other models. It's something that the team that I'm working with does all the time. Anyhow such use cases doesn't motivate IMHO query caching, but using data persistence of then model is the way go to. This approach works the same way with other query engines/systems, as I've being using it for example in Redshift/Snowflake.
You suggest to persist the data which is perfectly fine and the team I'm working with is also doing it. I believed this feature from Athena would store the data in-memory for more efficiency.
I don't think that query result feature of Athena will help you here, as stated above query result doesn't work for CTAs/Insert/Merge statements - therefore I'm closing this issue.
Another approach that could help you is the usage of ephemeral models.
If you have the same ephemeral model that is reused multiple time in the same caller model, athena reuse the result of the ephemeral model (simply a CTE), and you benefit from query result re-usage, that works only in specific situations.
Is this your first time submitting a feature request?
Describe the feature
Support the Athena query result cache feature: https://docs.aws.amazon.com/athena/latest/ug/reusing-query-results.html
When you re-run a query in Athena, you can optionally choose to reuse the last stored query result. This option can increase performance and reduce costs in terms of the number of bytes scanned. Reusing query results is useful if, for example, you know that the results will not change within a given time frame. You can specify a maximum age for reusing query results. Athena uses the stored result as long as it is not older than the age that you specify.
Feature could be implemented in the profile configuration:
In pyathena these params can be passed to the connection handler.
Describe alternatives you've considered
No response
Who will this benefit?
No response
Are you interested in contributing this feature?
No response
Anything else?
No response