Tomme / dbt-athena

The athena adapter plugin for dbt (https://getdbt.com)
Apache License 2.0
140 stars 79 forks source link

[BUGFIX] Override list_relations_without_caching using boto3 #105

Closed danielcmessias closed 1 year ago

danielcmessias commented 2 years ago

Relates to: #86

Whenever you do a dbt run, dbt has to do an initial 'data gathering' query to the information schema to get a list of all relations within the database you are working on. This query is found in the macro athena__list_relations_without_caching. Similar to #86, this query can hang indefinitely or fail if the database has 100+ tables.

Rather than trying to adjust the query to do batching, which is: a) trickier because we don't know the list of table names upfront (we we did with #86 since it was just the list of sources b) slow as hell, when you get up to 99 tables it takes a long time to make that information_schema query

...I've overridden the list_relations_without_caching function, changing its functionality to use boto3 to get the necessary data instead.

The only issue is to do with glue catalogs. To get able to get the list of relations through boto3 this way your tables have to be in Glue (obviously) - but in Athena you can of course have external Data Sources (such as federated queries or a cross-account glue catalog).

What I have at the moment is a check to see if the Data Catalog is a GLUE type (i.e. cross account access). If so, we have to pass the CatalogId. If not (e.g. a federated data source?) I'm just passing back to the original SQL. I think this is a suitable solution for now.

On another note, we could change the dbt docs code to use this approach too and it will be a lot faster than querying the information schema - although the same issue with awsgluecatalog applies.

This change has been running internally with us for a quite a while and seems good 🤞

Tomme commented 1 year ago

Thanks!