Tomme / dbt-athena

The athena adapter plugin for dbt (https://getdbt.com)
Apache License 2.0
142 stars 79 forks source link

dbt generate docs hangs when database has > 100 tables #79

Closed JustasCe closed 2 years ago

JustasCe commented 2 years ago

While adopting dbt-athena we noticed that for some of our databases dbt docs generate hangs indefinitely.

After some investigation we found that it happens when using the athena__get_catalog macro for a db with more than 100 tables. This is not a dbt-athena issue itself, rather an AWS Athena internal issues. We ended up talking with AWS support and they mentioned that an issue most likely related to this is something they are working on, but have no timeline for when it's going to be fixed.

In the mean time the workaround that we were advised to use is to also filter by tables. After some manual testing we found that splitting the catalog query to filter by databases AND tables (in batches of 100) solves the issue. Sample of the query:

select *
from (
        (
            with tables as (
                ...
            ),
            columns as (
                ...
            ),
            catalog as (
                ...
            )
            select *
            from catalog
            where lower("table_schema") = lower('very_large_database')
                and (
                    "table_name" = lower('table_1')
                    or "table_name" = lower('table_2')
                                         ....
                    or "table_name" = lower('table_100')
                    )
            union all
            select *
            from catalog
            where lower("table_schema") = lower('very_large_database')
                and (
                    "table_name" = lower('table_101')
                    or "table_name" = lower('table_102')
                                         ....
                    or "table_name" = lower('table_200')
                    )
.....
    )

These tables can be retrieved from the manifest as part of the adapter get_catalog().

roy-ht commented 2 years ago

Hi @JustasCe ,

74 also solves this problem.

I override _get_one_catalog method which internally calles Glue API directly (If your env have related permissions).