dbt-labs / dbt-athena

The athena adapter plugin for dbt (https://getdbt.com)
https://dbt-athena.github.io
Apache License 2.0

[Bug] dbt-athena leaks memory via workgroups cache #676

Closed: namelessjon closed this issue 4 months ago

namelessjon commented 4 months ago

Is this a new bug in dbt-athena?

Current Behavior

dbt memory usage when using the dbt-athena adapter rises to very high levels, eventually plateauing at over 2 GB for the full project (~150 models, ~1.3k tests).

I believe the problematic line is https://github.com/dbt-athena/dbt-athena/blob/main/dbt/adapters/athena/impl.py#L220

Editing the code and trying different values of maxsize for the lru_cache results in memory usage (as measured by mprof) plateauing at different levels. I think what is happening is that the Athena client is being used as part of the cache key for the lru_cache, but the calling code creates a new client object for each call of the cached method. This means both that the cache is not doing what is expected (every call is a miss) and that we end up caching multiple copies of both the client and the response.
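For illustration, here is a minimal, self-contained sketch of that failure mode. The client class and response payload are made up; this is not the adapter's actual code:

```python
from functools import lru_cache


class FakeAthenaClient:
    """Stand-in for a boto3 Athena client; hashable by identity, like the real one."""

    def get_work_group(self, name):
        # Made-up payload, just big enough to make the retained memory visible.
        return {"WorkGroup": {"Name": name, "Configuration": "x" * 1_000_000}}


@lru_cache(maxsize=128)
def get_work_group_config(client, work_group):
    # Both arguments form the cache key, so a fresh client per call guarantees
    # a miss and keeps up to 128 (client, response) pairs alive in the cache.
    return client.get_work_group(work_group)


for _ in range(200):
    client = FakeAthenaClient()  # new object every call, as in the adapter's calling code
    get_work_group_config(client, "primary")

print(get_work_group_config.cache_info())
# CacheInfo(hits=0, misses=200, maxsize=128, currsize=128)
```

Because the key differs on every call, the cache never produces a hit, yet it keeps maxsize entries (clients plus responses) from being garbage collected.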

Expected Behavior

dbt maintains reasonable memory usage (sub-1 GB?)

Steps To Reproduce

  1. Get a large-ish dbt project
  2. Configure a dbt-athena connection profile that defines a workgroup. work_group: primary should work; it just has to be explicitly defined. (Unfortunately I can't test this, as my account at work doesn't allow running queries via the primary workgroup.)
  3. Run the dbt build under mprof (mprof run dbt build), or run dbt build and observe memory usage via top

Environment

- OS: macOS Sonoma & python:3.11-slim-bookworm Docker image
- Python: 3.11.7
- dbt: 1.8.2
- dbt-athena-community: 1.8.2

Additional Context

This is the memory usage graph generated with the @lru_cache at its default value of maxsize=128

[memory usage graph, maxsize=128]

After editing the code to set maxsize=64, memory usage decreases slightly and saturates sooner

[memory usage graph, maxsize=64]

With maxsize=4, memory usage saturates fairly quickly and is much lower

[memory usage graph, maxsize=4]

Also, I'm happy to implement a fix for this, but wanted to raise it first.

nicor88 commented 4 months ago

How about making the maxsize configurable, with a reasonable default value? 4 might be too low, but maybe something like 8 or 16? Also, I believe you see high memory consumption in your scenario because you have quite a few models and tests. I'm curious what your thread count is; there must be a correlation with that value as well. Anyhow, well spotted. Feel free to propose a change, and I will be more than happy to review it.

namelessjon commented 4 months ago

I'm not suggesting this should be the default, but for most configurations shouldn't a maxsize of just 1 be enough? Or is work_group also valid as a per-model configuration?

Those were done with a thread count of 4, but prod runs with a thread count of 8 and reached a very similar level (I have no nice mprof graph for that, though). I'm assuming the model load is that first quick rise up to 270 MB or so, but that is just a guess.

nicor88 commented 4 months ago

The idea of the workgroup configuration caching is to avoid making too many calls to the same endpoint per model.

I'm happy to have maxsize configurable as you suggested. maxsize == 1 doesn't help, because it means that for every model we make a call to get the workgroup configuration. I believe that setting maxsize to something like 8 or 16 by default might be ideal, and then we leave it to the end user to tweak this parameter based on their setup.
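A hypothetical sketch of what a configurable maxsize could look like; the environment variable name and the default of 16 are invented for illustration, not existing dbt-athena options:

```python
import os
from functools import lru_cache

# Hypothetical knob, not an existing dbt-athena setting; the real adapter would
# more likely read this from the connection profile or adapter configuration.
WORKGROUP_CACHE_MAXSIZE = int(os.getenv("DBT_ATHENA_WORKGROUP_CACHE_MAXSIZE", "16"))


@lru_cache(maxsize=WORKGROUP_CACHE_MAXSIZE)
def get_work_group_config(work_group: str) -> dict:
    # Placeholder body; in the adapter this is where the Athena API call would go.
    return {"work_group": work_group}
```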

namelessjon commented 4 months ago

What I meant was "There should only be one value for work_group in play for any given run of dbt". We might need to look up the config details a lot (once per model as you say), but we're looking up the same group each time, which is why there's a benefit to having a cache at all.
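For what it's worth, a minimal sketch of the shape such a fix could take, assuming the workgroup name (plus region) is enough to key the cache. This is illustrative rather than the adapter's actual code, though boto3's Athena client does provide get_work_group:

```python
from functools import lru_cache

import boto3


@lru_cache(maxsize=1)
def get_work_group_config(work_group: str, region_name: str) -> dict:
    # The cache key is just (work_group, region_name), so repeated lookups for
    # the same workgroup hit the cache and no client object is retained as a key.
    # Credentials handling is omitted for brevity.
    client = boto3.client("athena", region_name=region_name)
    return client.get_work_group(WorkGroup=work_group)["WorkGroup"]["Configuration"]
```

With the client out of the key, a maxsize of 1 covers the single-workgroup case described above; making maxsize configurable, as suggested earlier in the thread, would cover setups where more than one workgroup is in play.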