astronomer / astronomer-cosmos

Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code
https://astronomer.github.io/astronomer-cosmos/
Apache License 2.0

[Bug] Cosmos on MWAA is too resource heavy #1151

Open michal-mrazek opened 3 months ago

michal-mrazek commented 3 months ago

Astronomer Cosmos Version

Other Astronomer Cosmos version (please specify below)

If "Other Astronomer Cosmos version" selected, which one?

1.4.0

dbt-core version

1.8.2

Versions of dbt adapters

dbt-snowflake==1.8.2

LoadMode

CUSTOM

ExecutionMode

VIRTUALENV

InvocationMode

DBT_RUNNER

Airflow version

2.8.1

Operating System

Amazon Linux AMI

If you think it's a UI issue, what browsers are you seeing the problem on?

No response

Deployment

Amazon (AWS) MWAA

Deployment details

No response

What happened?

Hello! We are running Cosmos in an AWS MWAA instance with several dbt projects, and we are observing high CPU and memory usage when a Cosmos DAG starts.

We tested a dbt project with about 40 models to compare resource utilization between the BashOperator and Cosmos task groups (with max_active_tasks=10). On a small MWAA cluster, the BashOperator performed just fine. Cosmos, however, struggled, with CPU and memory peaking at 100% and random failures as a result.

So I wanted to ask: is there anything we are doing wrong, or is this expected behavior for Cosmos? In my mind, the work on the Airflow side should not be that heavy. We also tested switching to local execution mode but did not observe a significant difference.
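For context, a rough sketch of how such a setup is typically wired with Cosmos is below. The project path, profile name, DAG id, and group id are hypothetical; only the reported settings (LoadMode.CUSTOM, ExecutionMode.VIRTUALENV, max_active_tasks=10, dbt-snowflake==1.8.2) are taken from this issue.

    from datetime import datetime

    from airflow import DAG
    from cosmos import DbtTaskGroup, ExecutionConfig, ProfileConfig, ProjectConfig, RenderConfig
    from cosmos.constants import ExecutionMode, LoadMode

    profile_config = ProfileConfig(
        profile_name="my_profile",   # hypothetical
        target_name="dev",           # hypothetical
        profiles_yml_filepath="/usr/local/airflow/dags/dbt/my_project/profiles.yml",  # hypothetical path
    )

    with DAG(
        dag_id="dbt_cosmos_example",          # hypothetical
        start_date=datetime(2024, 1, 1),
        schedule=None,
        max_active_tasks=10,                  # the concurrency limit mentioned above
    ):
        dbt_models = DbtTaskGroup(
            group_id="dbt_models",            # hypothetical
            project_config=ProjectConfig("/usr/local/airflow/dags/dbt/my_project"),  # hypothetical path
            profile_config=profile_config,
            render_config=RenderConfig(load_method=LoadMode.CUSTOM),
            execution_config=ExecutionConfig(execution_mode=ExecutionMode.VIRTUALENV),
            # InvocationMode.DBT_RUNNER (as reported above) is configured via
            # ExecutionConfig(invocation_mode=...) on versions that support it.
            operator_args={"py_requirements": ["dbt-snowflake==1.8.2"]},  # installed into each task's virtualenv
        )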

Relevant log output

No response

How to reproduce

Run a dbt project with 40 models in a small MWAA cluster.

Anything else :)?

No response

Are you willing to submit a PR?

Contact Details

mmrazek@paylocity.com

dosubot[bot] commented 3 months ago

Hello, @michal-mrazek! I'm here to help you with any bugs, questions, or contributions you have. I'll assist you while we wait for a human maintainer to review your issue.

High CPU and memory usage when running a Cosmos DAG with dbt models on a small MWAA cluster can be expected behavior, especially if you are dealing with a large dbt project. The documentation notes that larger dbt projects might cause a "DagBag import timeout" error, which suggests that resource-intensive operations are anticipated. Increasing the core.dagbag_import_timeout value in the Airflow configuration is recommended to handle such cases [1].
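For reference, the timeout can be raised in airflow.cfg or via the equivalent environment variable (the 120-second value below is only an illustrative choice; on MWAA the same setting is applied as the core.dagbag_import_timeout configuration override):

    [core]
    dagbag_import_timeout = 120

    Or using an environment variable:

    AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=120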

Additionally, Cosmos task groups are inherently more resource-intensive compared to the BashOperator due to the complexity and additional functionalities provided by Cosmos. These include managing environment variables, handling multiple dbt-specific flags and options, and executing dbt commands with various configurations and caching mechanisms [2][3][4][5].

To optimize performance and reduce resource usage, consider the following:

  1. Partial Parsing: Ensure that you are leveraging dbt's partial parsing to speed up task execution and DAG parsing. This requires consistently using the same --vars, --target, and --profile values, the same profiles.yml, and the same environment variables [6].

  2. Profile Configuration: If you are using a Cosmos profile mapping, disable mocked profile mappings by setting render_config=RenderConfig(enable_mock_profile=False). Alternatively, declare your own profiles.yml file via ProfileConfig(profiles_yml_filepath=...) [6] (see the sketch after this list).

  3. Caching: After running a dbt command, Cosmos caches the most up-to-date partial_parse.msgpack file in the system temporary directory, which reduces the need for a full dbt project parse on each run. Caching is enabled by default but can be configured or disabled using the following settings [6]:

    [cosmos]
    cache_dir = path/to/docs/here  # to override default caching directory (by default, uses the system temporary directory)
    enable_cache = False  # to disable caching (enabled by default)

    Or using environment variables:

    AIRFLOW__COSMOS__CACHE_DIR="path/to/docs/here"  # to override default caching directory (by default, uses the system temporary directory)
    AIRFLOW__COSMOS__ENABLE_CACHE="False"  # to disable caching (enabled by default)
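As referenced in item 2 above, a minimal sketch of the two profile options, assuming a Snowflake setup; the profile name, target, connection id, schema, and path are hypothetical:

    from cosmos import ProfileConfig, RenderConfig
    from cosmos.profiles import SnowflakeUserPasswordProfileMapping

    # Option A: keep a Cosmos profile mapping, but skip the mocked profile during DAG rendering
    mapped_profile = ProfileConfig(
        profile_name="my_profile",                        # hypothetical
        target_name="dev",                                # hypothetical
        profile_mapping=SnowflakeUserPasswordProfileMapping(
            conn_id="snowflake_default",                  # hypothetical Airflow connection id
            profile_args={"schema": "analytics"},         # hypothetical
        ),
    )
    render_config = RenderConfig(enable_mock_profile=False)

    # Option B: bypass profile mappings entirely with a static profiles.yml
    static_profile = ProfileConfig(
        profile_name="my_profile",                        # must match the profile name in profiles.yml
        target_name="dev",
        profiles_yml_filepath="/usr/local/airflow/dags/dbt/my_project/profiles.yml",  # hypothetical path
    )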

Monitoring specific logs and metrics can also help diagnose high resource usage. The cosmos/converter.py file includes logging statements that provide performance metrics for parsing the dbt project and building the Airflow DAG [3].

By following these optimizations and configurations, you can improve the performance and reduce the CPU and memory usage when using Cosmos with dbt models.

To continue talking to Dosu, mention @dosu.

pankajastro commented 3 months ago

We made some performance improvements in 1.5. Would it be possible to upgrade and test?

@tatiana has discussed some best-practice strategies in a webinar you might be interested in: https://www.astronomer.io/events/webinars/best-practices-effectively-running-dbt-airflow-video/

michal-mrazek commented 3 months ago

We tested upgrading to 1.5.1, but unfortunately resource utilization did not decrease much.

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 30 days with no activity.