astronomer / astronomer-cosmos

Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code
https://astronomer.github.io/astronomer-cosmos/
Apache License 2.0
653 stars 168 forks source link

DBT_MANIFEST mode does not use manifest for task execution #791

Closed seeholza closed 9 months ago

seeholza commented 10 months ago

I am using DBT_MANIFEST parsing mode, and seeing quite a bit of time being spent in each task on parsing the whole project, spending around 1/3 of the task total time on a full parse:

[2024-01-09, 16:51:27 CET] {subprocess.py:94} INFO - 15:51:27  Unable to do partial parsing because saved manifest not found. Starting full parse.
[2024-01-09, 16:51:30 CET] {subprocess.py:94} INFO - 15:51:30  Found 38 models, 6 tests, 20 sources, 0 exposures, 0 metrics, 532 macros, 0 groups, 0 semantic models

I see the command being executed for each task is along the lines of

[2024-01-09, 16:51:23 CET] {subprocess.py:73} INFO - Running command: ['/opt/***/dbt_venv/bin/dbt', 'run', '--models', 'mart_facts_samples', '--profiles-dir', '/tmp/tmpnhr57up0', '--profile', 'default', '--target', 'prod']

I believe that providing the target-dir with the manifest.json in the command execution would probably allow skipping the re-parsing of the project for each task.

Related to this:

I am fairly new to cosmos, so I would be grateful for some pointers in the right direction.

dosubot[bot] commented 10 months ago

πŸ€–

It's great to see your interest in improving Cosmos! Your observation about using the target-dir with the manifest.json in the command execution to potentially skip the re-parsing of the project for each task sounds like a promising approach.

I would encourage you to go ahead and create a pull request with your proposed solution. The maintainers will review it and provide feedback. Your contribution will not only benefit the community but also help enhance the functionality of Cosmos for everyone. Good luck, and thank you for considering to contribute!


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a πŸ‘ if this is helpful and πŸ‘Ž if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dwreeves commented 10 months ago

I think this is essentially the same as #785; I think the runtime only uses the manifest if it detects no code changes, which it can only do with the msgpack.

Let me know if you'd like to coordinate adding this change as I was going to get to it this weekend, but we could work together on it. The solution is if an existing target dir is found that the whole target dir should be moved into the tmpdir for the runtime.

seeholza commented 9 months ago

Hi @dwreeves ,

Next to the job I don't have much time to work on this, but i am happy to help if i manage.

I would probably start by adding support for supplying the msgpack in dbt-manifest mode, with a more generic API change in mind after. But we would need agree on the final design of how msgpack/manifest is copied to the tmp directories.

Do you have a branch already? I will close this issue in favor of #785 and we continue discussing there then.

dwreeves commented 9 months ago

Hi @flinz,

I opened a PR, yes: #800.

The way I implemented it is entirely agnostic to LoadMode.DBT_MANIFEST. Basically, by default, it will always check for the existence of the partial parse file before parsing when running a command locally. So any user who is running dbt compile in their CICD, whether they're using manifest or LS, will be fine. (a small subset of DBT_LS users may mysteriously see their performance improve in the next release after this is implemented!).

Also, for what it is worth, I did test that dbt never uses manifest.json for input purposes, only outputs. So when we say "use manifest for task execution" it's never manifest.json, only ever partial_parse.msgpack.

You raise a good point that #785 is actually two issues in one (one is the msgpack thing, the other is about execution of projects using artifacts from cloud storage). I made a note of that in #800 and I do believe that these should be separated.