astronomer / astronomer-cosmos

Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code
https://astronomer.github.io/astronomer-cosmos/
Apache License 2.0
596 stars 152 forks source link

[Bug] Cosmos stale temporary directories #958

Open tatiana opened 4 months ago

tatiana commented 4 months ago

Context

It seems sometimes Cosmos is creating and not deleting temporary directories.

An example of a report, from 30 April 2024 in the #airflow-dbt Slack channel: https://apache-airflow.slack.com/archives/C059CC42E9W/p1714484749579599

Robert Jones: Good day I'm running cosmos 1.3.2 on MWAA 2.7.2. I'm using the load_method=LoadMode.DBT_MANIFEST in the RenderConfig. But the tmp is filling up over time with dirs that are mostly symlinks but then there is also a target directories that has the compiled code and graph.gpickle. Does cosmos come with a cleanup mechanism or do I need to build a DAG that cleans the tmp directory?

Here are some examples:
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - /tmp/tmp0x486tay/target:
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - total 38M
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - drwxr-xr-x 4 airflow airflow 4.0K Apr 30 11:37 .
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 ..
[2024-04-30T12:45:15.479+0000] {{subprocess.py:93}} INFO - drwxr-xr-x 3 airflow airflow 4.0K Apr 30 11:36 compiled
[2024-04-30T12:45:15.481+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 6.4M Apr 30 11:36 graph.gpickle
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 843K Apr 30 11:36 graph_summary.json
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow  17M Apr 30 11:37 manifest.json
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow  14M Apr 30 11:36 partial_parse.msgpack
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - drwxr-xr-x 3 airflow airflow 4.0K Apr 30 11:36 run
[2024-04-30T12:45:15.483+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 3.9K Apr 30 11:37 run_results.json
[2024-04-30T12:45:15.483+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow  234 Apr 30 11:37 semantic_manifest.json
[2024-04-30T12:45:15.304+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 03:15 tmp0m2gkxru
[2024-04-30T12:45:15.304+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp0x486tay
[2024-04-30T12:45:15.304+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 13:45 tmp0zf1jeip
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp12fmwv6j
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 03:15 tmp1i3_4i36
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 12:44 tmp20rc3co2
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:48 tmp24zhr1yr
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp35wco4yc
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmp3_ugz4uy
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:49 tmp3nk2g9_w
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 03:15 tmp3pverouw
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp4md00dzk
[2024-04-30T12:45:15.312+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp59t5kcsa
[2024-04-30T12:45:15.313+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp5_yn3j90
[2024-04-30T12:45:15.313+0000] {{subprocess.py:93}} INFO - -rw------- 1 airflow airflow  13K Apr 30 11:31 tmp5c_o7q6w
[2024-04-30T12:45:15.314+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 03:15 tmp5di13bsx
[2024-04-30T12:45:15.314+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 03:15 tmp5p8h7bk2
[2024-04-30T12:45:15.314+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp5zdqwdez
[2024-04-30T12:45:15.315+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:49 tmp62enlhb2
[2024-04-30T12:45:15.317+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp6gv98dj3
[2024-04-30T12:45:15.318+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp74qo5pm7
[2024-04-30T12:45:15.318+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 03:15 tmp86bttjpg
[2024-04-30T12:45:15.318+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp8_601uyf
[2024-04-30T12:45:15.324+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 03:15 tmp8p1if86h
[2024-04-30T12:45:15.324+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp99tdocou
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp9l6ljh4p
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp9tx575tl
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 13:30 tmp_0uymz3x
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 03:15 tmp_ph3232z
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:37 tmpbtvqwlhb
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpd1eb6y2y
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:39 tmpe_gz5es8
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 02:15 tmpeaqsvtm4
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 12:44 tmpem8_pa52
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpemz78wrw
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpf3v5iz3_
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:39 tmpfb8rvq9v
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmpho39phpd
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmphxru1hq8
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpiamgv8ox
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpj_k0_9h4
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpjg_xwyd9
[2024-04-30T12:45:15.328+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:48 tmpjpxu2ltp
[2024-04-30T12:45:15.329+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpjr32_bt0
[2024-04-30T12:45:15.329+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:38 tmpjs2i5xmv
[2024-04-30T12:45:15.330+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpjsg9rdn6
[2024-04-30T12:45:15.330+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 02:15 tmpkcw52xq4
[2024-04-30T12:45:15.330+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpksxjhhda
[2024-04-30T12:45:15.332+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmplkcqmlm8
[2024-04-30T12:45:15.333+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpn57z5j0_
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:49 tmpna7v18jf
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpne2u4xqd
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 03:15 tmpnpbketqi
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpnsrts_b8
[2024-04-30T12:45:15.338+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:43 tmpp6qk2k5w
[2024-04-30T12:45:15.338+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmppf0pprft
[2024-04-30T12:45:15.338+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:37 tmppjrk253b
[2024-04-30T12:45:15.342+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 02:15 tmpq5_8ma32
[2024-04-30T12:45:15.342+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpq6blaw7z
[2024-04-30T12:45:15.344+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpr5zqdd56
[2024-04-30T12:45:15.344+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpsrpybyij
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmptcyhp6is
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmpu591ld0j
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 03:15 tmpuop7l2k4
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpvox078h7
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpx6jq5_ct
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 02:15 tmpxi6odkmu
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:46 tmpxpsqdy3a
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpy3vhu055
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 03:15 tmpyi3sxcqm
[2024-04-30T12:45:15.362+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 03:15 tmpzcq8r5_a
[2024-04-30T12:45:15.365+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 13:30 tmpzw4pw0ve

There have been previous discussions in the core Airflow https://github.com/apache/airflow/issues/22404 about tempfile.TemporaryDirectory doesn't necessarily behaving as expected.

I was not able to reproduce this issue yet, but one possibility is that there is some exception or error, and the context manager tempfile.TemporaryDirectory is not being able to clean things after those scenarios.

Acceptance criteria

maslick commented 3 months ago

We are experiencing a similar issue in our MWAA environment:

I have done some investigation:

df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
overlay          30G   28G   48M 100% /
sudo du -ah /tmp | sort -n -r | head -n 200
26M /tmp/cosmos/wf_12__DBTGrp_1/target
26M /tmp/cosmos/wf_13__DBTGrp_1/target
26M /tmp/cosmos/wf_14__DBTGrp_1/target
26M /tmp/cosmos/wf_15__DBTGrp_1/target
26M /tmp/cosmos/wf_15__DBTGrp_1/target
...
...
ls -la /tmp/cosmos/wf_12__DBTGrp_1/target
total 25952
drwxr-xr-x 3 airflow airflow     4096 May 24 11:36 ..
drwxr-xr-x 2 airflow airflow     4096 May 24 11:38 .
-rw-r--r-- 1 airflow airflow 13667800 May 25 09:25 manifest.json
-rw-r--r-- 1 airflow airflow 12896530 May 25 09:25 partial_parse.msgpack
ls -la /tmp/cosmos/wf_13__DBTGrp_1/target
total 25952
drwxr-xr-x 3 airflow airflow     4096 May 24 11:36 ..
drwxr-xr-x 2 airflow airflow     4096 May 24 11:38 .
-rw-r--r-- 1 airflow airflow 13659451 May 25 09:25 manifest.json
-rw-r--r-- 1 airflow airflow 12896530 May 25 09:25 partial_parse.msgpack
tatiana commented 3 months ago

Rodrigo Rabioglio reported the same issue in the #airflow-dbt Slack channel:

Hello I'm using cosmos==1.3.0 with mwaa 2.7.2 Im getting a DAG import error on MWAA due to lack of disk space. It happens only with astronomer-cosmos dbt dags, as follows. Broken DAG: [/usr/local/airflow/dags/MY_DAG.py] Traceback (most recent call last): File "/usr/local/lib/python3.11/tempfile.py", line 854, in init self.name = mkdtemp(suffix, prefix, dir) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/tempfile.py", line 368, in mkdtemp _os.mkdir(file, 0o700) OSError: [Errno 28] No space left on device: '/tmp/tmp5u4mvh_k' I'm wondering what file can be topping disk space on device :thinking_face: I can't access the underlying mwaa container disk to evaluate it. My dbtdag RenderConfig uses LOAD_METHOD = LoadMode.DBT_LS and the execution mode is set to execution_mode=ExecutionMode.VIRTUALENV,

tatiana commented 3 months ago

I believe the issue Rodrigo is facing is likely due to https://github.com/astronomer/astronomer-cosmos/blob/3c9cf6fbceb369efcb6854731833558e8b487749/cosmos/operators/virtualenv.py#L72

This will only be executed in case the operator execution succeeds: https://github.com/astronomer/astronomer-cosmos/blob/3c9cf6fbceb369efcb6854731833558e8b487749/cosmos/operators/virtualenv.py#L106-L107

We should change this to use context manager.

From Cosmos 1.4 onwards, we also received reports that caching locally the partial parse file was leading to issues in MWAA: https://github.com/astronomer/astronomer-cosmos/pull/1025#issuecomment-2160827831

This will hopefully be solved once #927 is implemented.

kesompochy commented 2 weeks ago

We are encountering the same issue with temporary directories not being deleted.

It looks like none of the /tmp/cosmos-venv* directories are being deleted.

To reproduce this issue locally, I have been using composer-local-dev.

I added a line to manually remove the directories using

shutil.rmtree(self.virtualenv_dir, ignore_errors=True)

after the self._release_venv_lock() line,

https://github.com/astronomer/astronomer-cosmos/blob/1f1fc61e637a5e66b1de483200782a7088246261/cosmos/operators/virtualenv.py#L120

which resolved the issue of stale directories. However, I am unsure why this manual removal is necessary or why the directories are not automatically deleted.

Should I create a pull request with this fix, or should we investigate further to identify the root cause?

What we know so far:

kesompochy commented 2 weeks ago

https://github.com/astronomer/astronomer-cosmos/issues/958#issuecomment-2326243062

I noticed that the temporary directories /tmp/cosmos-venv* are deleted when is_virtualenv_dir_temporary is set to True in Cosmos@1.6.0. However, the issue occurs when is_virtualenv_dir_temporary is set to False. In both cases, virtualenv_dir is None (default value). When virtualenv_dir is None, I expect temporary directories should be cleaned up after task execution regardless of the is_virtualenv_dir_temporary setting.

kesompochy commented 2 weeks ago

I've observed the following behavior related to this issue in version 1.6.0: When invoke_dbt is called twice in DbtLocalBaseOperator.run_command, the temporary directories are not deleted when DbtVirtualenvBaseOperator.virtualenv_dir is initially None. In the first execution, virtualenv_dir is None, so a temporary virtualenv is created and virtualenv_dir is set to its path. In the second execution, virtualenv_dir is not None, so a new directory is created and it will not be deleted. This behavior is particularly noticeable when DbtLocalBaseOperator.install_deps is set to True, which causes invoke_dbt to be called twice. Conversely, if invoke_dbt is called only once, the directory is properly deleted when virtualenv_dir is initially None. Is this the intended behavior?