Open tatiana opened 4 months ago
We are experiencing a similar issue in our MWAA environment:
OSError: [Errno 28] No space left on device
I have done some investigation. Each /tmp/cosmos/<...>/target directory holds two files (manifest.json and partial_parse.msgpack) totalling ~26 MB:
df -h /tmp
Filesystem Size Used Avail Use% Mounted on
overlay 30G 28G 48M 100% /
sudo du -ah /tmp | sort -n -r | head -n 200
26M /tmp/cosmos/wf_12__DBTGrp_1/target
26M /tmp/cosmos/wf_13__DBTGrp_1/target
26M /tmp/cosmos/wf_14__DBTGrp_1/target
26M /tmp/cosmos/wf_15__DBTGrp_1/target
26M /tmp/cosmos/wf_15__DBTGrp_1/target
...
...
ls -la /tmp/cosmos/wf_12__DBTGrp_1/target
total 25952
drwxr-xr-x 3 airflow airflow 4096 May 24 11:36 ..
drwxr-xr-x 2 airflow airflow 4096 May 24 11:38 .
-rw-r--r-- 1 airflow airflow 13667800 May 25 09:25 manifest.json
-rw-r--r-- 1 airflow airflow 12896530 May 25 09:25 partial_parse.msgpack
ls -la /tmp/cosmos/wf_13__DBTGrp_1/target
total 25952
drwxr-xr-x 3 airflow airflow 4096 May 24 11:36 ..
drwxr-xr-x 2 airflow airflow 4096 May 24 11:38 .
-rw-r--r-- 1 airflow airflow 13659451 May 25 09:25 manifest.json
-rw-r--r-- 1 airflow airflow 12896530 May 25 09:25 partial_parse.msgpack
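For anyone who cannot run du on the worker, the same measurement can be approximated from Python. This is only a rough sketch: the helper names dir_size_bytes and largest_subdirs are made up for illustration, and the /tmp/cosmos path is simply the one from the output above.

```python
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Return the total size of all regular files under `path`."""
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def largest_subdirs(root: Path, limit: int = 20) -> list[tuple[int, Path]]:
    """List the immediate subdirectories of `root`, largest first."""
    sizes = [(dir_size_bytes(d), d) for d in root.iterdir() if d.is_dir()]
    return sorted(sizes, key=lambda item: item[0], reverse=True)[:limit]


if __name__ == "__main__":
    root = Path("/tmp/cosmos")  # path taken from the report above
    if root.is_dir():
        for size, directory in largest_subdirs(root):
            print(f"{size / 1_000_000:8.1f} MB  {directory}")
```

Running it from a task or a DAG-level log statement should show which workflow directories are accumulating.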
Rodrigo Rabioglio reported the same issue in the #airflow-dbt Slack channel:
Hello I'm using cosmos==1.3.0 with mwaa 2.7.2 Im getting a DAG import error on MWAA due to lack of disk space. It happens only with astronomer-cosmos dbt dags, as follows.
Broken DAG: [/usr/local/airflow/dags/MY_DAG.py] Traceback (most recent call last):
  File "/usr/local/lib/python3.11/tempfile.py", line 854, in __init__
    self.name = mkdtemp(suffix, prefix, dir)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/tempfile.py", line 368, in mkdtemp
    _os.mkdir(file, 0o700)
OSError: [Errno 28] No space left on device: '/tmp/tmp5u4mvh_k'
I'm wondering what file can be topping disk space on device :thinking_face: I can't access the underlying mwaa container disk to evaluate it.
My dbtdag RenderConfig uses LOAD_METHOD = LoadMode.DBT_LS and the execution mode is set to execution_mode=ExecutionMode.VIRTUALENV,
I believe the issue Rodrigo is facing is likely due to https://github.com/astronomer/astronomer-cosmos/blob/3c9cf6fbceb369efcb6854731833558e8b487749/cosmos/operators/virtualenv.py#L72
This will only be executed in case the operator execution succeeds: https://github.com/astronomer/astronomer-cosmos/blob/3c9cf6fbceb369efcb6854731833558e8b487749/cosmos/operators/virtualenv.py#L106-L107
We should change this to use a context manager, so that the directory is removed even when the operator execution fails.
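A minimal sketch of that idea, assuming nothing about the actual operator code: run_in_temporary_venv and the task callable are hypothetical stand-ins for creating the virtualenv and invoking dbt. The cosmos-venv prefix is borrowed from the directory names reported in this thread.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_in_temporary_venv(task: Callable[[Path], None]) -> None:
    """Run `task(venv_dir)` inside a directory that is removed on exit.

    `task` is a hypothetical callable standing in for the real work;
    it is not part of Cosmos.
    """
    with tempfile.TemporaryDirectory(prefix="cosmos-venv") as tmp:
        task(Path(tmp))
    # The directory is removed here, both on success and when `task` raises.


if __name__ == "__main__":
    seen: list[Path] = []

    def failing_task(venv_dir: Path) -> None:
        seen.append(venv_dir)
        raise RuntimeError("simulated dbt failure")

    try:
        run_in_temporary_venv(failing_task)
    except RuntimeError:
        pass

    print(seen[0].exists())  # False: cleaned up despite the exception
```

Note that this only covers ordinary exceptions. If the worker process is killed outright, no Python-level cleanup runs.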
From Cosmos 1.4 onwards, we also received reports that locally caching the partial parse file was causing issues in MWAA: https://github.com/astronomer/astronomer-cosmos/pull/1025#issuecomment-2160827831
This will hopefully be solved once #927 is implemented.
We are encountering the same issue with temporary directories not being deleted.
It looks like none of the /tmp/cosmos-venv* directories are being deleted.
To reproduce this issue locally, I have been using composer-local-dev.
I added a line to manually remove the directories using shutil.rmtree(self.virtualenv_dir, ignore_errors=True) after the self._release_venv_lock() line, which resolved the issue of stale directories. However, I am unsure why this manual removal is necessary or why the directories are not automatically deleted.
Should I create a pull request with this fix, or should we investigate further to identify the root cause?
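To make the workaround concrete, here is a rough sketch of the pattern. VenvRunner is a hypothetical stand-in, not the Cosmos operator, and the body of _release_venv_lock is a placeholder because the real implementation is not shown in this thread. Only the shutil.rmtree call and its position after the lock release come from the description above.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional


class VenvRunner:
    """Hypothetical stand-in for the operator; not the Cosmos class."""

    def __init__(self) -> None:
        self.virtualenv_dir: Optional[Path] = None

    def _release_venv_lock(self) -> None:
        # Placeholder: the real method's body is not reproduced here.
        pass

    def execute(self, work: Callable[[Path], None]) -> None:
        self.virtualenv_dir = Path(tempfile.mkdtemp(prefix="cosmos-venv"))
        try:
            work(self.virtualenv_dir)
        finally:
            self._release_venv_lock()
            # Explicit removal, mirroring the workaround described above.
            shutil.rmtree(self.virtualenv_dir, ignore_errors=True)
```

Putting the removal in a finally block means it also runs when the work fails, which a call placed only on the success path would not.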
What we know so far:
virtualenv_dir is unexpectedly being set to None during execution, which seems to prevent the automatic cleanup.
virtualenv_dir has its expected value at the line self.log.info("Releasing virtualenv lock"), but by the time self.clean_dir_if_temporary() runs, the value of virtualenv_dir has already been set to None.
The TemporaryDirectory context manager is supposed to delete the directory automatically when exiting the with block.
https://github.com/astronomer/astronomer-cosmos/issues/958#issuecomment-2326243062
I noticed that the temporary directories /tmp/cosmos-venv* are deleted when is_virtualenv_dir_temporary is set to True in Cosmos@1.6.0. However, the issue occurs when is_virtualenv_dir_temporary is set to False. In both cases, virtualenv_dir is None (default value).
When virtualenv_dir is None, I expect temporary directories to be cleaned up after task execution regardless of the is_virtualenv_dir_temporary setting.
I've observed the following behavior related to this issue in version 1.6.0:
When invoke_dbt is called twice in DbtLocalBaseOperator.run_command, the temporary directories are not deleted when DbtVirtualenvBaseOperator.virtualenv_dir is initially None. In the first execution, virtualenv_dir is None, so a temporary virtualenv is created and virtualenv_dir is set to its path. In the second execution, virtualenv_dir is not None, so a new directory is created and it will not be deleted.
This behavior is particularly noticeable when DbtLocalBaseOperator.install_deps is set to True, which causes invoke_dbt to be called twice. Conversely, if invoke_dbt is called only once, the directory is properly deleted when virtualenv_dir is initially None.
Is this the intended behavior?
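To illustrate the sequence I am describing, here is a toy model. It is not the Cosmos code: ToyOperator and its branches are assumptions that only mimic the behavior observed above (a temporary directory on the first call, a persistent one once virtualenv_dir holds a path).

```python
import tempfile
from pathlib import Path
from typing import Optional


class ToyOperator:
    """Toy model of the observed behavior; not the real operator."""

    def __init__(self) -> None:
        self.virtualenv_dir: Optional[Path] = None

    def invoke_dbt(self) -> None:
        if self.virtualenv_dir is None:
            # First call: a temporary directory that is removed on exit,
            # but its path is left behind on the attribute.
            with tempfile.TemporaryDirectory(prefix="cosmos-venv") as tmp:
                self.virtualenv_dir = Path(tmp)
        else:
            # Later calls: the attribute now looks user-supplied, so the
            # directory is created again and treated as persistent.
            self.virtualenv_dir.mkdir(parents=True, exist_ok=True)

    def run_command(self, install_deps: bool) -> None:
        if install_deps:
            self.invoke_dbt()  # extra invocation, as with install_deps=True
        self.invoke_dbt()
```

In this model, run_command(install_deps=True) leaves a directory behind, while run_command(install_deps=False) does not, which matches what I see in 1.6.0.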
Context
It seems sometimes Cosmos is creating and not deleting temporary directories.
An example of a report, from 30 April 2024 in the #airflow-dbt Slack channel: https://apache-airflow.slack.com/archives/C059CC42E9W/p1714484749579599
There have been previous discussions in the core Airflow repository (https://github.com/apache/airflow/issues/22404) about tempfile.TemporaryDirectory not necessarily behaving as expected.
I have not been able to reproduce this issue yet, but one possibility is that some exception or error occurs and the tempfile.TemporaryDirectory context manager is unable to clean up afterwards.
Acceptance criteria