Closed tatiana closed 2 months ago
This is sort of a duplicate of #870, although I prefer we use this issue as yours is newer and more general. (E.g. I don't mention the use of xcoms as the cache.) Just tagging that issue to relate these discussions.
Remote filesystem stuff keeps coming up in multiple contexts. And we already have support for this in, of all places, cosmos/plugin/__init__.py with the open_file() function.
I think this function should be moved to some sort of utils file for interacting with remote filesystems, and the logic for getting the conn_id from the cosmos config should be decoupled from open_file() so it can be used more generically.
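To make the decoupling idea concrete, here is a minimal sketch, not the actual Cosmos code. Everything in it is hypothetical: the reader registry, `register_reader`, and this `open_file` signature are assumptions for illustration. The point is only that the caller passes `conn_id` in explicitly, so the helper never has to read the Cosmos config itself.

```python
from pathlib import Path
from typing import Callable, Dict, Optional
from urllib.parse import urlsplit

# Hypothetical type: a reader takes (path, conn_id) and returns the file contents.
Reader = Callable[[str, Optional[str]], str]

# Hypothetical registry mapping a URL scheme (e.g. "s3", "gs") to a reader.
_READERS: Dict[str, Reader] = {}


def register_reader(scheme: str, reader: Reader) -> None:
    """Register the reader to use for paths with the given scheme."""
    _READERS[scheme] = reader


def _read_local(path: str, conn_id: Optional[str] = None) -> str:
    """Fallback reader for plain local paths; conn_id is ignored."""
    return Path(path).read_text()


def open_file(path: str, conn_id: Optional[str] = None) -> str:
    """Read a file, dispatching on the path's scheme.

    conn_id is supplied by the caller, so this function does not
    depend on any particular configuration source.
    """
    scheme = urlsplit(path).scheme
    reader = _READERS.get(scheme, _read_local)
    return reader(path, conn_id)
```

A caller that does want the Cosmos default would look up the conn_id from the config first and then pass it in, which keeps the config lookup in one place.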
That said, we also want to make sure we are doing things idiomatically. For Airflow 2.8+, ObjectStoragePath was essentially designed to do this. I'd like it if Cosmos felt like Airflow, and used things that are standard in Airflow.
For supporting older versions of Airflow, we can create some sort of compatibility thing:
    # cosmos/compat/__init__.py
    try:
        from airflow.io.path import ObjectStoragePath
    except ImportError:
        from cosmos.compat._object_storage_path import ObjectStoragePath
where _object_storage_path.py contains an Airflow 2.4+ compliant implementation of ObjectStoragePath.
We decided to use the Airflow Object Storage feature that is available since Airflow 2.8.0.
Since the approach is decided, I will create sub-tasks for remote caching of each of the locally stored caches (package-lock, partial parse, and profile cache) and then close this ticket.
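As a rough illustration of what remote caching could look like with this approach, here is a sketch that copies a local cache directory to another location using only pathlib-style calls. It is not the Cosmos implementation. The function name `upload_cache_dir` is hypothetical, and it assumes the destination supports `joinpath`, `parent`, `mkdir`, and `open` the way `pathlib.Path` does. Airflow's ObjectStoragePath is documented as following the pathlib API, so it should be usable as `remote_dir`, but how `mkdir` behaves on a given object store is an assumption to verify.

```python
import shutil
from pathlib import Path
from typing import List


def upload_cache_dir(local_dir, remote_dir) -> List[str]:
    """Copy every file under local_dir to the same relative path under remote_dir.

    remote_dir can be any pathlib-like object (assumption: it supports
    joinpath, parent, mkdir and open). Returns the copied relative paths.
    """
    local_root = Path(local_dir)
    copied = []
    for src in sorted(local_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(local_root)
        dest = remote_dir.joinpath(*rel.parts)
        # Create intermediate "directories" where the backend needs them.
        dest.parent.mkdir(parents=True, exist_ok=True)
        with src.open("rb") as fin, dest.open("wb") as fout:
            shutil.copyfileobj(fin, fout)
        copied.append(rel.as_posix())
    return copied
```

With a plain `pathlib.Path` as `remote_dir` this is just a directory copy, which makes the logic easy to test locally before pointing it at an object store.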
We started doing this in #1147, but we still need to extend this feature to support other caches (partial parsing + manifest, profile, dbt_packages.lock). @pankajkoti will be logging sub-tasks so we can address this over time.
We have achieved the acceptance criteria for this ticket. For follow-up work, I have created the tickets below:
https://github.com/astronomer/astronomer-cosmos/issues/1177
https://github.com/astronomer/astronomer-cosmos/issues/1178
https://github.com/astronomer/astronomer-cosmos/issues/1179
I am hence closing this ticket.
Context
In #904, Cosmos introduced caching, which contributed to the latest performance improvements in 1.4.
However, one of the limitations of this approach is that the cache is stored locally, on disk. This means that:
KubernetesExecutor
During the code review of the PR mentioned above, one piece of feedback was that it would be great if we supported caching this in S3/GCS/Blob storage: https://github.com/astronomer/astronomer-cosmos/pull/904#issuecomment-2032602233 (from @jlaneve).
Another suggestion was to leverage the Airflow 2.8 Object Storage feature (https://github.com/astronomer/astronomer-cosmos/pull/904#issuecomment-2049512157) and/or to use an XCom backend to store the cache (from @kaxil).
Acceptance Criteria