astronomer / astronomer-cosmos

Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code
https://astronomer.github.io/astronomer-cosmos/
Apache License 2.0

Support caching remotely #927

Closed: tatiana closed this 2 months ago

tatiana commented 6 months ago

Context

With #904, Cosmos introduced caching, which contributed to the performance improvements in 1.4.

However, one limitation of this approach is that the cache is stored locally, on disk.

During the code review of the PR mentioned above, one piece of feedback was that it would be great if we supported storing this cache in S3/GCS/Blob storage: https://github.com/astronomer/astronomer-cosmos/pull/904#issuecomment-2032602233 (from @jlaneve).

Another suggestion was to leverage the Airflow 2.8 ObjectStore (https://github.com/astronomer/astronomer-cosmos/pull/904#issuecomment-2049512157) and/or to use an XCom backend to store the cache (from @kaxil).

Acceptance Criteria

dwreeves commented 6 months ago

This is sort of a duplicate of #870, although I prefer we use this issue as yours is newer and more general. (E.g. I don't mention the use of xcoms as the cache.) Just tagging that issue to relate these discussions.

dwreeves commented 5 months ago

Remote filesystem stuff keeps coming up in multiple contexts. And we already have support for this in, of all places, cosmos/plugin/__init__.py with the open_file() function.

I think this function should be moved to some sort of utils file for interacting with remote filesystems, and the logic for getting the conn_id from the cosmos config should be decoupled from open_file() so it can be used more generically.
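To make the decoupling concrete, here is a minimal sketch. The module name cosmos/remote_fs.py and the function open_remote_file() are hypothetical, not existing Cosmos code, and it assumes the relevant provider packages are installed:

# cosmos/remote_fs.py -- hypothetical home for a generic remote-filesystem helper
from urllib.parse import urlsplit


def open_remote_file(path: str, conn_id: str) -> str:
    """Read a remote text file, dispatching on the URL scheme.

    conn_id is an explicit argument rather than being looked up in the
    Cosmos config, so the helper can be reused outside the docs plugin.
    """
    parsed = urlsplit(path)
    if parsed.scheme == "s3":
        from airflow.providers.amazon.aws.hooks.s3 import S3Hook

        return S3Hook(aws_conn_id=conn_id).read_key(
            key=parsed.path.lstrip("/"), bucket_name=parsed.netloc
        )
    if parsed.scheme == "gs":
        from airflow.providers.google.cloud.hooks.gcs import GCSHook

        blob = GCSHook(gcp_conn_id=conn_id).download(
            bucket_name=parsed.netloc, object_name=parsed.path.lstrip("/")
        )
        return blob.decode("utf-8")
    raise ValueError(f"Unsupported scheme: {parsed.scheme}")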

That said, we also want to make sure we are doing things idiomatically. For Airflow 2.8+, ObjectStoragePath was essentially designed to do this. I'd like it if Cosmos felt like Airflow and used things that are standard in Airflow.

To support older versions of Airflow, we can create some sort of compatibility shim:

# cosmos/compat/__init__.py
try:
    from airflow.io.path import ObjectStoragePath
except ImportError:
    from cosmos.compat._object_storage_path import ObjectStoragePath

where _object_storage_path.py contains an Airflow 2.4+ compliant implementation of ObjectStoragePath.
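Call sites would then import the class from the compat module and use the same pathlib-style API on every supported Airflow version. A small illustrative usage follows; the bucket, conn_id, and file name are made up for the example:

from cosmos.compat import ObjectStoragePath

# Illustrative remote location for a cache entry; not an actual Cosmos default.
cache_dir = ObjectStoragePath("s3://my-bucket/cosmos_cache/", conn_id="aws_default")
entry = cache_dir / "my_dbt_project" / "partial_parse.msgpack"
if not entry.exists():
    with open("target/partial_parse.msgpack", "rb") as src, entry.open("wb") as dst:
        dst.write(src.read())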

pankajkoti commented 3 months ago

We decided to use the Airflow Object Storage feature, which has been available since Airflow 2.8.0.

Since the approach is decided, I will create sub-tasks for caching each of the local caches remotely (package-lock, partial parse, profile) and then close this ticket.
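As a rough sketch of what the partial parse sub-task could look like on Airflow 2.8+ (the remote base path, connection id, and cache-key layout below are assumptions for illustration, not the final Cosmos settings):

from pathlib import Path

from airflow.io.path import ObjectStoragePath

# Assumed remote layout: <remote base>/<cache key>/partial_parse.msgpack
REMOTE_CACHE_DIR = ObjectStoragePath("s3://my-bucket/cosmos_cache/", conn_id="aws_default")
LOCAL_PARTIAL_PARSE = Path("target/partial_parse.msgpack")


def restore_partial_parse(cache_key: str) -> bool:
    """Copy a previously uploaded partial parse file into the local target dir, if one exists."""
    remote = REMOTE_CACHE_DIR / cache_key / LOCAL_PARTIAL_PARSE.name
    if not remote.exists():
        return False
    LOCAL_PARTIAL_PARSE.parent.mkdir(parents=True, exist_ok=True)
    with remote.open("rb") as src, LOCAL_PARTIAL_PARSE.open("wb") as dst:
        dst.write(src.read())
    return True


def upload_partial_parse(cache_key: str) -> None:
    """Publish the locally generated partial parse file so other workers can reuse it."""
    remote = REMOTE_CACHE_DIR / cache_key / LOCAL_PARTIAL_PARSE.name
    with LOCAL_PARTIAL_PARSE.open("rb") as src, remote.open("wb") as dst:
        dst.write(src.read())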

tatiana commented 2 months ago

We started doing this in #1147, but we still need to extend the feature to the other caches (partial parsing + manifest, profile, dbt_packages.lock). @pankajkoti will be logging sub-tasks so we can address this over time.

pankajkoti commented 2 months ago

We have achieved the acceptance criteria for this ticket. For follow-up work, I have created the tickets below:

https://github.com/astronomer/astronomer-cosmos/issues/1177
https://github.com/astronomer/astronomer-cosmos/issues/1178
https://github.com/astronomer/astronomer-cosmos/issues/1179

I am hence closing this ticket.