astronomer / astronomer-cosmos

Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code
https://astronomer.github.io/astronomer-cosmos/
Apache License 2.0

Support multiple projects' dbt docs #862

Open dwreeves opened 6 months ago

dwreeves commented 6 months ago

Discussion here: https://github.com/astronomer/astronomer-cosmos/pull/737#issuecomment-1964349542

Better to get this done now before 1.4 ships than after, I think.

tl;dr: My main worry is that the API for supporting multiple projects' docs could get very convoluted. Neither the ini file format nor env vars natively support arrays of objects, so basically any approach ends up either ugly or complicated (e.g. reading from a YAML file).
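To make the limitation concrete, here is a minimal sketch using Python's standard configparser, which reads the same .ini format as airflow.cfg. The option name dbt_docs_projects and the project names and paths are hypothetical, invented for illustration; they are not existing Cosmos settings. JSON-encoding the value is only one possible workaround.

```python
import configparser
import json

# Hypothetical config: "dbt_docs_projects" is NOT an existing Cosmos option.
CFG_TEXT = """
[cosmos]
dbt_docs_projects = [{"name": "sales", "dir": "s3://bucket/sales/target"}, {"name": "ops", "dir": "/opt/dbt/ops/target"}]
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)

# configparser only ever hands back strings; there is no native list type.
raw_value = parser.get("cosmos", "dbt_docs_projects")

# One workaround: treat the string as JSON and decode it ourselves.
projects = json.loads(raw_value)
```

The cost is that the user has to hand-write valid JSON inside an .ini value (or an env var), which is the "not pretty" part.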

I'm leaving this as an open ended issue to discuss solutions.

dosubot[bot] commented 6 months ago

🤖

Hey there @dwreeves! Long time no chat on GitHub issues, huh? 😄

To support multiple projects' dbt docs in the Cosmos repository without complicating the API, you could consider creating multiple instances of the ProjectConfig class, each representing a different dbt project.

The ProjectConfig class in the Cosmos repository is designed to handle the configuration of a single dbt project. It includes parameters such as dbt_project_path, models_relative_path, seeds_relative_path, snapshots_relative_path, manifest_path, project_name, env_vars, dbt_vars, and partial_parse.

Each instance of ProjectConfig can be associated with a different dbt project by specifying a different dbt_project_path (the path to the dbt project directory) and project_name (the name of the project).

Here is an example of how you can create multiple ProjectConfig instances:

from cosmos import ProjectConfig

project_config1 = ProjectConfig(
    dbt_project_path="/path/to/dbt/project1",
    project_name="project1",
    env_vars={"DBT_ENV_VAR1": "value1"},
    dbt_vars={"dbt_var1": "value1"},
)

project_config2 = ProjectConfig(
    dbt_project_path="/path/to/dbt/project2",
    project_name="project2",
    env_vars={"DBT_ENV_VAR2": "value2"},
    dbt_vars={"dbt_var2": "value2"},
)

In this example, project_config1 and project_config2 represent two different dbt projects. You can create as many ProjectConfig instances as you need, each representing a different dbt project.

This approach allows you to support multiple dbt projects without complicating the API or dealing with the limitations of ini file format and environment variables for configuration. Each ProjectConfig instance encapsulates all the configuration details of a dbt project, making it easy to manage multiple projects.

For more details, you can refer to the ProjectConfig class in the cosmos/config.py file.

dwreeves commented 6 months ago

So one implementation that is fully compatible with the existing API would be to do something like this:

The main challenge here is getting the names of the dbt projects. This requires downloading each manifest JSON and then reading manifest["metadata"]["project_name"]. It could be prudent, albeit a minor over-optimization, to use flask-caching (letting the user set the cache timeout to 0 if they don't want caching), because this does end up downloading a lot of files. (Note: flask-caching is already a dependency of Airflow.) Another option is to allow users to pass in their own labels.
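A rough sketch of the lookup-plus-cache idea, assuming the manifest is already available as a local file. The in-memory TTL cache below is a stdlib stand-in written for illustration, not flask-caching's API, and get_project_name is a hypothetical helper, not part of Cosmos. Fetching from S3 or other remote storage is left out.

```python
import json
import time
from pathlib import Path
from typing import Dict, Tuple

# Hypothetical in-memory TTL cache; a stand-in for flask-caching, not its API.
# Maps manifest path -> (time the entry was stored, project name).
_CACHE: Dict[str, Tuple[float, str]] = {}


def get_project_name(manifest_path: str, ttl_seconds: float = 300.0) -> str:
    """Return manifest["metadata"]["project_name"] for a local manifest file.

    The file is re-read only when the cached entry is older than
    ttl_seconds. Passing ttl_seconds=0 skips the cache lookup entirely.
    """
    now = time.monotonic()
    cached = _CACHE.get(manifest_path)
    if ttl_seconds > 0 and cached is not None and now - cached[0] < ttl_seconds:
        return cached[1]

    manifest = json.loads(Path(manifest_path).read_text())
    name = manifest["metadata"]["project_name"]
    _CACHE[manifest_path] = (now, name)
    return name
```

With a nonzero TTL, repeated page loads reuse the stored name instead of re-reading (or, in a real setup, re-downloading) the manifest each time.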

If reading manifest.metadata.project_name to get the project name for each doc, you will probably want some sort of exception handling or timeout logic in case something goes wrong, say if the S3 docs aren't loading but the local ones are. I dunno. Maybe that is too complicated.
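One way the exception handling could look, as a sketch only: each docs source gets its own loader, and a failure in one falls back to a user-visible key instead of breaking the others. resolve_labels and the loader callables are hypothetical names for illustration. Any actual timeout would have to be enforced inside the loader (for example by the storage client), which is assumed here and not shown.

```python
import json
from typing import Callable, Dict


def resolve_labels(loaders: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Map each configured key to a display label.

    Each loader returns the raw manifest JSON text. If a loader raises
    (I/O error, timeout) or the JSON is corrupt or missing the expected
    field, the key itself is used as the label.
    """
    labels: Dict[str, str] = {}
    for key, load in loaders.items():
        try:
            manifest = json.loads(load())
            labels[key] = manifest["metadata"]["project_name"]
        except (OSError, ValueError, KeyError, TypeError):
            # TimeoutError is an OSError; JSONDecodeError is a ValueError.
            labels[key] = key
    return labels
```

The fallback keeps the local docs usable even when a remote manifest cannot be read.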

The issue with creating a new menu item for dbt docs is twofold. First, it's not appropriate for most users, who have just one project. Second, dbt project names cannot realistically be automated in this context, as loading the manifest.json can block the Airflow UI from loading on normal, non-dbt-docs pages, and it also wastes S3 reads. Or worse: imagine a scenario where your manifest.json is messed up, and your entire Airflow UI crashes because the plugin is attempting to read a corrupt or nonexistent JSON, but you also need access to the Airflow UI to diagnose the problem... not good. You cannot automate the names of the UI elements from the manifest.json files if you are taking an app-builder menu item approach.

I'm not a happy camper setting any of this up, because airflow.cfg (.ini file format) isn't well suited for this (.ini doesn't natively have an array type, which is the proper data model here), and it also feels like there is no way to avoid one of the two setups (solo docs or multi docs) getting the short end of the stick. Most users just have one dbt project, so it makes sense to have an abstraction that prioritizes single-project deploys (which the current API does well) and lets people with complicated setups deal with a slightly complicated API, which is fair and even congruent with what they're doing already. I think this is the least intrusive way to support multiple projects while keeping the user experience friendly to the majority of users with just one project. (Also, some users are on multiple projects because large projects execute slowly, which is also being addressed in 1.4.)
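The "simple for one project, slightly more complicated for many" trade-off could be sketched roughly as below. Both parameters are hypothetical stand-ins for config values (a plain directory string for the single-project case, a JSON array for the multi-project case), and resolve_docs_projects is an illustrative helper, not Cosmos code.

```python
import json
from typing import Dict, List, Optional


def resolve_docs_projects(
    docs_dir: Optional[str] = None,
    projects_json: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Normalize both config styles into one list of project dicts.

    Single-project users set only a plain string. Multi-project users
    opt in to the more complicated JSON form, which takes precedence.
    """
    if projects_json:
        projects = json.loads(projects_json)
        if not isinstance(projects, list):
            raise ValueError("expected a JSON array of project objects")
        return projects
    if docs_dir:
        return [{"dir": docs_dir}]
    return []
```

Downstream code then only deals with a list, and the single-project path never has to touch JSON.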

tatiana commented 2 months ago

@dwreeves, we aim to release Cosmos 1.6 by the end of the month. Do you think you will have the bandwidth to work on this before then? If not, what do you think about moving it to the 1.7 release or later?