Open dwreeves opened 6 months ago
🤖
Hey there @dwreeves! Long time no chat on GitHub issues, huh? 😄
To support multiple projects' dbt docs in the Cosmos repository without complicating the API, you could consider creating multiple instances of the ProjectConfig class, each representing a different dbt project.
The ProjectConfig class in the Cosmos repository is designed to handle the configuration of a single dbt project. It includes parameters such as dbt_project_path, models_relative_path, seeds_relative_path, snapshots_relative_path, manifest_path, project_name, env_vars, dbt_vars, and partial_parse.
Each instance of ProjectConfig can be associated with a different dbt project by specifying a different dbt_project_path (the path to the dbt project directory) and project_name (the name of the project).
Here is an example of how you can create multiple ProjectConfig instances:
    from cosmos import ProjectConfig

    # First dbt project, with its own path, name, and variables.
    project_config1 = ProjectConfig(
        dbt_project_path="/path/to/dbt/project1",
        project_name="project1",
        env_vars={"DBT_ENV_VAR1": "value1"},
        dbt_vars={"dbt_var1": "value1"},
    )

    # Second dbt project, configured independently of the first.
    project_config2 = ProjectConfig(
        dbt_project_path="/path/to/dbt/project2",
        project_name="project2",
        env_vars={"DBT_ENV_VAR2": "value2"},
        dbt_vars={"dbt_var2": "value2"},
    )
In this example, project_config1 and project_config2 represent two different dbt projects. You can create as many ProjectConfig instances as you need, each representing a different dbt project.
This approach allows you to support multiple dbt projects without complicating the API or dealing with the limitations of ini file format and environment variables for configuration. Each ProjectConfig instance encapsulates all the configuration details of a dbt project, making it easy to manage multiple projects.
For more details, you can refer to the ProjectConfig class in the cosmos/config.py file.
So one implementation that is fully compatible with the existing API would be to do something like this:
zip() with the paths.
The main challenge here is getting the names of the dbt projects. This requires downloading each manifest.json and then retrieving manifest["metadata"]["project_name"]. It could be prudent, albeit a minor over-optimization, to use flask-caching (and allow the user to override the default cache time to 0 if they don't want this), because this does end up downloading a lot of files. (Note: flask-caching is already a dependency of Airflow.) Another option is to allow users to pass in their own labels.
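To make the caching idea concrete, here is a rough sketch of reading the project name out of a manifest and caching it for a while. It deliberately uses only the standard library rather than flask-caching, and everything named here (project_name_from_manifest, TTLCache, the "0 means no caching" convention taken from the comment above) is an assumption for illustration, not anything Cosmos ships.

```python
import json
import time
from typing import Callable, Dict, Tuple


def project_name_from_manifest(raw_manifest: str) -> str:
    """Return the dbt project name stored in a manifest.json payload."""
    manifest = json.loads(raw_manifest)
    return manifest["metadata"]["project_name"]


class TTLCache:
    """Tiny time-based cache (hypothetical helper).

    A timeout of 0 disables caching in this sketch, so every call
    re-runs the loader.
    """

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get_or_load(self, key: str, loader: Callable[[], str]) -> str:
        now = time.monotonic()
        cached = self._entries.get(key)
        if (
            self.timeout_seconds > 0
            and cached is not None
            and now - cached[0] < self.timeout_seconds
        ):
            return cached[1]
        # Cache miss (or caching disabled): do the expensive download/parse.
        value = loader()
        if self.timeout_seconds > 0:
            self._entries[key] = (now, value)
        return value
```

The loader would be whatever downloads the manifest from S3, GCS, or local disk; keying the cache by docs path means each project is only fetched once per timeout window.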
If the project names are read from manifest.metadata.project_name for each docs site, you will probably want some sort of exception handling or timeout logic in case there is an issue, say if the S3 docs aren't loading but the local ones are. I dunno. Maybe that is too complicated.
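For what it's worth, the exception handling does not have to be elaborate. A minimal sketch, assuming a hypothetical resolve_label helper and a caller-supplied load_manifest function (neither exists in Cosmos): prefer a user-provided label, otherwise try the manifest, and fall back to a placeholder if the read or parse fails.

```python
import json
from typing import Callable, Optional


def resolve_label(
    load_manifest: Callable[[], str],
    fallback: str,
    user_label: Optional[str] = None,
) -> str:
    """Pick a display label for one docs site without ever raising.

    load_manifest is a hypothetical callable that returns the raw
    manifest.json text (from S3, local disk, etc.).
    """
    if user_label:
        # An explicit label wins and avoids any download at all.
        return user_label
    try:
        return json.loads(load_manifest())["metadata"]["project_name"]
    except (OSError, ValueError, KeyError, TypeError):
        # OSError covers I/O failures and timeouts, ValueError covers
        # malformed JSON, KeyError/TypeError cover unexpected structure.
        return fallback
```

With this shape, one broken docs location only costs that project its pretty name; the other projects still resolve normally.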
The issue with creating a new menu item for dbt docs is twofold. First, it's not appropriate for most users with just one project. Second, dbt project names cannot realistically be automated in this context, as loading the manifest.json can block the Airflow UI from loading on normal, non-dbt docs pages, and also wastes S3 reads. Or worse: imagine a scenario where your manifest.json is messed up, and your entire Airflow UI crashes because the plugin is attempting to read a corrupt or nonexistent JSON, but you also need access to the Airflow UI to diagnose the problem... not good. You cannot automate the names of the UI elements from the manifest.json files if you are doing an app-builder menu item approach.
I'm not a happy camper setting any of this up because airflow.cfg (.ini file format) isn't well suited for this (.ini doesn't natively have an array type, which is the proper data model for this), and also it feels like there is no way to avoid one of the two setups (solo docs, or multi docs) ending up with the short end of the stick. Most users just have one dbt project, so it makes sense to have an abstraction that prioritizes single-project deploys (which the current API does well) and let people with complicated setups deal with a slightly complicated API, which is fair and even congruent with what they're doing already. I'm thinking this is the least intrusive way to support multiple projects while keeping the user experience friendly to the majority of users with just one project. (Also, some users are on multiple projects because large projects execute slowly, which is also being addressed in 1.4.)
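To illustrate why the .ini limitation is awkward, here is one possible workaround, sketched with a JSON-encoded array stuffed into a single option. The dbt_docs_projects key and the label/dir fields are made up for this example and are not existing Cosmos settings; it works, but it is exactly the kind of "not pretty" config described above.

```python
import configparser
import json
from typing import Dict, List

# Hypothetical airflow.cfg fragment: the whole array lives in one string value.
EXAMPLE_CFG = """
[cosmos]
dbt_docs_projects = [{"label": "finance", "dir": "s3://bucket/finance/target"}, {"label": "marketing", "dir": "/opt/dbt/marketing/target"}]
"""


def load_docs_projects(cfg_text: str) -> List[Dict[str, str]]:
    """Parse a JSON array of docs projects out of an .ini option."""
    # interpolation=None keeps '%' characters in paths from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    raw = parser.get("cosmos", "dbt_docs_projects", fallback="[]")
    projects = json.loads(raw)
    if not isinstance(projects, list):
        raise ValueError("dbt_docs_projects must be a JSON array")
    return projects


if __name__ == "__main__":
    for project in load_docs_projects(EXAMPLE_CFG):
        print(project["label"], "->", project["dir"])
```

The single-project case stays untouched under this approach (the option is simply absent and the result is an empty list), which matches the goal of keeping the common setup simple.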
@dwreeves, we aim to release Cosmos 1.6 by the end of the month. Do you think you may have the bandwidth to work on this before? If not, what do you think about moving it to the 1.7 release or after?
Discussion here: https://github.com/astronomer/astronomer-cosmos/pull/737#issuecomment-1964349542
Better to get this done now before 1.4 ships than after, I think.
tldr: The main worry I have is that the API for supporting multiple projects' docs can get very convoluted. Ini file format / env vars for config do not natively support arrays of objects, so you need to do something that isn't pretty or is complicated (e.g. read from a yaml file) basically no matter what.
I'm leaving this as an open-ended issue to discuss solutions.