dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.72k stars 1.61k forks source link

[CT-328] [Feature] Package Enabled Flag #4837

Closed csoule-shaker closed 1 year ago

csoule-shaker commented 2 years ago

Is there an existing feature request for this?

Describe the Feature

the ability to set an enabled flag in the packages.yml file to have dbt deps skip that package. The flag should also be able to be set using a variable.

Describe alternatives you've considered

Thought about using a Macro to iterate over a list variable and creating the package code and placing the macro in the packages.yml file, but deps doesn't seem to support this either.

Who will this benefit?

Some packages reference dependent packages that are only used if you enable them through a variable in your project.yml. If a higher level packages references 10 dependents and I am only using 3, being able to set enabled flags using variables would greatly cur down on project clutter, unnecessary documentation in the docs, and reduce project size in the IDE and at runtime.

This recently caused errors in dbt Cloud due to the IDE not able to handle 50MB of files and resulted in not being able to display the folder/file lists.

Are you interested in contributing this feature?

I'm not sure how deps code works, but I can definitely consult on the solution.

Anything else?

Should be fairly simple to parse the flag out of the project,yml and use it as a filter in the function that lists the packages to install.

This would also be a great feature for package developer ecosystem.

emmyoop commented 2 years ago

Thanks for the writeup @csoule-shaker. I'm not sure I understand so I'm going to need a bit of clarification on what you're asking for.

Are you saying you have your packages.yml defined in your project as

packages:
  - package: dbt-labs/adwords
    version: 0.4.0
  - package: dbt-labs/audit_helper
    version: 0.5.0
  - package: dbt-labs/codegen
    version: 0.5.0
  - package: dbt-labs/dbt_external_tables
    version: 0.8.0
  - package: dbt-labs/facebook_ads
    version: 0.8.0

but you're only using 3 of the 5 packages you have defined so you want to be able to say

packages:
  - package: dbt-labs/adwords
    version: 0.4.0
    enabled: true
  - package: dbt-labs/audit_helper
    version: 0.5.0
    enabled: false
  - package: dbt-labs/codegen
    version: 0.5.0
    enabled: true
  - package: dbt-labs/dbt_external_tables
    version: 0.8.0
    enabled: true
  - package: dbt-labs/facebook_ads
    version: 0.8.0
    enabled: false

If this is the correct scenario, can you share what the need to have unused packages defined is? Are you sharing this file across multiple dbt projects?

If I am not understanding correctly, an example of the problem you're trying to resolve might be hugely beneficial to me getting on the same page as you!

csoule-shaker commented 2 years ago

Essentially you have the gist of it. This feature is less for direct end user projects and more for developers that create packages for others. As an example we want to use a package that brings together data from models generated by other packages, making these packages a dependency. This package is called ad_reporting as listed below. Once that package is installed, it also has a package.yml with all the possible dependent package in it.

I only need to use Facebook, Google, and Snapchat ad packages so they already use variables so I can turn off the other models for the other sources. Unfortunately, since the ad_reporting package could use the other sources, we get all the other packages installed as well even if we do not need them, which causes bloat in documentation, space issues, and just clutter in general.

If we were able to leverage those same variables that enable the models on the packages.yml, then we could have the end user only install the packages they need when they need them. This also gives a lot of flexibility to designing packages and I think is a natural feature to introduce.

Feature Behavior: The default value would be true if the flag was not defined. If the flag is defined it should be able to use variables as well as true/false values.

Below I tried to detail out how I envision using tis flag in conjunction with packages that have optional dependencies.

Example: Main Projects Project.yml file

vars: linkedin_ads_enable: False ##Overrides package variable twitter_ads_enable: False ##Overrides package variable

Main Projects Package.yml file

packages:

Ad_Reporting Project.yml file (Default package behavior)

vars: linkedin_ads_enable: True twitter_ads_enable: True facebook_ads_enable: True google_ads_enable: True snapchat_ads_enable: True

Ad Reporting’s Package.yml file

packages:

Chris Soule InterWorks, Inc. | Data Architect t: 202/834-8549 www.interworks.comhttps://www.interworks.com/

From: Emily Rockman @.> Sent: Monday, March 14, 2022 4:48 PM To: dbt-labs/dbt-core @.> Cc: csoule-shaker @.>; Mention @.> Subject: Re: dbt-labs/dbt-core [Feature] Package Enabled Flag (Issue #4837)

Thanks for the writeup @csoule-shakerhttps://github.com/csoule-shaker. I'm not sure I understand so I'm going to need a bit of clarification on what you're asking for.

Are you saying you have your packages.yml defined in your project as

packages:

but you're only using 3 of the 5 packages you have defined so you want to be able to say

packages:

If this is the correct scenario, can you share what the need to have unused packages defined is? Are you sharing this file across multiple dbt projects?

If I am not understanding correctly, an example of the problem you're trying to resolve might be hugely beneficial to me getting on the same page as you!

— Reply to this email directly, view it on GitHubhttps://github.com/dbt-labs/dbt-core/issues/4837#issuecomment-1067271947, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AWRLYJ6YRTVV3RFWY2OPC5TU76QY7ANCNFSM5QEEIE4A. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you were mentioned.Message ID: @.**@.>>

jeremyyeo commented 2 years ago

@emmyoop - there are packages that contain many other packages - the dbt_ad_reporting package contains all these dependencies:

packages:
- package: fivetran/pinterest
  version: [">=0.5.0", "<0.6.0"]

- package: fivetran/microsoft_ads
  version: [">=0.4.0", "<0.5.0"]

- package: fivetran/linkedin
  version: [">=0.4.0", "<0.5.0"]

- package: fivetran/google_ads
  version: [">=0.6.0", "<0.7.0"]

- package: fivetran/tiktok_ads
  version: [">=0.1.0", "<0.2.0"]

- package: fivetran/twitter_ads
  version: [">=0.4.0", "<0.5.0"]

- package: fivetran/facebook_ads
  version: [">=0.4.0", "<0.5.0"]

- package: fivetran/snapchat_ads
  version: [">=0.3.0", "<0.4.0"]

Even if a user may only need to use the google_ads and facebook_ads packages for example (in tandem with some core models/macros provided by the higher level dbt_ad_reporting package). It would be useful to have the ability to skip certain lower level dependencies that is not needed - not too sure what the API looks like but perhaps:

packages:
  - package: fivetran/ad_reporting
    version: [">=0.7.0", "<0.8.0"]
    include_only: ["google_ads", "facebook_ads"]

Which would make dbt deps not install all 8 of those sub-packages when trying to use the dbt_ad_reporting package but just what is required (ad_reporting and it's core models / macros, fivetran/google_ads, fivetran/facebook_ads).


For the record, the on-disk install size when you try to use dbt_ad_reporting is 50 MB (this is due to all the integration csv files and the docs files contained within all 8 of those sub-packages.

Generally this is not an issue but the dbt Cloud IDE doesn't play nicely when you have such a large on-disk dbt_packages/ directory. One of the workarounds currently is to fork each of those 8 packages and remove the bloat (https://github.com/jeremyyeo/dbt_ad_reporting) which trims the on-disk size by ~60% (which may be only marginally helpful in the dbt Cloud IDE tbh).

There's probably a wider conversation to be had here around not including large files (csv / json) when using dbt packages which is separate to this particular ask ("allow users to specify what to include / exclude when packages contain other sub-packages") but thought I'd add this to provide a bit more colour to the motivation for this particular issue.

emmyoop commented 2 years ago

@csoule-shaker and @jeremyyeo thank you for all the clarification. I'm on the same page as you now!

I think @jeremyyeo is very close to the real issue with

There's probably a wider conversation to be had here around not including large files (csv / json) when using dbt packages

I think now is the time to have that conversation! If we could cut down on package size by excluding those csv/json files, that would solve the real problem here. Definitely speak up if there a reason other than package size to exclude/include specific packages via an enabled flag!

Banning the use of these large files in packages isn't the right answer. The ad_reporting files here are what allow the docs to exist here which is pretty neat. It's powerful to be able to navigate the package this way without having it installed. The csv files in the project are for integration tests which are another good practice we definitely don't want to discourage or make difficult.

The right answer is to have a way to mark these files/directories as not needing to be downloaded as part of the package you're including in your project (or a dependency of that project). .dbtignore anyone? 😉

In all honestly, I don't actually know the exact solution here. It could involve how we handle packages in the dbt hub, how we cache dependancies or how we handle installation within dbt-core itself. This is all very worthwhile work to solve the underlying issue that we will need to get prioritized.

csoule-shaker commented 2 years ago

Lert me start by saying I definitely see that there is still a reason to still exclude packages using a enabled flag.

Where there is an issue in dbt Cloud IDE not handling large packages/projects due to this specific issue, that does not invalidate the need for the enable flag feature or the benefits of it.

I wanted to keep these two things separate due to the confusion, as I am not trying to solve for dbt Cloud with this feature request, but for giving the package ecosystem a much needed option for allowing end user to not clutter their projects with optional dependent packages they might not use. As an end user of the ad_reporting package myself for a client, it is hard to tout dbt great documentation feature and then tell them to ignore all the other packages that got installed due to this one package even though they are only using 3-8.

I think this feature would be relatively easy to implement and help this aspect of the problem.

For the other aspect of large files in a dbt package, I like the .dbtignore feature for selecting files folders to not include in the package downloads in order to help minimize the file size, but also think with the growth of dbt and dbt packages, we would still need to find a way to handle large project sizes in the IDE if we want the adoption rate to continue. It was a challenge to work in dbt Cloud when it could not load the folder paths for clients.

I do want to re-iterate though. The feature I am requesting with this is not to address the dbt Cloud IDE issue regarding a projects size, even though it is what led me to it. I still think the need is their for package developers to give user an option to not download the optional dependencies of a higher level package if they do not need them. Package providers are already using optional packages in higher package and as of right now the users are forced to download them and have them as part of their build and in their documentation because there is a lack of a feature to handle this.

I would gladly be part of a separate ticket addressing the dbt Cloud IDE issue on Storage size of a project, but this is a separate feature.

Chris Soule InterWorks, Inc. | Data Architect t: 202/834-8549 www.interworks.comhttps://www.interworks.com/

From: Emily Rockman @.> Sent: Tuesday, March 15, 2022 3:32 PM To: dbt-labs/dbt-core @.> Cc: csoule-shaker @.>; Mention @.> Subject: Re: dbt-labs/dbt-core [Feature] Package Enabled Flag (Issue #4837)

@csoule-shakerhttps://github.com/csoule-shaker and @jeremyyeohttps://github.com/jeremyyeo thank you for all the clarification. I'm on the same page as you now!

I think @jeremyyeohttps://github.com/jeremyyeo is very close to the real issue with

There's probably a wider conversation to be had here around not including large files (csv / json) when using dbt packages

I think now is the time to have that conversation! If we could cut down on package size by excluding those csv/json files, that would solve the real problem here. Definitely speak up if there a reason other than package size to exclude/include specific packages via an enabled flag!

Banning the use of these large files in packages isn't the right answer. The ad_reporting files herehttps://github.com/fivetran/dbt_ad_reporting/tree/main/docs are what allow the docs to exist herehttps://fivetran.github.io/dbt_ad_reporting/#!/overview which is pretty neat. It's powerful to be able to navigate the package this way without having it installed. The csv files in the project are for integration tests which are another good practice we definitely don't want to discourage or make difficult.

The right answer is to have a way to mark these files/directories as not needing to be downloaded as part of the package you're including in your project (or a dependency of that project). .dbtignore anyone? 😉

In all honestly, I don't actually know the exact solution here. It could involve how we handle packages in the dbt hub, how we cache dependancies or how we handle installation within dbt-core itself. This is all very worthwhile work to solve the underlying issue that we will need to get prioritized.

— Reply to this email directly, view it on GitHubhttps://github.com/dbt-labs/dbt-core/issues/4837#issuecomment-1068383928, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AWRLYJ4UGCM2DDMPXKLHWYLVADQRLANCNFSM5QEEIE4A. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you were mentioned.Message ID: @.**@.>>

jeremyyeo commented 2 years ago

@csoule-shaker you're right about the separation of issues here - moved that to #4868 so that we can have that discussion separately.

emmyoop commented 2 years ago

This is great feedback, @csoule-shaker. I obviously missed that detail in the previous replies.

@jeremyyeo opened #4868 (thank you!) to track what I discussed above. I'll leave this one open to track any conversation more directly related to flags to enable specific projects.

There's a complexity with the current architecture where vars are not available in the packages.yml file. I'm going to tag in @gshank to see if she can shed any more light on a possible way to achieve this.

gshank commented 2 years ago

Currently command line vars are available in packages.yml (though there seems to be a bug related to that too) but the vars in the dbt_project file are not available. This is an issue that comes up pretty often in one way or another... I'm wondering if it might not make sense to move vars out into a separate file (vars.yml). Then we could process it before dbt_project.yml, selectors.yml, and packages.yml, and substitute vars. Or we could look for them separately and combine them with cli_vars and achieve a similar thing.

@jtcohen6 I know we've discussed this before , but I think it was in a "might be nice" context. It probably wouldn't be too hard to pull out. And the work that's being done to separate out dbt_project processing might help too.

jtcohen6 commented 2 years ago

I missed these comments a month ago. The extra context helps a lot!

There should be some way to perform "partial" import of a big package such as ad_reporting, akin to extras for Python packages. That could be defined in two places:

One way of doing this, which we're describing above, is by having the ad_reporting package define its dependencies dynamically. The ability to use project-defined vars in package.yml would be the most ergonomic way of instrumenting that (fuller discussion in #4938), but it would be possible to use env_var to set those flags today:


packages:
- package: fivetran/pinterest
  version: [">=0.5.0", "<0.6.0"]
  enabled: "{{ env_var('DBT_FIVETRAN_PINTEREST_ENABLED', True) }}"
- package: fivetran/microsoft_ads
  version: [">=0.4.0", "<0.5.0"]
  enabled: "{{ env_var('DBT_FIVETRAN_MICROSOFT_ADS_ENABLED', True) }}"
...

A simpler approach: Could this just look like the end user providing static enabled configs in their packages.yml, which might override the ones defined the ad_reporting package's packages.yml?

packages:
  # i want this
  - package: fivetran/ad_reporting
    version: [">=0.7.0", "<0.8.0"]
  # but not these
  - package: fivetran/pinterest
    enabled: False
  - package: fivetran/microsoft_ads
    enabled: False

Of course, the ad_reporting package would need to ensure it still works if those parent packages are missing. That should be effectively the same as if they're entirely disabled, but we'd want to double-check.

This might also have interesting implications for the Hub API, which captures the upstream dependencies of every Hub-registered package to aid in version resolution. I think that might be fine, but hard to say for sure.

csoule-shaker commented 2 years ago

Hi Jeremy,

Thanks for taking this up again.

I still think the first method you described would be the best approach here. Putting my end user hat on, it doesn’t require me to break out a bunch of dependent packages that are already listed inside the monolithic package I’m trying to utilize. I know the Fivetran package is what brought this up but also, wanted to put out their that this applies to other vendors ability to produce these large packages as well and not have a bunch a bloat come with it if the consumer is not using certain pieces.

The Ad_reporting package as of today already utilizes the concept of turning off certain models that the client doesn’t need by utilizing variables in their project that are enabled by default and their documentation tells you to add the same variables in your main project.yml and set the value to disabled if you do not need them. So this pattern has already been established and works quite well for models. I think it would be a relatively easy lift for the vendor to incorporate a similar approach to the dependent packages as well.

As for the env_var flags controlling the enabled flag (I didn’t think packages had the ability to use this flag yet either though) that is a good work around or interm approach to the full solution granted the packages.yml file supports the enabled flag. I still think the end result should be to have the project.yml be scanned for variables first, or create a new yml that supports just variables as someone else mentioned, that you can define in the IDE itself (thinking dbt Cloud).

“having the ad_reporting package define its dependencies dynamically”

packages:

...


A simpler approach: Could this just look like the end user providing static enabled configs in their packages.yml, which might override the ones defined the ad_reporting package's packages.yml?

This approach would not be ideal for some of the reasons you mentioned. This takes the responsibility out of the Package vendors hands to support it’s own dependencies and could create more headaches for end users in the end. The vendors should be able to decide which dependent packages are variable and can be turned off, and the above method seems to work around that.

dataders commented 2 years ago

extras for Python packages.

doubling down that this seems the best way to go. This way end users don't have to ignore files they won't use, they simply won't be installed!

csoule-shaker commented 2 years ago

Hey everyone, Just wondering if there was any traction on this enhancement or if it was on the list yet?

joellabes commented 2 years ago

A related use case: @christineberger is adding a new feature to dbt utils (https://github.com/dbt-labs/dbt-utils/pull/624) and we wanted to be able to use a dbt-expectations test in the CI project.

We weren't able to install it from the Hub:

packages:
  - local: ../
  - package: calogica/dbt_expectations
    version: [">=0.5.0", "<0.6.0"]

because expectations also depends on utils and then there were two utils projects in play.

If we could have said "please don't install dbt-labs/dbt_utils" then we wouldn't have had to copy-paste the code over 😅

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.