dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
10.02k stars 1.64k forks source link

[CT-1845] [Feature] Write a file of installed versions during `dbt deps` #6643

Open dbeatty10 opened 1 year ago

dbeatty10 commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

Problem

It's hard to determine the exact package versions that were installed during the dbt deps process.

Multi-part proposal

We should consider a feature with three phases, each subsequent one being optional:

  1. As part of dbt deps, write a file with the version of each of the installed packages
    • One possible name is packages.json
    • By default, write within a .gitignore'd directory (dbt_packages, target, or something else)
    • This file would be essentially saying, "I'm installing the X version of package Y from location Z"
    • ☝️ we do do this in structured logging
  2. Use this file to make a more compelling user experience(s)
    • Ex. a frozen set of packages à la pip freeze or pip-tools compile
    • Ex. compare the difference between packages asserted to be currently installed and those to-be downloaded; we could then optionally clean out-of-date versions and/or just install only the not-yet-installed versions
  3. Include dbt deps automatically as part of every relevant sub-command

The advantage provided by the 1st phase is that users (and systems!) would know the exact package versions that were installed during the dbt deps process.

I'm initially keeping these three phases together to preserve their common context, but we can split these out into separate issues at any time.

The 1st phase described above is actionable right now, but by the time we get to the 3rd one, we're definitely in speculative territory.

Describe alternatives you've considered

Currently, all dbt packages have a required version that is specified within dbt_project.yml. The problem is that this field has been vestigial its entire lifespan, and it can deviate (and has!) from its associate git tag version numbers that are used by the dbt Package Hub. Furthermore, the implementation of https://github.com/dbt-labs/dbt-core/issues/6603 will remove that version. So we can't count on it.

In my view, the best things to consider are all the sub-alternatives and design decisions within the 1st phase proposal:

Who will this benefit?

This could benefit multiple types of consumers:

  1. Human users that want to create a locked version of their packages.yml for repeatable builds
  2. Just knowing what package was installed (in order to lookup its feature set or bug status)
  3. Tracking the relative share of installations of each version for a particular package
  4. Probably some other uses as well

Are you interested in contributing this feature?

Can serve in a supporting role to whoever else picks this up

Anything else?

No response

jtcohen6 commented 1 year ago

Thanks for the brilliant write-up @dbeatty10!!

As you mentioned, while this isn't something our team has capacity to pick up right now, we'd be very open to supporting a motivated community contributor who would want to start by tackling phase 1. I believe it's directionally correct for the larger vision you've outlined, while also yielding some clear & uncontroversial benefits in the meantime.

joellabes commented 1 year ago

Love all of this. There is a 15% chance I'm the very community contributor you're looking for, but it might be a couple of months away so if someone else comes along and wants to get stuck in, feel free!

jtcohen6 commented 1 year ago

Which "version number" to record for non-Hub installs like git branch or hash, local packages, or tarballs The complications here might make us yearn for a required version field within dbt_project.yaml that is guaranteed not to disagree with any associated git tags...

My $0.02: It's okay if the user experience for these is a bit less compelling than it is for Hub packages. That's the point of the Hub, we can make nicer things happen by making additional metadata available. The analogy here is to the difference between pip install <pypi_package> and pip install git+<repo>. The former can use PyPI (package registry) metadata to resolve interdependencies more neatly, cache more reliably, etc.

For each of these cases, as we imagine moving into a phase 2 (dbt deps with "diffs"):

justbldwn commented 1 year ago

hello dbt team 👋 i just started working on some code locally for this ticket today and would be happy to help contribute to this ticket/discussions surrounding it moving forward.

@dbeatty10 i wanted to ask a few questions on what you outlined especially within phase 1 and 2.

As part of dbt deps, write a file with the version of each of the installed packages

Use this file to make a more compelling user experience(s)


as part of this ticket, would it also be okay to contribute a new feature on top of this? what i'm proposing is to break out dbt deps into sub-commands with the intention of serving the below purpose:

command description
dbt deps add add a package to packages.yml (new)
dbt deps update update a package in packages.yml (new)
dbt deps remove remove a package from packages.yml (new)
dbt deps install installs packages based on what's in packages.yml (formerly just dbt deps)

the add, update, remove sub-commands could take the following arguments:

argument description command
--package name of package to add to packages.yml add, update, remove
--version version of the package being packages.yml add, update
--location location to get package from (package (hub), git, local) add, update

i've already started working on this locally and i don't think it would add much work to the overall effort of this ticket. here's what it may look like within the CLI:

Screen Shot 2023-01-23 at 11 24 17 PM

i think this would be a helpful addition to the user experience of working with packages (via add, update, remove sub-commands) in the packages.yml file, while also giving a dedicated sub-command for a lot of the work in this ticket (via install sub-command) in a newly proposed packages.json to better track of what packages were last installed in a project. let me know your thoughts on this, happy to discuss some more before contributing any other code or anything like that. hope this helps, thanks!

jtcohen6 commented 1 year ago

Awesome to hear your interest @justbldwn!

Quick thoughts from me on the piece around new CLI commands:

dave-connors-3 commented 1 year ago

One of the most commonly requested enhancements to the dbt Cloud IDE is automating the dbt deps step within the development workflow -- I am extremely interested in this issue as a way to resolve that issue and have the whole community benefit!

Feels like having dbt commands do a check of the dbt deps artifact described above would be an extremely nice workflow here. I imagine this would also potentially help us with the request for cleaner dependency conflicts during dbt deps as well (can't find the issue)

@dbeatty10 -- definitely keen on pitching in here (likely in conjunction with @joellabes and @justbldwn !) Let me know how to best plug in here!

justbldwn commented 1 year ago

thanks for all the info @jtcohen6 ! if it's okay, i'll make a separate PR eventually relating to #5302 for the expansion of dbt deps into different sub-commands. good to know this issue is already outstanding, i hadn't seen it before!

as for the scope of this ticket, i just submitted an initial draft PR (#6735) for the ability to collect/write a json file to the dbt_packages/ directory with the installed packages + their dependencies. i'm happy to continue the discussion on the PR itself to know if the proposed format and location of the json file are okay and adjust where necessary.

aranke commented 1 year ago

After chatting with @dbeatty10 about phase 1, I think we should invert our approach to more closely resemble npm:

  1. Adding a packagedbt deps foo where foo can be one of a package published to hub.getdbt.com, a Git repo, a tarball, or a local file. This command will write to both packages.yml and a new lock file (tentatively packages-lock.yml).
  2. Installing packagesdbt deps installs packages like before, but preferentially installs from the lock file (if available) with a fallback to packages.yml.
  3. Updating/Removing packages – Edit packages.yml for now.

This approach has a few advantages:

  1. By committing the lock file to version control, dbt deps installs packages deterministically across all environments.
  2. dbt deps sans arguments speeds up significantly since it doesn't need to perform the expensive operation of resolving package versions. This operation is only performed when packages are added/updated/removed, which is good since lock files are read much, much more than they are written to.
  3. Even if newer versions of the packages listed in packages.yml appear, the lock file remains the same, thus making package updates opt-in instead of opt-out.
  4. dbt deps does not break behavior for users on older versions of dbt sans lock file.
justbldwn commented 1 year ago

hey @aranke, thanks for the feedback on this! i have some thoughts/questions below.

  1. Adding a packagedbt deps foo where foo can be one of a package published to hub.getdbt.com, a Git repo, a tarball, or a local file. This command will write to both packages.yml and a new lock file (tentatively packages-lock.yml).

would the idea here be to have foo include the source + package name as one string or allow for arguments to specify a name/source/revision, etc? something like:

dbt deps dbt-labs/dbt_utils@1.0.0 vs. dbt deps add --package dbt-labs/dbt_utils --version 1.0.0

or

dbt deps git+https://github.com/elementary-data/dbt-data-reliability@0.6.11 vs. dbt deps add --package https://github.com/elementary-data/dbt-data-reliability --version 0.6.11 --source git

with that, could this start to intrude on @jtcohen6 's stance in this comment that dbt deps isn't mean to be a full-fledged package manager?


  1. Installing packagesdbt deps installs packages like before, but preferentially installs from the lock file (if available) with a fallback to packages.yml.

the way i look at this, essentially the packages.yml and the packages-lock.yml would be identical (unless someone wants to manually adjust a version in the packages.yml, like you said in point number 3). but then if dbt is pointing to the packages-lock.yml to install from, how would someone's manual packages.yml changes ever get installed? should a command be included to update the lock manually, i.e. dbt deps lock?


dbt deps sans arguments speeds up significantly since it doesn't need to perform the expensive operation of resolving package versions. This operation is only performed when packages are added/updated/removed, which is good since lock files are read much, much more than they are written to.

do you mind diving into this point a little bit futher? i'm not fully following this, as i believe dbt deps will install all packages in the packages.yml file regardless of whether the entire packages.yml changes or if just one package is added.


i'm happy to contribute more to this to keep momentum going, i'm just hoping for a bit more direction on exactly which way you'd like to take this. thanks again!

noel commented 1 year ago

I think this is brilliant and would have saved several people hours yesterday. We need to save the commit hash because what I think happened in our chase is that a package we were using kept the same version after a change was made, so those who installed the package prior to the change were seeing a different behavior than those who installed it after the change. Making the install deterministic with a type of lock file in the repo would be ideal IMO.

aranke commented 1 year ago

@justbldwn Thanks for your detailed response and your energy in tackling this issue, it is definitely appreciated. My previous comment was broad strokes around what a proposed solution would feel like, but I left many gaps in how to implement such a solution. I think it is worth filling in those gaps here, so without further ado, here's how I see dbt implementing a lock file.


I see the purpose of the lock file to be two-fold:

  1. List all dependencies of a project, including transitive ones that are not explicitly specified in packages.yml. The package names in the lock file are a strict superset of those defined in packages.yml.
  2. Resolve the version range of each package mentioned above to a single version. The package version range defined in packages.yml is a strict superset of those defined in the lock file.

For example, consider the following packages.yml file:

packages:
  - package: calogica/dbt_expectations
    version: [">=0.8.0", "<0.9.0"]

Since dbt_expectations has a dependency on dbt_date, the lock file should look something like this:

packages:
  - package: calogica/dbt_expectations
    version: 0.8.2
  - package: calogica/dbt_date
    version: 0.7.2

With the introduction of the lock file, we now need to support three behaviors:

  1. lock (new, creates lock file)
    1. List all dependencies of a project, including transitive ones
    2. Resolve package versions for each dependency
    3. Write the lock file to disk
  2. deps <package> (new, adds package to packages.yml and lock file)
    1. Write new dependency to packages.yml
    2. Re-create the lock file from scratch using lock. During this process, it is fine to upgrade any package listed in packages.yml even if that package wasn't the one passed as an argument to dbt deps.
  3. deps (old, installs packages defined in lock file with fall back to packages.yml)
    1. If the lock file doesn't exist, run lock to create a lock file.
    2. Verify that the version and package constraints in the lock file don't conflict with those in packages.yml. If they do, re-create the lock file using lock.
    3. Write the packages specified in the lock file to the dbt_packages directory (or override in config). In the happy path scenario, most invocations of dbt deps skip directly here after verifying that the lock file is in sync with packages.yml as outlined in step 2 above. Since the expensive resolution steps are already handled by lock, this step should run really fast since it is a raw downloader.

It's getting late here, and this comment is already really long, so I'll wrap up for now. Hopefully, this is enough to get you unblocked.

There are two things I'll cover in a follow-up post (or @jtcohen6 can weigh in):

  1. What should be included in the lock file to address @noel's concern. If you can get the lock file to a state outlined in section 2, we can polish the finer bits, either here in discussion or in a PR.
  2. Which package installation sources we should support.

Once again, thank you so much for your willingness to help here and look forward to your contributions!

justbldwn commented 1 year ago

@aranke thank you again so much for providing all this info and context when you did. it was extremely helpful and definitely allowed me to proceed forward. i'm hoping to submit a PR soon with changes 🎉 .

one question i came across when working with deps code is related to the dbt_packages/ directory. simply put, should the packages installed in dbt_packages/ persist regardless of what's in packages.yml/packages-lock.yml?

lets say i have 5 packages listed in my packages.yml that looks like this:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.4
  - git: "https://github.com/splitgraph/ab2ft_stripe"
    revision: 0.1.0
  - package: dbt-labs/dbt_utils
    version: 0.8.4
  - git: "https://github.com/google/fhir-dbt-analytics"
    revision: 386e8430a4b9735466c90c3313ae67c31dbef58f
  - package: everpeace/dbt_models_metadata
    version: 0.1.0

and i decide that i only want my project's packages.yml to look like this going forward:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.0.0

should we wipe dbt_packages/ and only installed what's defined to also stay in sync with what's going to be defined in the packages.yml and the new packages-lock.yml? currently, any installed packages throughout a project's lifetime persist in the dbt_packages/ directory, regardless of what's defined in packages.yml.

hope that makes sense - let me know if you have any comments/questions!

aranke commented 1 year ago

@justbldwn Good catch, yes, we should make sure that the state in packages-install-path (dbt_packages/ is the default) lines up with packages.yml and package-lock.yml.

For now, this can take the form of wiping and recreating the directory, but in the future we should do the following instead:

  1. Identify the hashes of packages installed in the packages-install-path.
  2. Compare that against hashes of packages specified in package-lock.yml.
  3. Remove extraneous packages and install missing ones.
  4. If a package's hash is inconsistent between the directory and the lock file, re-install a fresh copy of the package from the lock file.

Furthermore, I think we should call the file package-lock.yml since the article (lock) is singular and follows convention from npm.

justbldwn commented 1 year ago

PR submitted #6735 and marked ready for review! there is a lot of context and info included in the PR summary. not sure if discussion should continue here or there, but happy to help whichever way i can 👍

@aranke @jtcohen6 @dbeatty10 @dave-connors-3

ChenyuLInx commented 1 year ago

8408 only achieved half of this.

Will add more details about the later half soon

philippeboyd commented 1 year ago

Hello @ChenyuLInx @justbldwn @aranke We're in a bit of an awkward situation where PR #8408 (available in 1.7.0rc1) broke #5374 (fixed with PR #8322 by me)

Now the problem is with the method _create_packages_yml_entry() where we don't take into account the subdirectory for a git repository and also package.name is used but had been modified in core/dbt/deps/git.py

Now that I understand what you're trying to do here and there might be something I can do about the name property in the previous file (I can work on a fix but the name corresponds to either the git repo or git repo + subdirectory if any) but we also need to take into account the subdirectory in the lock file in #6735

The lock file created doesn't respect the packages.yaml:

packages:
  - package: "calogica/dbt_date"
    version: [ ">=0.5.4", "<0.8.0" ]
  - package: "dbt-labs/dbt_utils"
    version: [ ">=1.0.0", "<2.0.0" ]
  - git: "git@github.com:acme/git-repo.git"
    subdirectory: "dbt-source-pkg-1"
    revision: "dbt-source-pkg-1-v0.1.12"
  - git: "git@github.com:acme/git-repo.git"
    subdirectory: "dbt-source-pkg-2"
    revision: "dbt-source-pkg-2-v0.1.2"

It will instead create the following lock file

❯ cat package-lock.yml
packages:
- package: calogica/dbt_date
  version: 0.7.2
- package: dbt-labs/dbt_utils
  version: 1.1.1
- git: git@github.com:acme/git-repo.git/dbt-source-pkg-1
  revision: dbt-source-pkg-1-v0.1.12
- git: git@github.com:acme/git-repo.git/dbt-source-pkg-2
  revision: dbt-source-pkg-1-v0.1.2
sha1_hash: c42f2<redacted>f258

Resulting in a problem when executing dbt deps since git@github.com:acme/git-repo.git/dbt-source-pkg-1 is not a valid repository name

ryan-pip commented 1 year ago

@philippeboyd, I am using http not ssh for the git repos. These are still resolving and cloning properly even with the subdirectory on the end. I just needed to add the subdirectory line into the lock file so dbt deps knew to find the dbt_project.yml. Not sure if this helps.

philippeboyd commented 1 year ago

Hi @ryan-pip, of course if you manually modify the lock file to add the subdirectory at the proper place it will work but it doesn't fix the problem that I mentioned in my previous post.

packages:
- git: git@github.com:philippeboyd/dbt-monorepo.git
  subdirectory: dbt-utils-main
  revision: v0.1.0
- git: git@github.com:philippeboyd/dbt-monorepo.git
  subdirectory: dbt-date-main
  revision: v0.1.0
sha1_hash: d3336208deda31c19fe1230443c19f7000d3e40e

These are still resolving and cloning properly even with the subdirectory on the end.

Doubt. See my example below.

To replicate the issue easily, I created a repository https://github.com/philippeboyd/dbt-monorepo.git with dbt-date and dbt-utils as the 2 subdirectories in that mono repo.

I tried both the following (git and https) and got the same result

packages:
  - git: "git@github.com:philippeboyd/dbt-monorepo.git"
    subdirectory: "dbt-utils-main"
    revision: "v0.1.0"
  - git: "git@github.com:philippeboyd/dbt-monorepo.git"
    subdirectory: "dbt-date-main"
    revision: "v0.1.0"
packages:
  - git: "https://github.com/philippeboyd/dbt-monorepo.git"
    subdirectory: "dbt-utils-main"
    revision: "v0.1.0"
  - git: "https://github.com/philippeboyd/dbt-monorepo.git"
    subdirectory: "dbt-date-main"
    revision: "v0.1.0"

Respective package-lock.yml files

packages:
- git: git@github.com:philippeboyd/dbt-monorepo.git/dbt-utils-main
  revision: v0.1.0
- git: git@github.com:philippeboyd/dbt-monorepo.git/dbt-date-main
  revision: v0.1.0
sha1_hash: d3336208deda31c19fe1230443c19f7000d3e40e
packages:
- git: https://github.com/philippeboyd/dbt-monorepo.git/dbt-utils-main
  revision: v0.1.0
- git: https://github.com/philippeboyd/dbt-monorepo.git/dbt-date-main
  revision: v0.1.0
sha1_hash: 140e9796f695ebcf548343394d4aa90c3480f477

Both will result in the respective error

❯ dbt deps
02:58:44  Running with dbt=1.7.0-rc1
02:58:51  Updating lock file in file path: /tmp/tmp.lIjWExOoUt/tester/package-lock.yml
02:58:51  Installing git@github.com:philippeboyd/dbt-monorepo.git/dbt-utils-main
02:58:51  Encountered an error:
Internal Error
  Error checking out spec='None' for repo git@github.com:philippeboyd/dbt-monorepo.git/dbt-utils-main
  Cloning into 'cbe032b9b9b68984ccc507ae7f57690c'...
  fatal: remote error:
   philippeboyd/dbt-monorepo.git/dbt-utils-main is not a valid repository name
  Visit https://support.github.com/ for help
❯ dbt deps
02:59:22  Running with dbt=1.7.0-rc1
02:59:25  Updating lock file in file path: /tmp/tmp.lIjWExOoUt/tester/package-lock.yml
02:59:25  Installing https://github.com/philippeboyd/dbt-monorepo.git/dbt-utils-main
02:59:25  Encountered an error:
Internal Error
  Error checking out spec='None' for repo https://github.com/philippeboyd/dbt-monorepo.git/dbt-utils-main
  Cloning into 'fded50563e6648f3ddd7ba6d2bf6fc14'...
  remote: Not Found
  fatal: repository 'https://github.com/philippeboyd/dbt-monorepo.git/dbt-utils-main/' not found

Now if I install it with prior version 1.7.0-b2, it works

❯ dbt deps
03:07:08  Running with dbt=1.7.0-b2
03:07:15  Installing git@github.com:philippeboyd/dbt-monorepo.git/dbt-utils-main
03:07:16  Installed from revision v0.1.0
03:07:16  and subdirectory dbt-utils-main
03:07:16  Installing git@github.com:philippeboyd/dbt-monorepo.git/dbt-date-main
03:07:17  Installed from revision v0.1.0
03:07:17  and subdirectory dbt-date-main
ChenyuLInx commented 1 year ago

@philippeboyd Sorry for seeing this late. And thanks a lot for making that public repo and make this easy to reproduce I looked through the changes, and seems that your change with formatting the subdirectories into name is the right move. I am gonna rework the deps lock piece to make sure the proper lock file is generated for this situation. WIP at https://github.com/dbt-labs/dbt-core/pull/9019

nevdelap commented 1 year ago

Hi all. Before I open a bug as a separate issue, I want to run it past the people collaborating on this issue and see if anyone disagrees with my suggestion?

package-lock.yml doesn't pass yamllint because it doesn't have the --- document separator at the top, and the list within packages: isn't indented.

We'd like it to pass yamllint. As a workaround we're fixing it before committing.

Thoughts?