Open dbeatty10 opened 1 year ago
Thanks for the brilliant write-up @dbeatty10!!
As you mentioned, while this isn't something our team has capacity to pick up right now, we'd be very open to supporting a motivated community contributor who would want to start by tackling phase 1. I believe it's directionally correct for the larger vision you've outlined, while also yielding some clear & uncontroversial benefits in the meantime.
Love all of this. There is a 15% chance I'm the very community contributor you're looking for, but it might be a couple of months away so if someone else comes along and wants to get stuck in, feel free!
Which "version number" to record for non-Hub installs like git branch or hash, local packages, or tarballs The complications here might make us yearn for a required version field within dbt_project.yaml that is guaranteed not to disagree with any associated git tags...
My $0.02: It's okay if the user experience for these is a bit less compelling than it is for Hub packages. That's the point of the Hub, we can make nicer things happen by making additional metadata available. The analogy here is to the difference between pip install <pypi_package>
and pip install git+<repo>
. The former can use PyPI (package registry) metadata to resolve interdependencies more neatly, cache more reliably, etc.
For each of these cases, as we imagine moving into a phase 2 (dbt deps
with "diffs"):
git
packages: We could go one of two ways. (1) We just track the associated revision
(branch/tag name), and we always assume that it hasn't changed remotely. We give users the option of a dbt deps --force
or --reinstall
if they've knowingly updated a remote revision by the same name. Or, (2) packages.json
should include the actually installed commit SHA from the previous install, and the deps
"resolution" process (before/separate from actual installation) should "discover" the latest commit SHA and compare. The latter is fancier, but it also requires more communication with the git remote, which adds time/overhead to the experience.local
installs are generally symlinks (though not on Windows!), so in theory there's no need to re-"install"tarball
installs point to an exact URL. We should assume that if the URL hasn't changed, the remote contents haven't either.hello dbt team 👋 i just started working on some code locally for this ticket today and would be happy to help contribute to this ticket/discussions surrounding it moving forward.
@dbeatty10 i wanted to ask a few questions on what you outlined especially within phase 1 and 2.
As part of dbt deps, write a file with the version of each of the installed packages
dbt deps
was last run? similar to a poetry.lock
file to it's pyproject.toml
counterpart?Use this file to make a more compelling user experience(s)
packages.json
being the output of what was last installed, do you think it may be better to rely on packages.yml
for the user experience side of things? i.e. packages.yml
is for what a user defines as their intended packages/versions, but packages.json
is the output of what the environment is currently using based on last install? just a thought and want to make sure i'm clear on the requirements for this.as part of this ticket, would it also be okay to contribute a new feature on top of this? what i'm proposing is to break out dbt deps
into sub-commands with the intention of serving the below purpose:
command | description |
---|---|
dbt deps add |
add a package to packages.yml (new) |
dbt deps update |
update a package in packages.yml (new) |
dbt deps remove |
remove a package from packages.yml (new) |
dbt deps install |
installs packages based on what's in packages.yml (formerly just dbt deps ) |
the add
, update
, remove
sub-commands could take the following arguments:
argument | description | command |
---|---|---|
--package |
name of package to add to packages.yml |
add , update , remove |
--version |
version of the package being packages.yml |
add , update |
--location |
location to get package from (package (hub), git, local) | add , update |
i've already started working on this locally and i don't think it would add much work to the overall effort of this ticket. here's what it may look like within the CLI:
i think this would be a helpful addition to the user experience of working with packages (via add
, update
, remove
sub-commands) in the packages.yml
file, while also giving a dedicated sub-command for a lot of the work in this ticket (via install
sub-command) in a newly proposed packages.json
to better track of what packages were last installed in a project. let me know your thoughts on this, happy to discuss some more before contributing any other code or anything like that. hope this helps, thanks!
Awesome to hear your interest @justbldwn!
Quick thoughts from me on the piece around new CLI commands:
packages.yml
, and some ideas/conversation in there that are still relevant: https://github.com/dbt-labs/dbt-core/issues/5302pip
, cargo
) don't tend to support this sort of functionality out-of-the-box. Rather, there are other popular extensions/plugins that add in these nice-to-have CLI workflows (e.g. poetry add
).dbt deps
is not built to be a full-fledged package manager. It's good for what it's good for, and it gets users a lot today, no question! But the more features we add that make deps
look like a fully featured package manager, the greater the risk that folks will expect it to be one. We lack the right foundation, appetite, and maintenance capacity to back that up.deps
functionality into its own module/package (dbt-deps
), included in the "core" dbt distribution but clearly separated from dbt-core
. I would feel better about significantly expanding its functionality after we've drawn those clear interface boundaries, and have a dedicated place for folks to bring their deps
-related issues & ideas.One of the most commonly requested enhancements to the dbt Cloud IDE is automating the dbt deps
step within the development workflow -- I am extremely interested in this issue as a way to resolve that issue and have the whole community benefit!
Feels like having dbt commands do a check of the dbt deps
artifact described above would be an extremely nice workflow here. I imagine this would also potentially help us with the request for cleaner dependency conflicts during dbt deps as well (can't find the issue)
@dbeatty10 -- definitely keen on pitching in here (likely in conjunction with @joellabes and @justbldwn !) Let me know how to best plug in here!
thanks for all the info @jtcohen6 ! if it's okay, i'll make a separate PR eventually relating to #5302 for the expansion of dbt deps into different sub-commands. good to know this issue is already outstanding, i hadn't seen it before!
as for the scope of this ticket, i just submitted an initial draft PR (#6735) for the ability to collect/write a json file to the dbt_packages/
directory with the installed packages + their dependencies. i'm happy to continue the discussion on the PR itself to know if the proposed format and location of the json file are okay and adjust where necessary.
After chatting with @dbeatty10 about phase 1, I think we should invert our approach to more closely resemble npm
:
dbt deps foo
where foo
can be one of a package published to hub.getdbt.com, a Git repo, a tarball, or a local file. This command will write to both packages.yml
and a new lock file (tentatively packages-lock.yml
).dbt deps
installs packages like before, but preferentially installs from the lock file (if available) with a fallback to packages.yml
.packages.yml
for now.This approach has a few advantages:
dbt deps
installs packages deterministically across all environments.dbt deps
sans arguments speeds up significantly since it doesn't need to perform the expensive operation of resolving package versions. This operation is only performed when packages are added/updated/removed, which is good since lock files are read much, much more than they are written to.packages.yml
appear, the lock file remains the same, thus making package updates opt-in instead of opt-out.dbt deps
does not break behavior for users on older versions of dbt
sans lock file.hey @aranke, thanks for the feedback on this! i have some thoughts/questions below.
- Adding a package –
dbt deps foo
wherefoo
can be one of a package published to hub.getdbt.com, a Git repo, a tarball, or a local file. This command will write to bothpackages.yml
and a new lock file (tentativelypackages-lock.yml
).
would the idea here be to have foo
include the source + package name as one string or allow for arguments to specify a name/source/revision, etc? something like:
dbt deps dbt-labs/dbt_utils@1.0.0
vs. dbt deps add --package dbt-labs/dbt_utils --version 1.0.0
or
dbt deps git+https://github.com/elementary-data/dbt-data-reliability@0.6.11
vs. dbt deps add --package https://github.com/elementary-data/dbt-data-reliability --version 0.6.11 --source git
with that, could this start to intrude on @jtcohen6 's stance in this comment that dbt deps
isn't mean to be a full-fledged package manager?
- Installing packages –
dbt deps
installs packages like before, but preferentially installs from the lock file (if available) with a fallback topackages.yml
.
the way i look at this, essentially the packages.yml
and the packages-lock.yml
would be identical (unless someone wants to manually adjust a version in the packages.yml
, like you said in point number 3). but then if dbt is pointing to the packages-lock.yml
to install from, how would someone's manual packages.yml
changes ever get installed? should a command be included to update the lock manually, i.e. dbt deps lock
?
dbt deps sans arguments speeds up significantly since it doesn't need to perform the expensive operation of resolving package versions. This operation is only performed when packages are added/updated/removed, which is good since lock files are read much, much more than they are written to.
do you mind diving into this point a little bit futher? i'm not fully following this, as i believe dbt deps
will install all packages in the packages.yml
file regardless of whether the entire packages.yml
changes or if just one package is added.
i'm happy to contribute more to this to keep momentum going, i'm just hoping for a bit more direction on exactly which way you'd like to take this. thanks again!
I think this is brilliant and would have saved several people hours yesterday. We need to save the commit hash because what I think happened in our chase is that a package we were using kept the same version after a change was made, so those who installed the package prior to the change were seeing a different behavior than those who installed it after the change. Making the install deterministic with a type of lock file in the repo would be ideal IMO.
@justbldwn Thanks for your detailed response and your energy in tackling this issue, it is definitely appreciated.
My previous comment was broad strokes around what a proposed solution would feel like, but I left many gaps in how to implement such a solution.
I think it is worth filling in those gaps here, so without further ado, here's how I see dbt
implementing a lock file.
I see the purpose of the lock file to be two-fold:
packages.yml
. The package names in the lock file are a strict superset of those defined in packages.yml
.packages.yml
is a strict superset of those defined in the lock file.For example, consider the following packages.yml
file:
packages:
- package: calogica/dbt_expectations
version: [">=0.8.0", "<0.9.0"]
Since dbt_expectations
has a dependency on dbt_date
, the lock file should look something like this:
packages:
- package: calogica/dbt_expectations
version: 0.8.2
- package: calogica/dbt_date
version: 0.7.2
With the introduction of the lock file, we now need to support three behaviors:
lock
(new, creates lock file)
deps <package>
(new, adds package to packages.yml
and lock file)
packages.yml
lock
.
During this process, it is fine to upgrade any package listed in packages.yml
even if that package wasn't the one passed as an argument to dbt deps
.deps
(old, installs packages defined in lock file with fall back to packages.yml
)
lock
to create a lock file.packages.yml
. If they do, re-create the lock file using lock
.dbt_packages
directory (or override in config).
In the happy path scenario, most invocations of dbt deps
skip directly here after verifying that the lock file is in sync with packages.yml
as outlined in step 2 above. Since the expensive resolution steps are already handled by lock
, this step should run really fast since it is a raw downloader.It's getting late here, and this comment is already really long, so I'll wrap up for now. Hopefully, this is enough to get you unblocked.
There are two things I'll cover in a follow-up post (or @jtcohen6 can weigh in):
Once again, thank you so much for your willingness to help here and look forward to your contributions!
@aranke thank you again so much for providing all this info and context when you did. it was extremely helpful and definitely allowed me to proceed forward. i'm hoping to submit a PR soon with changes 🎉 .
one question i came across when working with deps code is related to the dbt_packages/
directory. simply put, should the packages installed in dbt_packages/
persist regardless of what's in packages.yml
/packages-lock.yml
?
lets say i have 5 packages listed in my packages.yml
that looks like this:
packages:
- package: dbt-labs/dbt_utils
version: 0.8.4
- git: "https://github.com/splitgraph/ab2ft_stripe"
revision: 0.1.0
- package: dbt-labs/dbt_utils
version: 0.8.4
- git: "https://github.com/google/fhir-dbt-analytics"
revision: 386e8430a4b9735466c90c3313ae67c31dbef58f
- package: everpeace/dbt_models_metadata
version: 0.1.0
and i decide that i only want my project's packages.yml
to look like this going forward:
packages:
- package: dbt-labs/dbt_utils
version: 1.0.0
should we wipe dbt_packages/
and only installed what's defined to also stay in sync with what's going to be defined in the packages.yml
and the new packages-lock.yml
? currently, any installed packages throughout a project's lifetime persist in the dbt_packages/
directory, regardless of what's defined in packages.yml
.
hope that makes sense - let me know if you have any comments/questions!
@justbldwn Good catch, yes, we should make sure that the state in packages-install-path
(dbt_packages/
is the default) lines up with packages.yml
and package-lock.yml
.
For now, this can take the form of wiping and recreating the directory, but in the future we should do the following instead:
packages-install-path
.package-lock.yml
.Furthermore, I think we should call the file package-lock.yml
since the article (lock) is singular and follows convention from npm
.
PR submitted #6735 and marked ready for review! there is a lot of context and info included in the PR summary. not sure if discussion should continue here or there, but happy to help whichever way i can 👍
@aranke @jtcohen6 @dbeatty10 @dave-connors-3
Will add more details about the later half soon
Hello @ChenyuLInx @justbldwn @aranke We're in a bit of an awkward situation where PR #8408 (available in 1.7.0rc1) broke #5374 (fixed with PR #8322 by me)
Now the problem is with the method _create_packages_yml_entry()
where we don't take into account the subdirectory
for a git repository and also package.name
is used but had been modified in core/dbt/deps/git.py
Now that I understand what you're trying to do here and there might be something I can do about the name
property in the previous file (I can work on a fix but the name corresponds to either the git repo or git repo + subdirectory if any) but we also need to take into account the subdirectory in the lock file in #6735
The lock file created doesn't respect the packages.yaml:
packages:
- package: "calogica/dbt_date"
version: [ ">=0.5.4", "<0.8.0" ]
- package: "dbt-labs/dbt_utils"
version: [ ">=1.0.0", "<2.0.0" ]
- git: "git@github.com:acme/git-repo.git"
subdirectory: "dbt-source-pkg-1"
revision: "dbt-source-pkg-1-v0.1.12"
- git: "git@github.com:acme/git-repo.git"
subdirectory: "dbt-source-pkg-2"
revision: "dbt-source-pkg-2-v0.1.2"
It will instead create the following lock file
❯ cat package-lock.yml
packages:
- package: calogica/dbt_date
version: 0.7.2
- package: dbt-labs/dbt_utils
version: 1.1.1
- git: git@github.com:acme/git-repo.git/dbt-source-pkg-1
revision: dbt-source-pkg-1-v0.1.12
- git: git@github.com:acme/git-repo.git/dbt-source-pkg-2
revision: dbt-source-pkg-1-v0.1.2
sha1_hash: c42f2<redacted>f258
Resulting in a problem when executing dbt deps
since git@github.com:acme/git-repo.git/dbt-source-pkg-1
is not a valid repository name
@philippeboyd, I am using http not ssh for the git repos. These are still resolving and cloning properly even with the subdirectory on the end. I just needed to add the subdirectory line into the lock file so dbt deps knew to find the dbt_project.yml
. Not sure if this helps.
Hi @ryan-pip, of course if you manually modify the lock file to add the subdirectory at the proper place it will work but it doesn't fix the problem that I mentioned in my previous post.
packages:
- git: git@github.com:philippeboyd/dbt-monorepo.git
subdirectory: dbt-utils-main
revision: v0.1.0
- git: git@github.com:philippeboyd/dbt-monorepo.git
subdirectory: dbt-date-main
revision: v0.1.0
sha1_hash: d3336208deda31c19fe1230443c19f7000d3e40e
These are still resolving and cloning properly even with the subdirectory on the end.
Doubt. See my example below.
To replicate the issue easily, I created a repository https://github.com/philippeboyd/dbt-monorepo.git with dbt-date
and dbt-utils
as the 2 subdirectories in that mono repo.
I tried both the following (git and https) and got the same result
packages:
- git: "git@github.com:philippeboyd/dbt-monorepo.git"
subdirectory: "dbt-utils-main"
revision: "v0.1.0"
- git: "git@github.com:philippeboyd/dbt-monorepo.git"
subdirectory: "dbt-date-main"
revision: "v0.1.0"
packages:
- git: "https://github.com/philippeboyd/dbt-monorepo.git"
subdirectory: "dbt-utils-main"
revision: "v0.1.0"
- git: "https://github.com/philippeboyd/dbt-monorepo.git"
subdirectory: "dbt-date-main"
revision: "v0.1.0"
Respective package-lock.yml
files
packages:
- git: git@github.com:philippeboyd/dbt-monorepo.git/dbt-utils-main
revision: v0.1.0
- git: git@github.com:philippeboyd/dbt-monorepo.git/dbt-date-main
revision: v0.1.0
sha1_hash: d3336208deda31c19fe1230443c19f7000d3e40e
packages:
- git: https://github.com/philippeboyd/dbt-monorepo.git/dbt-utils-main
revision: v0.1.0
- git: https://github.com/philippeboyd/dbt-monorepo.git/dbt-date-main
revision: v0.1.0
sha1_hash: 140e9796f695ebcf548343394d4aa90c3480f477
Both will result in the respective error
❯ dbt deps
02:58:44 Running with dbt=1.7.0-rc1
02:58:51 Updating lock file in file path: /tmp/tmp.lIjWExOoUt/tester/package-lock.yml
02:58:51 Installing git@github.com:philippeboyd/dbt-monorepo.git/dbt-utils-main
02:58:51 Encountered an error:
Internal Error
Error checking out spec='None' for repo git@github.com:philippeboyd/dbt-monorepo.git/dbt-utils-main
Cloning into 'cbe032b9b9b68984ccc507ae7f57690c'...
fatal: remote error:
philippeboyd/dbt-monorepo.git/dbt-utils-main is not a valid repository name
Visit https://support.github.com/ for help
❯ dbt deps
02:59:22 Running with dbt=1.7.0-rc1
02:59:25 Updating lock file in file path: /tmp/tmp.lIjWExOoUt/tester/package-lock.yml
02:59:25 Installing https://github.com/philippeboyd/dbt-monorepo.git/dbt-utils-main
02:59:25 Encountered an error:
Internal Error
Error checking out spec='None' for repo https://github.com/philippeboyd/dbt-monorepo.git/dbt-utils-main
Cloning into 'fded50563e6648f3ddd7ba6d2bf6fc14'...
remote: Not Found
fatal: repository 'https://github.com/philippeboyd/dbt-monorepo.git/dbt-utils-main/' not found
Now if I install it with prior version 1.7.0-b2
, it works
❯ dbt deps
03:07:08 Running with dbt=1.7.0-b2
03:07:15 Installing git@github.com:philippeboyd/dbt-monorepo.git/dbt-utils-main
03:07:16 Installed from revision v0.1.0
03:07:16 and subdirectory dbt-utils-main
03:07:16 Installing git@github.com:philippeboyd/dbt-monorepo.git/dbt-date-main
03:07:17 Installed from revision v0.1.0
03:07:17 and subdirectory dbt-date-main
@philippeboyd Sorry for seeing this late. And thanks a lot for making that public repo and make this easy to reproduce I looked through the changes, and seems that your change with formatting the subdirectories into name is the right move. I am gonna rework the deps lock piece to make sure the proper lock file is generated for this situation. WIP at https://github.com/dbt-labs/dbt-core/pull/9019
Hi all. Before I open a bug as a separate issue, I want to run it past the people collaborating on this issue and see if anyone disagrees with my suggestion?
package-lock.yml
doesn't pass yamllint
because it doesn't have the ---
document separator at the top, and the list within packages:
isn't indented.
We'd like it to pass yamllint
. As a workaround we're fixing it before committing.
Thoughts?
Is this your first time submitting a feature request?
Describe the feature
Problem
It's hard to determine the exact package versions that were installed during the
dbt deps
process.Multi-part proposal
We should consider a feature with three phases, each subsequent one being optional:
dbt deps
, write a file with the version of each of the installed packagespackages.json
.gitignore
'd directory (dbt_packages
,target
, or something else)pip freeze
orpip-tools compile
dbt deps
automatically as part of every relevant sub-commanddbt deps
such a delightful and pleasurable experience, why not include it within every sub-command?The advantage provided by the 1st phase is that users (and systems!) would know the exact package versions that were installed during the
dbt deps
process.I'm initially keeping these three phases together to preserve their common context, but we can split these out into separate issues at any time.
The 1st phase described above is actionable right now, but by the time we get to the 3rd one, we're definitely in speculative territory.
Describe alternatives you've considered
Currently, all dbt packages have a required
version
that is specified withindbt_project.yml
. The problem is that this field has been vestigial its entire lifespan, and it can deviate (and has!) from its associate git tag version numbers that are used by the dbt Package Hub. Furthermore, the implementation of https://github.com/dbt-labs/dbt-core/issues/6603 will remove thatversion
. So we can't count on it.In my view, the best things to consider are all the sub-alternatives and design decisions within the 1st phase proposal:
packages.json
vs. another name)dbt_packages
vs.target
vs. something else)version
field withindbt_project.yaml
that is guaranteed not to disagree with any associated git tags...Who will this benefit?
This could benefit multiple types of consumers:
packages.yml
for repeatable buildsAre you interested in contributing this feature?
Can serve in a supporting role to whoever else picks this up
Anything else?
No response