dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.54k stars 1.58k forks source link

[CT-2876] [Feature] add possibility to copy dbt local packages instead of make it symlink #8223

Open iamtodor opened 1 year ago

iamtodor commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

Hello team,

This is a feature request that comes from the following issues/discussions:

In this piece of code https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/deps/local.py#L62

 if can_create_symlink: 
     fire_event(DepsCreatingLocalSymlink())
     system.make_symlink(src_path, dest_path)
 else: 
     fire_event(DepsSymlinkNotAvailable())
     shutil.copytree(src_path, dest_path)

we are evaluating whether symlink can be setuped. In case the OS does not support symlink (i.e. Windows) we do a deep copy.

We would like to be able to do a dbt local packages deep copy on Linux machines, for instance.

The reasons for this aim are:

Describe alternatives you've considered

As a workaround, I had to write custom Python code, that iterates thru all the packages we have, find the symlink, and make a deep copy of all the content.

Who will this benefit?

In my humble opinion, at least, GCP users would benefit a lot

Are you interested in contributing this feature?

There is a chance, with all the context and pitfalls to be provided

Anything else?

Probably, no. However, if I miss something please feel free to ask all the questions and raise all the concerns you might be having

dbeatty10 commented 1 year ago

Thanks for suggesting this idea @iamtodor !

We always want to be careful with Chesterton's fence, and I don't know the original rationale that went into the decision to utilize symlinks for local packages in https://github.com/dbt-labs/dbt-core/commit/abdcdefb5f65388105bb7c93552d9151f2f0d98a.

But I'm guessing that it was to avoid duplicating data. If that were the case, then it seems totally fine to me to replace it with a deep copy instead since it wouldn't take up any more additional space than if it had been installed from a remote source. In fact, doing a deep copy seems like it would just bring local packages in line with the way other install methods are written to the file system so it seems safe rather than risky.

I'm wondering if using a deep copy might even be more safe when using dbt clean. I might be misremembering, but I seem to recall accidentally wiping out a local package that way a couple years ago.

I'm going to label this as help_wanted if you would be interested in contributing a PR.

iamtodor commented 1 year ago

Thanks for suggesting this idea @iamtodor !

We always want to be careful with Chesterton's fence, and I don't know the original rationale that went into the decision to utilize symlinks for local packages in abdcdef and

But I'm guessing that it was to avoid duplicating data. If that were the case, then it seems totally fine to me to replace it with a deep copy instead since it wouldn't take up any more additional space than if it had been installed from a remote source. In fact, doing a deep copy seems like it would just bring local packages in line with the way other install methods are written to the file system so it seems safe rather than risky.

I'm wondering if using a deep copy might even be more safe when using dbt clean. I might be misremembering, but I seem to recall accidentally wiping out a local package that way a couple years ago.

I'm going to label this as help_wanted if you would be interested in contributing a PR.

Hello @dbeatty10

Thank you for the reply! I would like to clarify that I do not propose to remove this logic. What I propose is to add the opportunity to control the logic: let's say by default DBT creates symlinks. However, I am able to change this default behavior to install packages with shutil.copytree()

Does it seem reasonable to you?

dbeatty10 commented 1 year ago

I would like to clarify that I do not propose to remove this logic. What I propose is to add the opportunity to control the logic: let's say by default DBT creates symlinks. However, I am able to change this default behavior to install packages with shutil.copytree()

Does it seem reasonable to you?

https://github.com/dbt-labs/dbt-core/pull/7447 proposes to just try creating a symlink and fall back to the deep copy in the case that it doesn't work (i.e., raises an error because it isn't supported on the underlying system).

This seems like it would solve the problem without the user needing to add any configuration. Would that approach work for your use case, @iamtodor?

iamtodor commented 1 year ago

I would like to clarify that I do not propose to remove this logic. What I propose is to add the opportunity to control the logic: let's say by default DBT creates symlinks. However, I am able to change this default behavior to install packages with shutil.copytree() Does it seem reasonable to you?

7447 proposes to just try creating a symlink and fall back to the deep copy in the case that it doesn't work (i.e., raises an error because it isn't supported on the underlying system).

This seems like it would solve the problem without the user needing to add any configuration. Would that approach work for your use case, @iamtodor?

Thank you @dbeatty10 for keeping your attention to this issue. The proposed way wouldn't work for us due to the fact we run dbt deps while running Jenkins CI/CD that deploys code to GCP Composer.

So for us would be really convenient to have the configuration option that we could tweak

iamtodor commented 1 year ago

@gshank may I ask you to re-open the issue? As I have explicitly confirmed that solution coming from #7447 would not fit our needs. CCing @dbeatty10

iamtodor commented 1 year ago

@gshank @dbeatty10 Our Jenkins CI/CD OS allows us to create the symlink, hence #7447 is not applicable.

The issue is while doing copy from the Jenkins machine to GCS. I pointed out these points in my very first message:

The reasons for this aim are:

Please, re-open the issue as it still exists. These are 2 very different use cases. Shall you have more concerns or questions please let me know.

dbeatty10 commented 1 year ago

Re-opening per request.

Ideally, we'd be able to implement a solution that is both simple for a user and also simple from a code maintenance perspective. To achieve both of those simultaneously, I'm attracted to just removing the symlinks altogether. Doing so would duplicate the code for installed packages, but dbt packages don't tend to take up that much space.

iamtodor commented 1 year ago

I'm attracted to just removing the symlinks altogether. Doing so would duplicate the code for installed packages

This is exactly what I had on my mind, but I thought it might be not the very best possible solution, that's why I didn't mention and presented it to you.

Usually, we write packages quite slim, hence I do not see any potential pitfalls in this particular implementation.

dbeatty10 commented 1 year ago

@graciegoheen labeling this as Refinement since I want to get your take on the following proposal:

See discussion above for more details and justification. Assuming you will tap in others to weigh in as-needed 🙏 If desired, I can do a more detailed write-up of pros/cons.

jaklan commented 5 months ago

@dbeatty10 I have found this issue a bit by coincidence, but I was terrified by the proposal of removing symlinks for local packages... It's absolutely critical functionality when working with packages in monorepo setup, when you always want to refer to the state of packages at the current commit and not run dbt deps each time you rebase your branch. Not to mention it's just a de facto standard that local packages can be installed as symlinks* by package installers (pip, npm etc.).

To resolve the issue and not break any workflows - just introduce the similar option like in pip - regular installs vs editable installs:
https://pip.pypa.io/en/stable/topics/local-project-installs/#local-project-installs

*To be more precise - in the Python case it's a bit more sophisticated than relying literally on symlinks: https://stackoverflow.com/a/77455221/6534668
and that's also the reason why editable installs work smoothly on Windows. It would be great to see a similar mechanism in dbt as we also encounter issues with our Windows developers working with local packages - if symlinks creation fails, shutil.copytree is used and changes in packages are not reflected without re-running dbt deps.

That's also why I think the current fallback approach should be completely removed - there's a significant difference between symlinking and copying files, these two mechanisms simply shouldn't be mixed when running exactly the same command. Here I would again refer to Python editable installs, when you explicitly decide which mechanism you want to use.

So to sum up: