dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[CT-3507] [Bug] Installing packages to cache location fails with "Cannot call rmtree on a symbolic link" #9304

Open moltar opened 10 months ago

moltar commented 10 months ago

Is this a new bug in dbt-core?

Current Behavior

Running dbt deps in AWS CodeBuild results in this error:

Cannot call rmtree on a symbolic link

On this line: https://github.com/dbt-labs/dbt-core/blame/2401600e57048dd56818f7293abed96ffd510ac9/core/dbt/task/deps.py#L231

Notably, we use CodeBuild caching for dbt packages.

Expected Behavior

dbt deps should complete without failing.

Steps To Reproduce

  1. Run dbt 1.7.0 in CodeBuild
  2. Store packages in cache (/root/.cache/dbt_packages)
  3. Run dbt deps

Relevant log output

[
  {
    "data": {
      "log_version": 3,
      "version": "=1.7.3"
    },
    "info": {
      "category": "",
      "code": "A001",
      "extra": {},
      "invocation_id": "454cd6b0-139a-47fc-8d90-52dd3e52bf5a",
      "level": "info",
      "msg": "Running with dbt=1.7.3",
      "name": "MainReportVersion",
      "pid": 231,
      "thread": "MainThread",
      "ts": "2023-12-19T11:02:05.648499Z"
    }
  },
  {
    "data": {
      "exc": "Cannot call rmtree on a symbolic link"
    },
    "info": {
      "category": "",
      "code": "Z002",
      "extra": {},
      "invocation_id": "454cd6b0-139a-47fc-8d90-52dd3e52bf5a",
      "level": "error",
      "msg": "Encountered an error:\nCannot call rmtree on a symbolic link",
      "name": "MainEncounteredError",
      "pid": 231,
      "thread": "MainThread",
      "ts": "2023-12-19T11:02:06.100657Z"
    }
  },
  {
    "data": {
      "stack_trace": "Traceback (most recent call last):\n  File \"/root/.cache/venv/lib/python3.11/site-packages/dbt/cli/requires.py\", line 90, in wrapper\n    result, success = func(*args, **kwargs)\n                      ^^^^^^^^^^^^^^^^^^^^^\n  File \"/root/.cache/venv/lib/python3.11/site-packages/dbt/cli/requires.py\", line 75, in wrapper\n    return func(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File \"/root/.cache/venv/lib/python3.11/site-packages/dbt/cli/requires.py\", line 151, in wrapper\n    return func(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File \"/root/.cache/venv/lib/python3.11/site-packages/dbt/cli/requires.py\", line 197, in wrapper\n    return func(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File \"/root/.cache/venv/lib/python3.11/site-packages/dbt/cli/main.py\", line 492, in deps\n    results = task.run()\n              ^^^^^^^^^^\n  File \"/root/.cache/venv/lib/python3.11/site-packages/dbt/task/deps.py\", line 228, in run\n    system.rmtree(self.project.packages_install_path)\n  File \"/root/.cache/venv/lib/python3.11/site-packages/dbt/clients/system.py\", line 570, in rmtree\n    return shutil.rmtree(path, onerror=chmod_and_retry)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/root/.pyenv/versions/3.11.6/lib/python3.11/shutil.py\", line 744, in rmtree\n    onerror(os.path.islink, path, sys.exc_info())\n  File \"/root/.pyenv/versions/3.11.6/lib/python3.11/shutil.py\", line 742, in rmtree\n    raise OSError(\"Cannot call rmtree on a symbolic link\")\nOSError: Cannot call rmtree on a symbolic link\n"
    },
    "info": {
      "category": "",
      "code": "Z003",
      "extra": {},
      "invocation_id": "454cd6b0-139a-47fc-8d90-52dd3e52bf5a",
      "level": "error",
      "msg": "Traceback (most recent call last):\n  File \"/root/.cache/venv/lib/python3.11/site-packages/dbt/cli/requires.py\", line 90, in wrapper\n    result, success = func(*args, **kwargs)\n                      ^^^^^^^^^^^^^^^^^^^^^\n  File \"/root/.cache/venv/lib/python3.11/site-packages/dbt/cli/requires.py\", line 75, in wrapper\n    return func(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File \"/root/.cache/venv/lib/python3.11/site-packages/dbt/cli/requires.py\", line 151, in wrapper\n    return func(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File \"/root/.cache/venv/lib/python3.11/site-packages/dbt/cli/requires.py\", line 197, in wrapper\n    return func(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File \"/root/.cache/venv/lib/python3.11/site-packages/dbt/cli/main.py\", line 492, in deps\n    results = task.run()\n              ^^^^^^^^^^\n  File \"/root/.cache/venv/lib/python3.11/site-packages/dbt/task/deps.py\", line 228, in run\n    system.rmtree(self.project.packages_install_path)\n  File \"/root/.cache/venv/lib/python3.11/site-packages/dbt/clients/system.py\", line 570, in rmtree\n    return shutil.rmtree(path, onerror=chmod_and_retry)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/root/.pyenv/versions/3.11.6/lib/python3.11/shutil.py\", line 744, in rmtree\n    onerror(os.path.islink, path, sys.exc_info())\n  File \"/root/.pyenv/versions/3.11.6/lib/python3.11/shutil.py\", line 742, in rmtree\n    raise OSError(\"Cannot call rmtree on a symbolic link\")\nOSError: Cannot call rmtree on a symbolic link\n",
      "name": "MainStackTrace",
      "pid": 231,
      "thread": "MainThread",
      "ts": "2023-12-19T11:02:06.178555Z"
    }
  }
]

Environment

- OS: Ubuntu 22.04 (`aws/codebuild/standard:7.0`)
- Python: 3.11
- dbt: 1.7.3

Which database adapter are you using with dbt?

No response

Additional Context

Using cached location for dbt packages.

I tried narrowing it down, and the issue first appears in the 1.7.0 release.

dbeatty10 commented 10 months ago

Thanks for raising this and providing a link back to the suspected code @moltar

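Three questions:

  1. How is the symlink being created?
  2. Are you using the packages-install-path within dbt_project.yml?
  3. Does the "Possible workaround" below work for you?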

Reprex

I got the same error as you when I did the following in a zsh shell in macOS:

  1. Create a separate directory to install dbt packages within and create a symbolic link to it:
    mkdir -p .cache/dbt_packages
    ln -s .cache/dbt_packages dbt_packages
  2. Add dbt_packages as the packages-install-path within dbt_project.yml
    packages-install-path: dbt_packages
  3. Install dependencies

    dbt deps
      File "/Users/dbeatty/projects/environments/postgres_1.7/lib/python3.10/site-packages/dbt/task/deps.py", line 228, in run
        system.rmtree(self.project.packages_install_path)
      File "/Users/dbeatty/projects/environments/postgres_1.7/lib/python3.10/site-packages/dbt/clients/system.py", line 570, in rmtree
        return shutil.rmtree(path, onerror=chmod_and_retry)
      File "/Users/dbeatty/.pyenv/versions/3.10.10/lib/python3.10/shutil.py", line 737, in rmtree
        onerror(os.path.islink, path, sys.exc_info())
      File "/Users/dbeatty/.pyenv/versions/3.10.10/lib/python3.10/shutil.py", line 735, in rmtree
        raise OSError("Cannot call rmtree on a symbolic link")
    OSError: Cannot call rmtree on a symbolic link
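
The same error can be reproduced without dbt. The following is a minimal sketch using only the Python standard library, assuming a POSIX system where symlinks can be created. The directory names mirror the steps above, and the helper name is made up for illustration:

```python
import os
import shutil
import tempfile


def rmtree_on_symlink_error() -> str:
    """Call shutil.rmtree on a symlinked directory and return the error text."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, ".cache", "dbt_packages")
        link = os.path.join(tmp, "dbt_packages")
        os.makedirs(target)
        os.symlink(target, link)  # same layout as `ln -s` in step 1
        try:
            shutil.rmtree(link)  # refuses because `link` itself is a symlink
        except OSError as exc:
            return str(exc)
    return ""


print(rmtree_on_symlink_error())
```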

Possible workaround

But it worked for me when I did the following instead:

  1. Add .cache/dbt_packages as the packages-install-path within dbt_project.yml
    packages-install-path: .cache/dbt_packages

moltar commented 10 months ago

How is the symlink being created?

I am not sure how, or whether it is even a symlink. It is a directory provided by CodeBuild that is cached across runs. How they provision this directory is unknown to me.

As part of the build spec, I specify the cache dir:

cache: {
  paths: [
    // cache dbt packages
    "/root/.cache/dbt_packages",
  ],
},

Are you using the packages-install-path within dbt_project.yml?

Yes, set via env:

packages-install-path: "{{ env_var('DBT_PACKAGES_INSTALL_PATH', 'dbt_packages') }}"
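
For readers unfamiliar with env_var: with a second argument it falls back to that default when the variable is unset. A rough Python equivalent of the lookup is sketched below; the helper name is hypothetical and is not part of dbt:

```python
import os


def packages_install_path(environ=os.environ) -> str:
    # Hypothetical helper mirroring
    # env_var('DBT_PACKAGES_INSTALL_PATH', 'dbt_packages'):
    # use the variable if it is set, otherwise fall back to the default.
    return environ.get("DBT_PACKAGES_INSTALL_PATH", "dbt_packages")


print(packages_install_path({}))
print(packages_install_path({"DBT_PACKAGES_INSTALL_PATH": "/root/.cache/dbt_packages"}))
```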

Does the "Possible workaround" below work for you?

So you are essentially suggesting adding another directory layer, so that when cleaning up, we do not try to remove the top-level link and only remove whatever is inside the directory?

I think it would work; I don't see a reason why not. For now, my workaround was simply to disable caching, but I will try yours.

moltar commented 10 months ago

To answer the linking question (partially), this is from the CodeBuild log:

[Container] 2023/12/19 15:21:43.569893 Moving to directory /codebuild/output/src2019796648/src
[Container] 2023/12/19 15:21:43.571322 Expanded cache path /root/.cache
[Container] 2023/12/19 15:21:50.655824 MkdirAll: /codebuild/local-cache/custom/ebc372a1c9f0ee32803d1ef5dc06a690f02a9133f92ecab5a21fa9c4bf851f2b/root/.cache
[Container] 2023/12/19 15:21:50.656107 Symlinking: /root/.cache => /codebuild/local-cache/custom/ebc372a1c9f0ee32803d1ef5dc06a690f02a9133f92ecab5a21fa9c4bf851f2b/root/.cache

This still does not fully explain how the symlinking is done, but at least the directory structure is clear.
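
If the symlink sits at the .cache level, as this log suggests, that would be consistent with the workaround helping: shutil.rmtree only refuses when the path it is handed is itself a link, not when a parent component is one. This is a sketch under that assumption, for a POSIX system. The temporary paths stand in for the CodeBuild ones and the function name is made up:

```python
import os
import shutil
import tempfile


def rmtree_below_symlink() -> bool:
    """Remove a real directory that is reached through a symlinked parent."""
    with tempfile.TemporaryDirectory() as tmp:
        real_cache = os.path.join(tmp, "local-cache")
        os.makedirs(os.path.join(real_cache, "dbt_packages"))

        cache_link = os.path.join(tmp, ".cache")
        os.symlink(real_cache, cache_link)  # .cache => local-cache

        packages = os.path.join(cache_link, "dbt_packages")
        shutil.rmtree(packages)  # final path component is a real directory

        # The link survives; only the directory below it is gone.
        return os.path.islink(cache_link) and not os.path.exists(packages)


print(rmtree_below_symlink())
```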

moltar commented 10 months ago

And that workaround does work, btw. Thanks! 🎉

dbeatty10 commented 10 months ago

And that workaround does work, btw. Thanks! 🎉

Did your implementation of the workaround actually reuse the installs that are cached across runs? Or did it skip the cache and create a new local directory named .cache/dbt_packages?

moltar commented 10 months ago

Because CodeBuild does not guarantee caching when using the local cache, I think it's hard to tell. It's "best effort": you only get a cache hit if you get lucky and are placed on the same machine ;)

dbeatty10 commented 10 months ago

Acceptance criteria

Implementation idea

One way to handle the case where packages-install-path is a symlink is to reuse the approach from here.

i.e., replace the code here with this instead:

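        dest_path = self.project.packages_install_path
        if system.path_exists(dest_path):
            if system.path_is_symlink(dest_path):
                # Remove only the link itself instead of calling rmtree on it.
                system.remove_file(dest_path)
            else:
                # A real directory: delete it recursively.
                system.rmdir(dest_path)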
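
The same branching can be tried outside dbt. The sketch below uses only the standard library in place of the dbt system helpers. The names remove_install_path and demo are hypothetical, and a POSIX system is assumed:

```python
import os
import shutil
import tempfile


def remove_install_path(dest_path: str) -> None:
    """Hypothetical stdlib stand-in for the snippet above."""
    if os.path.islink(dest_path):
        os.unlink(dest_path)  # drop the link, leave its target alone
    elif os.path.isdir(dest_path):
        shutil.rmtree(dest_path)  # real directory: remove recursively


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "cache_target")
        os.makedirs(target)
        link = os.path.join(tmp, "dbt_packages")
        os.symlink(target, link)

        remove_install_path(link)
        # (does the link still exist?, does the target still exist?)
        return os.path.lexists(link), os.path.isdir(target)


print(demo())
```

In this sketch, removing the link leaves the target directory untouched. Whether a later install would still land in the cached location depends on how the path is recreated afterwards, which the sketch does not cover.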