elementary-data / elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
https://www.elementary-data.com/
Apache License 2.0

Generate report fails on Databricks Shared cluster #1584

Open thijs-nijhuis opened 4 months ago

thijs-nijhuis commented 4 months ago

Describe the bug

Running the 'edr report' command from a notebook fails with a permission error during 'dbt deps' when elementary is installed as a cluster library (so it is installed on startup and persisted across sessions) and the cluster is in 'shared' access mode. If the cluster is in 'single user' access mode, the same command succeeds.

To Reproduce

  1. Create an all purpose compute cluster with access mode 'shared'
  2. Install "elementary-data==0.15.1" from PyPI on it
  3. Connect to a GitHub repo that contains a dbt project, or upload one to your workspace
  4. Create a new Notebook with only one Python cell that contains this command:
    %sh
    edr report --profiles-dir "/Workspace/Repos/<username>/<repo_name>/<path_to_project_folder>" --project-dir "/Workspace/Repos/<username>/<repo_name>/<path_to_project_folder>" --target-path "/Workspace/Repos/<username>/<repo_name>/<path_to_a_folder>" --update-dbt-package false
  5. Attach the notebook to the created cluster and run the cell

Expected behavior

I expected the report to be generated at the provided location, just as it is when using a cluster in 'single user' mode.

Screenshots

    ________                          __                  
   / ____/ /__  ____ ___  ___  ____  / /_____ ________  __
  / __/ / / _ \/ __ `__ \/ _ \/ __ \/ __/ __ `/ ___/ / / /
 / /___/ /  __/ / / / / /  __/ / / / /_/ /_/ / /  / /_/ / 
/_____/_/\___/_/ /_/ /_/\___/_/ /_/\__/\__,_/_/   \__, /  
                                                 /____/   

Any feedback and suggestions are welcomed! join our community here - https://bit.ly/slack-elementary

2024-07-03 15:09:33 — INFO — Running with edr=0.15.1
2024-07-03 15:09:34 — INFO — Installing packages for edr internal dbt package...
2024-07-03 15:09:34 — INFO — Running dbt --log-format json deps --project-dir /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/monitor/dbt_project --profiles-dir /Workspace/Repos/<username>/<repo_name>/<path_to_project_folder>
2024-07-03 15:09:40 — INFO — Running with dbt=1.8.3
2024-07-03 15:09:40 — INFO — Encountered an error:
[Errno 13] Permission denied: '/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/monitor/dbt_project/package-lock.yml'
2024-07-03 15:09:40 — INFO — Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbt/cli/requires.py", line 138, in wrapper
    result, success = func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbt/cli/requires.py", line 101, in wrapper
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbt/cli/requires.py", line 201, in wrapper
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbt/cli/requires.py", line 247, in wrapper
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbt/cli/main.py", line 447, in deps
    results = task.run()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbt/task/deps.py", line 217, in run
    self.lock()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbt/task/deps.py", line 204, in lock
    with open(lock_filepath, "w") as lock_obj:
PermissionError: [Errno 13] Permission denied: '/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/monitor/dbt_project/package-lock.yml'

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/clients/dbt/dbt_runner.py", line 88, in _run_command
    result = subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['dbt', '--log-format', 'json', 'deps', '--project-dir', '/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/monitor/dbt_project', '--profiles-dir', '/Workspace/Repos/<username>/<repo_name>/<path_to_project_folder>']' returned non-zero exit status 2.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/edr", line 8, in <module>
    sys.exit(cli())
  File "/databricks/python/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/databricks/python/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/cli/cli.py", line 67, in invoke
    return super().invoke(ctx)
  File "/databricks/python/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/databricks/python/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/databricks/python/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/databricks/python/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/monitor/cli.py", line 442, in report
    data_monitoring = DataMonitoringReport(
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/monitor/data_monitoring/report/data_monitoring_report.py", line 42, in __init__
    super().__init__(
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/monitor/data_monitoring/data_monitoring.py", line 35, in __init__
    self.internal_dbt_runner = self._init_internal_dbt_runner()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/monitor/data_monitoring/data_monitoring.py", line 61, in _init_internal_dbt_runner
    internal_dbt_runner = DbtRunner(
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/clients/dbt/dbt_runner.py", line 48, in __init__
    self._run_deps_if_needed()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/clients/dbt/dbt_runner.py", line 318, in _run_deps_if_needed
    self.deps()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/clients/dbt/dbt_runner.py", line 116, in deps
    success, _ = self._run_command(
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/clients/dbt/dbt_runner.py", line 99, in _run_command
    raise DbtCommandError(err, command_args, logs=logs)
elementary.exceptions.exceptions.DbtCommandError: Failed to run dbt command.
Encountered an error:
[Errno 13] Permission denied: '/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/monitor/dbt_project/package-lock.yml'
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbt/cli/requires.py", line 138, in wrapper
    result, success = func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbt/cli/requires.py", line 101, in wrapper
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbt/cli/requires.py", line 201, in wrapper
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbt/cli/requires.py", line 247, in wrapper
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbt/cli/main.py", line 447, in deps
    results = task.run()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbt/task/deps.py", line 217, in run
    self.lock()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbt/task/deps.py", line 204, in lock
    with open(lock_filepath, "w") as lock_obj:
PermissionError: [Errno 13] Permission denied: '/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/elementary/monitor/dbt_project/package-lock.yml'

Environment (please complete the following information):

Additional context

I did a bit of debugging and testing. When the command runs, it seems to check whether the dbt packages of both my project and elementary's internal dbt project are installed. The first check succeeds because my project is in a writable location. The second fails because dbt tries to create a file called package-lock.yml in the internal dbt project, inside the elementary package folder. This folder is not writable on a shared cluster (I am actually surprised that it IS writable on a single user cluster).
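For anyone who wants to confirm the same cause on their own cluster, here is a minimal sketch that checks whether the internal dbt project folder is writable for the current user. The relative path monitor/dbt_project is taken from the traceback above; the helper names find_internal_dbt_project and is_writable_dir are made up for this example and are not part of elementary.

```python
import importlib.util
import os
from pathlib import Path
from typing import Optional


def find_internal_dbt_project(package: str = "elementary") -> Optional[Path]:
    """Return <package>/monitor/dbt_project, or None if the package is not installed.

    The monitor/dbt_project layout is an assumption based on the traceback.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    package_dir = Path(list(spec.submodule_search_locations)[0])
    return package_dir / "monitor" / "dbt_project"


def is_writable_dir(path: Path) -> bool:
    """True if `path` is a directory in which the current user can create files."""
    return path.is_dir() and os.access(path, os.W_OK | os.X_OK)


if __name__ == "__main__":
    project = find_internal_dbt_project()
    if project is None:
        print("elementary is not installed in this environment")
    else:
        print(f"{project}: writable={is_writable_dir(project)}")
```

If this reports writable=False, 'dbt deps' cannot create package-lock.yml in that folder and will presumably fail in the same way as in the log above.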

I also tried installing elementary as part of the notebook instead of on cluster startup, like so: %pip install elementary-data==0.15.1. After restarting the Python kernel and running the same command, it DOES succeed. This is because the elementary package is then installed in a location that is writable for the logged-in user. Unfortunately this is not an option for us, as we run our project as a wheel and both elementary and dbt-databricks are installed as part of that wheel.

Maybe the dbt_packages could be pre-installed when installing elementary? That way 'dbt deps' would not need to write anything, and it would also speed up the process a bit. It might still fail when dbt tries to create a target folder, though. Alternatively, perhaps all writable locations (target and dbt_packages) could be made configurable as part of the edr command, just like the location of the report output.
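To illustrate the second idea, here is a rough sketch of working from a writable copy of the internal dbt project. This is not how elementary works today, and edr has no option (as far as this issue shows) to point at such a copy; the function names copy_to_writable_dir and dbt_deps_command are hypothetical. The dbt arguments are the ones visible in the log above.

```python
import shutil
import stat
import tempfile
from pathlib import Path
from typing import List


def copy_to_writable_dir(read_only_project: Path, prefix: str = "edr_dbt_project_") -> Path:
    """Copy a dbt project into a fresh temp directory and return the copy's path."""
    target = Path(tempfile.mkdtemp(prefix=prefix)) / read_only_project.name
    shutil.copytree(read_only_project, target)
    # copytree also copies permission bits, so make sure the owner can write
    # to every copied file and folder.
    for path in [target, *target.rglob("*")]:
        path.chmod(path.stat().st_mode | stat.S_IWUSR)
    return target


def dbt_deps_command(project_dir: Path, profiles_dir: Path) -> List[str]:
    """Build the same 'dbt deps' invocation that appears in the log output."""
    return [
        "dbt", "--log-format", "json", "deps",
        "--project-dir", str(project_dir),
        "--profiles-dir", str(profiles_dir),
    ]
```

With something like this, 'dbt deps' would write package-lock.yml and dbt_packages into the copy instead of into site-packages. The cost is one extra copy of the project per run, and edr itself would still need a way to use the copied location.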

Would you be willing to contribute a fix for this issue? Sure.

noel commented 2 months ago

Any update on this? Is there a workaround?

alxsbn commented 5 days ago

@thijs-nijhuis @noel I see the same behavior with send-report on dbt 1.8.7 (Databricks too) when I try to execute the command within a container (no elevation). Did you find a workaround?

noel commented 5 days ago

No, I forked the repo and added the missing file. Wish they would add it so we don't need a fork.