databricks / containers

Sample base images for Databricks Container Services
Apache License 2.0

standard:13.3-LTS image pushed on Feb 8 results in Py4JException on every pyspark notebook command #171

Closed kelseyfrancis closed 7 months ago

kelseyfrancis commented 7 months ago

The databricksruntime/standard:13.3-LTS image pushed on Feb 8 seems to cause the following error, raised from within Databricks runtime code, when executing any command in a pyspark notebook, even one as simple as print('hello') in the first code block:

Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 617, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/databricks/python_shell/dbruntime/pythonPathHook.py", line 118, in initStartingDirectory
    self._handle_sys_path_maybe_updated()
  File "/databricks/python_shell/dbruntime/pythonPathHook.py", line 90, in _handle_sys_path_maybe_updated
    self._restart_language_server_if_needed()
  File "/databricks/python_shell/dbruntime/pythonPathHook.py", line 85, in _restart_language_server_if_needed
    ls_manager.restart()
  File "/databricks/python_shell/dbruntime/lsp_backend/lsp_manager.py", line 348, in restart
    self.start()
  File "/databricks/python_shell/dbruntime/lsp_backend/lsp_manager.py", line 290, in start
    self.server_process = subprocess.Popen(
  File "/usr/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1863, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'pylsp'

Reverting to the previous image digest for that tag seems to fix this issue.
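The last frames of the traceback above show `subprocess.Popen` raising `FileNotFoundError` because no executable named `pylsp` could be resolved. The following minimal sketch reproduces that failure mode outside Databricks; the helper name `launch_error` and the `--help` argument are illustrative assumptions, not part of the runtime code.

```python
import subprocess


def launch_error(executable):
    """Try to start `executable`; return the FileNotFoundError if it is missing.

    Returns None when the process could be started. The name and the
    `--help` argument are hypothetical choices for this illustration.
    """
    try:
        proc = subprocess.Popen(
            [executable, "--help"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError as exc:
        # Same exception type as in the traceback: the bare name could not
        # be resolved to a file on the search path.
        return exc
    proc.wait()  # reap the child so it does not linger
    return None


if __name__ == "__main__":
    err = launch_error("pylsp")
    print("pylsp is launchable" if err is None else f"pylsp is missing: {err}")
```

Running this inside the affected container should report the missing executable if the image lacks `pylsp`, though the PATH seen by the notebook's Python shell may differ from that of an interactive shell.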

xinzhao-db commented 7 months ago

hmmm... @kelseyfrancis can you provide the previous image digest? thanks!

kelseyfrancis commented 7 months ago

docker.io/databricksruntime/standard:13.3-LTS@sha256:5d1547b9d21c9e80acc03e001533e842ebaf299f4d4b479809260112cc1ec72c
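A tag such as `13.3-LTS` can be re-pointed to a newer build, whereas the `@sha256:...` digest identifies one specific image, which is why the reference above pins the earlier push. As a rough sketch, the parts of such a reference can be separated like this; `split_image_reference` is a hypothetical helper, not a Docker or Databricks API, and it deliberately ignores validation.

```python
def split_image_reference(ref):
    """Split 'name[:tag][@digest]' into (name, tag, digest).

    Missing parts are returned as None. No validation is performed.
    """
    name, _, digest = ref.partition("@")
    digest = digest or None
    # A colon is a tag separator only when it comes after the last slash;
    # otherwise it may be a registry port, as in 'localhost:5000/img'.
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    if colon > last_slash:
        name, tag = name[:colon], name[colon + 1:]
    else:
        tag = None
    return name, tag, digest


if __name__ == "__main__":
    ref = (
        "docker.io/databricksruntime/standard:13.3-LTS@sha256:"
        "5d1547b9d21c9e80acc03e001533e842ebaf299f4d4b479809260112cc1ec72c"
    )
    print(split_image_reference(ref))
```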

xinzhao-db commented 7 months ago

We made some changes to the DCS base image, https://github.com/databricks/containers/commits/release-13.3-LTS/ubuntu, but I would be really surprised if they affected DBR so fundamentally, so I suspect the issue is caused by a change in DBR rather than by the DCS base image change.

I hit an issue downloading the image when creating a cluster with docker.io/databricksruntime/standard:13.3-LTS@sha256:5d1547b9d21c9e80acc03e001533e842ebaf299f4d4b479809260112cc1ec72c or databricksruntime/standard:13.3-LTS@sha256:5d1547b9d21c9e80acc03e001533e842ebaf299f4d4b479809260112cc1ec72c. Instead, I docker pulled the image and published it to standard-test:13.3-LTS. As expected, I hit the pylsp issue with this old image. In addition, I reset my repo back to 421c2c72d7f613a768fe4e3507180af96957a1b9 (4 months ago) and rebuilt the image, with the same result. So I am pretty sure it's a DBR issue and will reach out to the code owner to take a look.

Just to be 100% sure about my suspicion, @kelseyfrancis how did you test DBR with the old DCS base image?

kelseyfrancis commented 7 months ago

So I am pretty sure it's a DBR issue and will reach out to the code owner to take a look.

Thank you for looking into this! I think you are right – I'm now getting the same error again, on a cluster that was already running and working with the old image a few hours ago.

Just to be 100% sure about my suspicion, @kelseyfrancis how did you test DBR with the old DCS base image?

I tested in both cases with a custom image built FROM databricksruntime/standard:13.3-LTS with some small modifications rather than the base itself. I was able to use the custom image from a couple days ago that was based on @5d1547b9 instead of the current one, and it seemed to be working without issue – nothing changed in that image other than the base – but that must've been a red herring, as the pylsp issue has resurfaced.

xinzhao-db commented 7 months ago

Just an update: I tested the same DCS base image with the DBR image from last December, and it worked. I think it's due to an LSP change; I will contact the team.

kelseyfrancis commented 7 months ago

I think it's due to an LSP change

I think you're right again. A workaround that seems to be working now is to install pylsp in the custom image:

RUN /databricks/python3/bin/pip install python-lsp-server
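One way to check that the workaround took effect is to confirm that a `pylsp` executable can now be resolved inside the container. The sketch below searches the regular PATH plus any extra directories; `find_executable` is a hypothetical helper, and the assumption that pip places the script in `/databricks/python3/bin` is inferred from the `RUN` line above rather than confirmed by the thread.

```python
import os
import shutil


def find_executable(name, extra_dirs=()):
    """Return the full path of `name`, or None if it cannot be found.

    `extra_dirs` are searched before the directories listed in PATH.
    """
    search_path = os.pathsep.join([*extra_dirs, os.environ.get("PATH", "")])
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    # Assumed location: the bin directory of the pip used in the RUN line.
    location = find_executable("pylsp", ["/databricks/python3/bin"])
    print(location or "pylsp not found")
```

If this prints a path but notebooks still fail, the directory may not be on the PATH of the process that launches the language server, which this check cannot see.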

xinzhao-db commented 7 months ago

Thanks @kelseyfrancis for verifying!

alexandremoyrand commented 6 months ago

Hello, this issue is closed, but is there another one open on the DBR side?