databricks / databricks-sdk-py


[ISSUE] Inconsistent behavior of w.dbutils.fs.cp() between SDK and CLI #263

Open vadim opened 1 year ago

vadim commented 1 year ago

Description

Lack of parity and unexpected behavior when using w.dbutils.fs.cp()

Reproduction

from databricks.sdk import WorkspaceClient
import logging
import os
import subprocess
import sys

logging.basicConfig(stream=sys.stderr,
                    level=logging.INFO,
                    format='%(asctime)s [%(name)s][%(levelname)s] %(message)s')
logging.getLogger('databricks.sdk').setLevel(logging.DEBUG)

w = WorkspaceClient()

src_file = '/tmp/a.file'
dest_file = 'dbfs:/FileStore/a.file'

with open(src_file, 'w') as src:
    src.write('hello local world \n')

# remove destination on DBFS
w.dbutils.fs.rm(dest_file)

# attempt to copy from local machine to DBFS
w.dbutils.fs.cp(src_file, dest_file)

Expected behavior

I expect w.dbutils.fs.cp() to behave like the Databricks CLI command databricks fs cp src dest.

Debug Logs

2023-08-04 07:33:35,985 [databricks.sdk][INFO] loading DEFAULT profile from ~/.databrickscfg: host, token, jobs-api-version
2023-08-04 07:33:35,985 [databricks.sdk][DEBUG] Attempting to configure auth: pat
2023-08-04 07:33:36,393 [databricks.sdk][DEBUG] POST /api/2.0/dbfs/delete
> {
>   "path": "dbfs:/FileStore/a.file",
>   "recursive": false
> }
< 200 OK
< {}
2023-08-04 07:33:36,632 [databricks.sdk][DEBUG] GET /api/2.0/dbfs/get-status?path=/FileStore/a.file
< 404 Not Found
< {
<   "error_code": "RESOURCE_DOES_NOT_EXIST",
<   "message": "No file or directory exists on path /FileStore/a.file."
< }
2023-08-04 07:33:36,974 [databricks.sdk][DEBUG] GET /api/2.0/dbfs/get-status?path=/tmp/a.file
< 404 Not Found
< {
<   "error_code": "RESOURCE_DOES_NOT_EXIST",
<   "message": "No file or directory exists on path /tmp/a.file."
< }
2023-08-04 07:33:37,246 [databricks.sdk][DEBUG] GET /api/2.0/dbfs/get-status?path=/tmp/a.file
< 404 Not Found
< {
<   "error_code": "RESOURCE_DOES_NOT_EXIST",
<   "message": "No file or directory exists on path /tmp/a.file."
< }
Traceback (most recent call last):
  File "/Users/vadim.patsalo/sdk-test/main.py", line 24, in <module>
    w.dbutils.fs.cp(src_file, dest_file)
  File "/Users/vadim.patsalo/.pyenv/versions/3.10.6/lib/python3.10/site-packages/databricks/sdk/dbutils.py", line 43, in cp
    self._dbfs.copy(from_, to, recursive=recurse)
  File "/Users/vadim.patsalo/.pyenv/versions/3.10.6/lib/python3.10/site-packages/databricks/sdk/mixins/files.py", line 371, in copy
    with src.open(read=True) as reader:
  File "/Users/vadim.patsalo/.pyenv/versions/3.10.6/lib/python3.10/site-packages/databricks/sdk/mixins/files.py", line 295, in open
    return self._api.open(self.as_string, read=read, write=write, overwrite=overwrite)
  File "/Users/vadim.patsalo/.pyenv/versions/3.10.6/lib/python3.10/site-packages/databricks/sdk/mixins/files.py", line 315, in open
    return _DbfsIO(self, path, read=read, write=write, overwrite=overwrite)
  File "/Users/vadim.patsalo/.pyenv/versions/3.10.6/lib/python3.10/site-packages/databricks/sdk/mixins/files.py", line 37, in __init__
    if read: self._status = api.get_status(path)
  File "/Users/vadim.patsalo/.pyenv/versions/3.10.6/lib/python3.10/site-packages/databricks/sdk/service/files.py", line 301, in get_status
    json = self._api.do('GET', '/api/2.0/dbfs/get-status', query=query)
  File "/Users/vadim.patsalo/.pyenv/versions/3.10.6/lib/python3.10/site-packages/databricks/sdk/core.py", line 922, in do
    raise self._make_nicer_error(status_code=response.status_code, **payload) from None
databricks.sdk.core.DatabricksError: No file or directory exists on path /tmp/a.file.

Other Information

Additional context: N/A

geophpherie commented 3 months ago

I am encountering this error too. I believe the issue is that the .cp command treats the source path as a remote resource when it is in fact a local one.

https://github.com/databricks/databricks-sdk-py/blob/a714146d9c155dd1e3567475be78623f72028ee0/databricks/sdk/mixins/files.py#L576

    def _path(self, src):
        src = parse.urlparse(str(src))
        if src.scheme and src.scheme not in self.__ALLOWED_SCHEMES:
            raise ValueError(
                f'unsupported scheme "{src.scheme}". DBUtils in the SDK only supports local, root DBFS, and '
                'UC Volumes paths, not external locations or DBFS mount points.')
        if src.scheme == 'file':
            return _LocalPath(src.geturl())
        if src.path.startswith('/Volumes'):
            return _VolumesPath(self._files_api, src.geturl())
        return _DbfsPath(self._dbfs_api, src.geturl())

I believe it improperly handles the case where the scheme comes back as ''. An empty scheme is falsy, so it slips past the `if src.scheme and ...` check, doesn't match 'file', and falls through to the final return, meaning the path is automatically treated as a DBFS file instead of a local file.
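
To see the parsing concretely, here is a quick standalone check (a sketch, independent of the SDK):

    from urllib.parse import urlparse

    print(urlparse('/tmp/a.file').scheme)             # '' -- bare local path, no scheme
    print(urlparse('file:/tmp/a.file').scheme)        # 'file' -- routed to _LocalPath
    print(urlparse('dbfs:/FileStore/a.file').scheme)  # 'dbfs' -- routed to _DbfsPath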

One workaround right now is to pass the local file path as a URI: if you are using a Path object, call its .as_uri() method before passing it in as a string. This prepends the file scheme, so urllib parses it correctly and the path is treated as a local file.
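
For example, a sketch of that workaround on a POSIX machine (the paths are placeholders):

    from pathlib import Path
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()

    # .as_uri() yields 'file:///tmp/a.file', so urlparse sees scheme == 'file'
    src = Path('/tmp/a.file').as_uri()
    w.dbutils.fs.cp(src, 'dbfs:/FileStore/a.file')

Passing a plain string with the scheme already attached, e.g. 'file:/tmp/a.file', should route the same way.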

EDIT: Maybe another way to look at this: __ALLOWED_SCHEMES doesn't account for situations in which the scheme is an empty string '', which happens when you pass a bare path like /users/myuser/desktop, for example.

    __ALLOWED_SCHEMES = [None, 'file', 'dbfs']
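
For illustration, here is a hypothetical variant of _path along those lines (a sketch only, not the SDK's actual fix): it lists '' instead of None, since urlparse returns '' rather than None for bare paths, and spells out that such paths default to DBFS.

    def _path(self, src):
        src = parse.urlparse(str(src))
        # urlparse returns '' (never None) for bare paths like /users/myuser/desktop
        if src.scheme not in ('', 'file', 'dbfs'):
            raise ValueError(
                f'unsupported scheme "{src.scheme}". DBUtils in the SDK only supports local, '
                'root DBFS, and UC Volumes paths, not external locations or DBFS mount points.')
        if src.scheme == 'file':
            return _LocalPath(src.geturl())
        if src.path.startswith('/Volumes'):
            return _VolumesPath(self._files_api, src.geturl())
        # bare paths ('' scheme) fall through to DBFS; use a file: URI for local files
        return _DbfsPath(self._dbfs_api, src.geturl())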