IBM / jupyterlab-s3-browser

A JupyterLab extension for browsing S3-compatible object storage
Apache License 2.0

access s3 content from Python #79

Closed · PhE closed this issue 1 year ago

PhE commented 2 years ago

The S3 content is accessible from the browser pane on the left, but it is not visible to Python code running inside a notebook.

If I have a notebook.ipynb along with a data.txt file in the same folder, the following code in the notebook will fail:

open('data.txt').read()

I understand that the S3 content can't be exposed as a filesystem to the Python kernel, but we should have a way to access it from Python.

PhE commented 2 years ago

I managed to access the S3 content from my notebook with the s3fs module:

import os

import s3fs

# Build the filesystem from the same credentials the extension uses.
s3 = s3fs.S3FileSystem(
    key=os.environ['JUPYTERLAB_S3_ACCESS_KEY_ID'],
    secret=os.environ['JUPYTERLAB_S3_SECRET_ACCESS_KEY'],
    anon=False,
    client_kwargs={'endpoint_url': os.environ['JUPYTERLAB_S3_ENDPOINT']},
)
s3.ls('my-bucket')  # list the bucket's contents

It works, but is there a simpler way to get this s3fs access?

I don't want my users to have to deal with S3 credentials when they access the S3 content from their notebooks.
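One option would be to ship a small helper module in the user image so that notebooks never touch the credentials directly. A minimal sketch, assuming the kernel sees the same JUPYTERLAB_S3_* environment variables (the storage_helper module name is made up for illustration):

# storage_helper.py -- hypothetical helper pre-installed in the user image
import os

import s3fs

def get_s3() -> s3fs.S3FileSystem:
    """Return an S3 filesystem built from environment variables,
    so notebook users never handle the credentials themselves."""
    return s3fs.S3FileSystem(
        key=os.environ['JUPYTERLAB_S3_ACCESS_KEY_ID'],
        secret=os.environ['JUPYTERLAB_S3_SECRET_ACCESS_KEY'],
        anon=False,
        client_kwargs={'endpoint_url': os.environ['JUPYTERLAB_S3_ENDPOINT']},
    )

A notebook could then do storage_helper.get_s3().open('my-bucket/data.txt').read() without ever seeing a key.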

TerenceLiu98 commented 1 year ago

@PhE I am also facing this problem. Have you solved it? If so, could you please share the solution? Thanks! :)

PhE commented 1 year ago

@TerenceLiu98 I solved it with a different approach: I use rclone to mount the S3 bucket. I start a background process with the rclone command. In my case this is a Kubernetes pod with two containers (one for rclone, the other for Jupyter).

It is more stable than s3fs, and users can browse the files as usual.
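For reference, the background process amounts to something like the sketch below. The remote name mys3, the mount point, and the use of rclone's RCLONE_CONFIG_<REMOTE>_<KEY> environment variables are illustrative assumptions; a regular rclone.conf works just as well:

import os
import subprocess

# Define a "mys3" remote via environment variables instead of rclone.conf,
# reusing the credentials already present in the environment.
env = dict(
    os.environ,
    RCLONE_CONFIG_MYS3_TYPE='s3',
    RCLONE_CONFIG_MYS3_ACCESS_KEY_ID=os.environ['JUPYTERLAB_S3_ACCESS_KEY_ID'],
    RCLONE_CONFIG_MYS3_SECRET_ACCESS_KEY=os.environ['JUPYTERLAB_S3_SECRET_ACCESS_KEY'],
    RCLONE_CONFIG_MYS3_ENDPOINT=os.environ['JUPYTERLAB_S3_ENDPOINT'],
)

# Mount the bucket in the background; the kernel then sees ordinary files.
subprocess.Popen(
    ['rclone', 'mount', 'mys3:my-bucket', '/home/jovyan/s3',
     '--vfs-cache-mode', 'writes'],
    env=env,
)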

PhE commented 1 year ago

Closing this, as the workaround described above works for me.

TerenceLiu98 commented 1 year ago

> @TerenceLiu98 I solved it with a different approach: I use rclone to mount the S3 bucket. I start a background process with the rclone command. In my case this is a Kubernetes pod with two containers (one for rclone, the other for Jupyter).
>
> It is more stable than s3fs, and users can browse the files as usual.

I solved the problem in a similar way, but with juicefs instead of rclone. However, my environment is Docker only, so I could not bind them into one pod. juicefs needs the privileged option since it is FUSE-based; is rclone the same?

PhE commented 1 year ago

rclone does not require the privileged option; a simple rclone mount is enough. We also use rclone sync as a cheap local file cache. A good point compared to juicefs is that the file/folder hierarchy is preserved in the S3 bucket: what you see in the bucket are the real file and folder names, not cryptic, unusable chunks.
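The cache part is just a one-shot copy, along these lines (the remote name and local path are again illustrative):

import subprocess

# Pull the bucket down to local disk once, as a cheap read cache.
subprocess.run(['rclone', 'sync', 'mys3:my-bucket', '/srv/s3-cache'], check=True)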

TerenceLiu98 commented 11 months ago

> rclone does not require the privileged option; a simple rclone mount is enough. We also use rclone sync as a cheap local file cache. A good point compared to juicefs is that the file/folder hierarchy is preserved in the S3 bucket: what you see in the bucket are the real file and folder names, not cryptic, unusable chunks.

I found a CSI driver, k8s-csi-s3, that uses S3 as the StorageClass and relies on geesefs (which may have better performance than rclone) for POSIX access. It is a good fit for scenarios combining JupyterLab and S3. I have tried both k8s-csi-s3 and the juicefs CSI driver, and both work well; however, juicefs needs an extra database for metadata storage.