IBM / jupyterlab-s3-browser

A JupyterLab extension for browsing S3-compatible object storage
Apache License 2.0

access s3 content from Python #79

Closed · PhE closed this issue 1 year ago

PhE commented 2 years ago

The S3 content is accessible from the browser pane on the left, but it is not visible to Python code running inside a notebook.

If I have a notebook.ipynb along with a data.txt file in the same folder, the following code in the notebook will fail:

open('data.txt').read()

I understand that the S3 content can't be exposed as a filesystem to the Python kernel, but we should have a way to access it from Python.

PhE commented 2 years ago

I managed to access the S3 content from my notebook with the s3fs module:

import os

import s3fs

# Build the filesystem from the same credentials the extension uses.
s3 = s3fs.S3FileSystem(
    key=os.environ['JUPYTERLAB_S3_ACCESS_KEY_ID'],
    secret=os.environ['JUPYTERLAB_S3_SECRET_ACCESS_KEY'],
    anon=False,
    client_kwargs={'endpoint_url': os.environ['JUPYTERLAB_S3_ENDPOINT']},
)
s3.ls('my-bucket')  # list the bucket's contents

It works, but is there a simpler way to get this s3fs access?

I don't want my users to have to deal with S3 credentials when they access the S3 content from their notebooks.
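One option would be to ship a small helper module in the user image so that notebooks never touch the credentials directly. A minimal sketch, assuming the kernel sees the same JUPYTERLAB_S3_* environment variables (the storage_helper module name is made up for illustration):

# storage_helper.py -- hypothetical helper pre-installed in the user image
import os

import s3fs

def get_s3() -> s3fs.S3FileSystem:
    """Return an S3 filesystem built from environment variables,
    so notebook users never handle the credentials themselves."""
    return s3fs.S3FileSystem(
        key=os.environ['JUPYTERLAB_S3_ACCESS_KEY_ID'],
        secret=os.environ['JUPYTERLAB_S3_SECRET_ACCESS_KEY'],
        anon=False,
        client_kwargs={'endpoint_url': os.environ['JUPYTERLAB_S3_ENDPOINT']},
    )

A notebook could then do storage_helper.get_s3().open('my-bucket/data.txt').read() without ever seeing a key.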

TerenceLiu98 commented 1 year ago

@PhE I am also facing this problem. Have you solved it? If so, could you please share the solution? Thanks! :)

PhE commented 1 year ago

@TerenceLiu98 I solved it with a different approach: I use rclone to mount the S3 bucket. I start a background process with the rclone command. In my case this is a Kubernetes pod with two containers (one for rclone, the other for Jupyter).

It is more stable than s3fs, and users can browse the files as usual.
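For reference, the background process amounts to something like the sketch below. The remote name mys3, the mount point, and the use of rclone's RCLONE_CONFIG_<REMOTE>_<KEY> environment variables are illustrative assumptions; a regular rclone.conf works just as well:

import os
import subprocess

# Define a "mys3" remote via environment variables instead of rclone.conf,
# reusing the credentials already present in the environment.
env = dict(
    os.environ,
    RCLONE_CONFIG_MYS3_TYPE='s3',
    RCLONE_CONFIG_MYS3_ACCESS_KEY_ID=os.environ['JUPYTERLAB_S3_ACCESS_KEY_ID'],
    RCLONE_CONFIG_MYS3_SECRET_ACCESS_KEY=os.environ['JUPYTERLAB_S3_SECRET_ACCESS_KEY'],
    RCLONE_CONFIG_MYS3_ENDPOINT=os.environ['JUPYTERLAB_S3_ENDPOINT'],
)

# Mount the bucket in the background; the kernel then sees ordinary files.
subprocess.Popen(
    ['rclone', 'mount', 'mys3:my-bucket', '/home/jovyan/s3',
     '--vfs-cache-mode', 'writes'],
    env=env,
)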

PhE commented 1 year ago

Closing this, as the workaround described above works for me.

TerenceLiu98 commented 1 year ago

> @TerenceLiu98 I solved it with a different approach: I use rclone to mount the S3 bucket. I start a background process with the rclone command. In my case this is a Kubernetes pod with two containers (one for rclone, the other for Jupyter).
>
> It is more stable than s3fs, and users can browse the files as usual.

I solved the problem in a similar way, but with juicefs instead of rclone. However, my environment is Docker only, so I could not bind them into one pod. juicefs needs the privileged option since it is FUSE-based; is rclone the same?

PhE commented 1 year ago

rclone does not require the privileged option; a simple rclone mount is enough. We also use rclone sync as a cheap local file cache. A good point compared to juicefs is that the file/folder hierarchy is preserved in the S3 bucket: what you see in the bucket are the real file and folder names, not cryptic, unusable chunks.
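The cache part is just a one-shot copy, along these lines (the remote name and local path are again illustrative):

import subprocess

# Pull the bucket down to local disk once, as a cheap read cache.
subprocess.run(['rclone', 'sync', 'mys3:my-bucket', '/srv/s3-cache'], check=True)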

TerenceLiu98 commented 11 months ago

> rclone does not require the privileged option; a simple rclone mount is enough. We also use rclone sync as a cheap local file cache. A good point compared to juicefs is that the file/folder hierarchy is preserved in the S3 bucket: what you see in the bucket are the real file and folder names, not cryptic, unusable chunks.

I found a CSI driver, k8s-csi-s3, that uses S3 as the StorageClass and relies on geesefs (which may have better performance than rclone) for POSIX access. It is a good fit for scenarios combining JupyterLab and S3. I have tried both k8s-csi-s3 and the juicefs CSI driver, and both work well; however, juicefs needs an extra database for metadata storage.