bcdev / nc2zarr

A Python tool that converts NetCDF files to Zarr format
MIT License

Allow private S3 bucket as inputs #60

Closed vietnguyengit closed 1 year ago

vietnguyengit commented 2 years ago

I hope this contribution is helpful; it is meant as a thank-you for providing nc2zarr as open source.

I needed to load NetCDF files from a private S3 bucket and was able to solve that with the changes in this commit.

If the NetCDF files are read from a public bucket, change anon to True in the following line, then re-run python setup.py develop:

s3 = s3fs.S3FileSystem(anon=False)
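As a rough sketch of how the hard-coded flag could become a parameter instead of a source edit: the helper names below (`s3_filesystem_kwargs`, `make_s3_filesystem`) are hypothetical and not part of nc2zarr or of the commit. Only `s3fs.S3FileSystem(anon=...)` comes from the line above.

```python
from typing import Any, Dict


def s3_filesystem_kwargs(public_bucket: bool = False) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for s3fs.S3FileSystem."""
    # anon=True skips the credential lookup (public buckets);
    # anon=False uses the standard AWS credential chain (private buckets).
    return {"anon": public_bucket}


def make_s3_filesystem(public_bucket: bool = False):
    """Hypothetical helper: create the filesystem object used to read inputs."""
    # Imported lazily so s3_filesystem_kwargs() works without s3fs installed.
    import s3fs

    return s3fs.S3FileSystem(**s3_filesystem_kwargs(public_bucket))
```

With something like this, switching between public and private buckets would not require reinstalling the package.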

I believe S3 input support could be made as comprehensive as nc2zarr's existing S3 output support. However, that is beyond what I set out to do, so I didn't invest much time in it.

I deployed and ran nc2zarr directly from an AWS SageMaker notebook that has read/write access to the target private S3 bucket. To access the S3 bucket from a different workstation, follow the AWS authentication guidelines (e.g. ~/.aws/credentials) and it should be fine. If specific IAM roles are required, set the environment variable with export AWS_PROFILE=<profile-name> (the profile itself is defined in ~/.aws/config) before running nc2zarr.
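For anyone driving nc2zarr from a Python script or notebook rather than a shell, the same profile selection can be sketched in-process. `use_aws_profile` is a hypothetical helper, not something nc2zarr provides, and it assumes the variable is set before any S3 client is created.

```python
import os
from typing import MutableMapping, Optional


def use_aws_profile(
    profile_name: str,
    environ: Optional[MutableMapping[str, str]] = None,
) -> None:
    """Hypothetical helper: select a named AWS profile for this process."""
    target = os.environ if environ is None else environ
    # Same effect as `export AWS_PROFILE=<profile-name>` in the shell.
    target["AWS_PROFILE"] = profile_name
```

The optional `environ` argument is only there so the helper can be tried against a plain dict without touching the real environment.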

With these changes, S3 links can be provided to the yml config files either as:

Thank you and have a lovely day! - From Tasmania, Australia

vietnguyengit commented 2 years ago

Hey @forman, thanks for the review. No worries at all, yes the changes were specialised for my need, however, it would be a nice feature to have as NetCDF files on the S3 bucket are common. I'll generalise the cases and tidy up the code after finishing the project we need nc2zarr for now.

vietnguyengit commented 2 years ago

Changed this PR to draft.

forman commented 2 years ago

Hey @forman, thanks for the review. No worries at all, yes the changes were specialised for my need, however, it would be a nice feature to have as NetCDF files on the S3 bucket are common.

@pont-us and I fully agree that this is a nice feature. I also wasn't aware that xr.open_dataset() with the HDF5 driver can read NetCDF directly from S3 - very good.

I'll generalise the cases and tidy up the code after finishing the project we need nc2zarr for now.

Great, we'll guide you.

vietnguyengit commented 2 years ago

No worries. Yes, that's why I have to explicitly declare h5netcdf as the engine when the files come from an S3 bucket. We'll keep in touch.
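A minimal sketch of the read path discussed above, assuming xarray, s3fs and h5netcdf are installed. The function names (`is_s3_path`, `open_netcdf_input`) are made up for illustration, and this is not the code from the commit.

```python
def is_s3_path(path: str) -> bool:
    """Hypothetical helper: true if the input path is an S3 URL."""
    return path.startswith("s3://")


def open_netcdf_input(path: str, anon: bool = False):
    """Hypothetical helper: open a NetCDF file from S3 or from local disk."""
    # Imported lazily so is_s3_path() works without xarray installed.
    import xarray as xr

    if is_s3_path(path):
        import s3fs

        fs = s3fs.S3FileSystem(anon=anon)
        # xarray receives a file-like object here, so the engine is named
        # explicitly; h5netcdf can read from file-like objects.
        return xr.open_dataset(fs.open(path, "rb"), engine="h5netcdf")

    # Local paths keep xarray's default engine selection.
    return xr.open_dataset(path)
```

Note that the file object has to stay open for as long as the dataset is read lazily, so a more general version would need to manage its lifetime.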