aaarendt / HMA_Validation

Validation of water balance estimates for HMA

Explore conversion to zarr #2

Open · aaarendt opened this issue 5 years ago

aaarendt commented 5 years ago

Converting from NetCDF to zarr will enable us to make use of distributed workflows in Pangeo. Can we explore converting the MAR data? A few options include:
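Whichever route we take, the end state is being able to read the store lazily from the Pangeo cluster with dask. A rough sketch of what that looks like (the store name below is only a placeholder):

```python
import s3fs
import xarray as xr

# Read a Zarr store lazily from S3; the store name is a placeholder.
fs = s3fs.S3FileSystem()
store = s3fs.S3Map(
    root="s3://pangeo-data-upload-oregon/icesat2/HMA_Validation/Zarr/MAR/example.zarr",
    s3=fs,
    check=False,
)
ds = xr.open_zarr(store)

# Computations are then chunked across dask workers instead of a single machine.
climatology = ds.mean(dim="TIME").compute()
```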

liuzheng-arctic commented 5 years ago

I have converted the NetCDF files of the MAR data to Zarr under s3://pangeo-data-upload-oregon/icesat2/HMA_Validation/Zarr/MAR/. I do have several questions:

1. Are we aiming to have our Python code perform the conversion (a) from local to S3, or (b) from S3 to S3? I used `xr.to_zarr` to convert the local copy of the MAR data on Pangeo to S3, i.e. approach (a); a rough sketch of that approach is below. I tried to read the MAR data on S3 directly for the conversion, but `xr.load_dataset` does not work properly, even with `engine='h5netcdf'`. I have seen mixed results from other people trying to read NetCDF with xarray directly from S3. I can use h5py to access the files, but converting to Zarr that way would take extra work.
2. The choice of chunk size for the Zarr data. I am not sure how this data is going to be used, so I made an arbitrary choice, `chunks={'TIME':31,'X11_210':50,'Y11_190':30}`, which splits the original data slice into 12x4x6 chunks. It takes 12 minutes to convert one file; using smaller chunks increases the processing time accordingly.
3. Parallelization. The conversion is done sequentially so far. I will look into using dask or joblib for the conversion later.
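For reference, approach (a) boils down to something like the sketch below; the chunk sizes are the ones mentioned above, and the output store name is illustrative rather than the exact one I used:

```python
import s3fs
import xarray as xr

# Open the local MAR file lazily with the chunking described above.
ds = xr.open_dataset(
    "HMA_MAR3_5_ICE.2000.01-12.h22.nc",
    chunks={"TIME": 31, "X11_210": 50, "Y11_190": 30},
)

# Map an S3 prefix to a mutable mapping that zarr can write into
# (store name here is illustrative).
fs = s3fs.S3FileSystem()
store = s3fs.S3Map(
    root="s3://pangeo-data-upload-oregon/icesat2/HMA_Validation/Zarr/MAR/"
         "HMA_MAR3_5_ICE.2000.01-12.h22.zarr",
    s3=fs,
    check=False,
)

# Stream the dataset into the S3-backed Zarr store, one chunk at a time.
ds.to_zarr(store, mode="w")
```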

aaarendt commented 5 years ago

Some of the files I'm acquiring soon will be too large to work with locally, so we'll need to explore options for doing the conversion entirely in the cloud.

@jonahjoughin has direct experience with this, and I hope he can assist in answering your questions on conversion methods and on optimizing the chunk sizes.

@liuzheng-arctic can you push your code for doing the conversion to this repo? I'd like to do some testing but am unsure what the name of the zarr store is.

liuzheng-arctic commented 5 years ago

I pushed the conversion script to my branch. Regarding converting NetCDF to Zarr directly in the cloud: the issue seems to be with the HDF5 library. I can open NetCDF files on S3 and get metadata using h5py, but reading even the 1-D "TIME" variable takes forever; it seems to be downloading the whole file into memory. According to an article last year on the AWS Big Data blog, the HDF5 library has to download the entire file from S3 before it can read the actual data. I am not sure how much has changed since then. Here is the link: https://aws.amazon.com/blogs/big-data/power-from-wind-open-data-on-aws/ There also seem to be workarounds using some experimental libraries: http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud
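This is roughly the pattern in question, as a sketch rather than my exact code (the raw NetCDF path here is illustrative):

```python
import h5py
import s3fs

fs = s3fs.S3FileSystem()
ncpath = ("pangeo-data-upload-oregon/icesat2/HMA_Validation/"
          "MAR/HMA_MAR3_5_ICE.2000.01-12.h22.nc")  # illustrative path

# Metadata access through h5py over an s3fs file object is fine ...
with fs.open(ncpath, "rb") as f:
    h5 = h5py.File(f, "r")
    print(list(h5.keys()))
    # ... but pulling actual data is where it crawls, e.g.:
    # time = h5["TIME"][:]

# The xarray + h5netcdf route hits the same wall:
# with fs.open(ncpath, "rb") as f:
#     ds = xr.open_dataset(f, engine="h5netcdf")
```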

I just found out that with the current version of zarr on Pangeo (2.2.0), metadata in the NetCDF files has to be packed into the encoding. I tried @jonahjoughin's code (local to S3) and it runs smoothly, but the Zarr stores are not created on S3; I will have to look into it later. @jonahjoughin's code can also convert files already on S3, but it downloads the whole file locally before conversion.
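If I understand it correctly, the idea is to move per-variable settings into the `encoding` argument of `to_zarr`, along these lines (the Blosc/zstd compressor here is only an example):

```python
import xarray as xr
import zarr

ds = xr.open_dataset(
    "HMA_MAR3_5_ICE.2000.01-12.h22.nc",
    chunks={"TIME": 31, "X11_210": 50, "Y11_190": 30},
)

# Per-variable settings (compression, fill values, chunk layout) go into
# the encoding dict rather than into dataset attributes.
encoding = {
    var: {"compressor": zarr.Blosc(cname="zstd", clevel=3)}
    for var in ds.data_vars
}

ds.to_zarr("HMA_MAR3_5_ICE.2000.01-12.h22.zarr", mode="w", encoding=encoding)
```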

liuzheng-arctic commented 5 years ago

Do I need some credentials when accessing S3 in Python? If I have `check=False` in `zstore = s3fs.S3Map(root=zpath, s3=fs, check=False)`, I get the error `ClientError: An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied`. If `check=True`, the error message is gone. That flag is just a check for writability by a touch, so I would expect the opposite behavior if I really had an access issue.

If I ignore it and use `check=True`, I can write the Zarr dataset to S3. But when I try to read it back with `ds_zarr = zarr.open(store=zstore)`, using `zpath = 's3://pangeo-data-upload-oregon/icesat2/HMA_Validation/Zarr_test/MAR/HMA_MAR3_5_ICE.2000.01-12.h22.nc.zarr'`, `ds_zarr` is empty, as if I had given the wrong store name. When I run the conversion locally, zarr has no problem opening the converted dataset, so I must be doing something wrong in handling S3.
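For reference, here is roughly the write/read round trip in question; this is a sketch rather than my exact code, and the `xr.open_zarr` call at the end is just an alternative read worth comparing against raw `zarr.open`:

```python
import s3fs
import xarray as xr
import zarr

fs = s3fs.S3FileSystem()
zpath = ("s3://pangeo-data-upload-oregon/icesat2/HMA_Validation/"
         "Zarr_test/MAR/HMA_MAR3_5_ICE.2000.01-12.h22.nc.zarr")

# check=True performs the touch-based writability test described above.
zstore = s3fs.S3Map(root=zpath, s3=fs, check=True)

# Writing the converted dataset works:
# ds.to_zarr(zstore, mode="w")

# Reading it back with raw zarr is where the store comes up empty for me:
ds_zarr = zarr.open(store=zstore)
print(list(ds_zarr.keys()))

# Reading through xarray's zarr backend is worth comparing:
ds_xr = xr.open_zarr(zstore)
print(ds_xr)
```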