AODN Cloud Optimised Conversion
A tool designed to convert IMOS NetCDF and CSV files into Cloud Optimised formats such as Zarr and Parquet.
Documentation
Visit the documentation on ReadTheDocs for detailed information.
Key Features
- Conversion of CSV/NetCDF to Cloud Optimised format (Zarr/Parquet)
- Clustering capability:
  - Local Dask cluster
  - Remote Coiled cluster
  - Cluster settings are driven by configuration and can easily be overridden
  - Zarr: gridded datasets are processed in batches and in parallel with xarray.open_mfdataset
  - Parquet: tabular files are processed in batches and in parallel, each as an independent task submitted as a future
- Reprocessing:
  - Zarr: reprocessing is achieved by writing to specific regions with slices. Non-contiguous regions are handled
  - Parquet: reprocessing is done via pyarrow's internal overwriting function, but can also be forced if an input file has changed significantly
- Chunking:
  - Parquet: to facilitate querying geospatial data, polygon and timestamp slices are created as partitions
  - Zarr: done via dataset configuration
- Metadata:
  - Parquet: metadata is created as a sidecar _metadata.parquet file
- Unit testing: the tests are very close to integration tests, since a local cluster is used to create Cloud Optimised files
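The "independent task per file" pattern mentioned for Parquet can be illustrated with a minimal sketch. It uses the standard library's concurrent.futures as a stand-in for a Dask client (this is not the project's own code), and `convert_file` and `convert_batch` are hypothetical names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_file(path: str) -> str:
    """Hypothetical stand-in for converting one tabular file to Parquet."""
    return path.rsplit(".", 1)[0] + ".parquet"


def convert_batch(paths: list[str], max_workers: int = 4) -> dict[str, str]:
    """Submit each file as its own task and collect the results."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One future per input file; tasks run independently of each other.
        futures = {pool.submit(convert_file, p): p for p in paths}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    print(convert_batch(["a.csv", "b.nc"]))
```

A Dask client exposes a similar submit-and-gather workflow, so the same structure should carry over to a local or remote cluster.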
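For Zarr reprocessing, one way to handle non-contiguous regions is to group the positions to rewrite into contiguous runs and write one slice per run. The sketch below shows only that grouping step. It is an assumption about how this could be done, not the library's actual implementation, and `contiguous_slices` is a hypothetical helper.

```python
def contiguous_slices(indices: list[int]) -> list[slice]:
    """Group integer positions into slices, one per contiguous run."""
    slices: list[slice] = []
    if not indices:
        return slices
    ordered = sorted(set(indices))
    start = prev = ordered[0]
    for idx in ordered[1:]:
        if idx != prev + 1:
            # Gap found: close the current run (slice end is exclusive).
            slices.append(slice(start, prev + 1))
            start = idx
        prev = idx
    slices.append(slice(start, prev + 1))
    return slices


if __name__ == "__main__":
    print(contiguous_slices([3, 4, 5, 9, 10, 20]))
```

Each resulting slice could then be used as a region when writing with xarray, for example `region={"time": s}` in `Dataset.to_zarr`, assuming `time` is the dimension being rewritten.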
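The Parquet partitioning idea (a timestamp slice plus a polygon per record) can be sketched as follows. The yearly period, the 5-degree grid cell and the function names are all assumptions chosen for illustration; the real values come from the dataset configuration.

```python
import math
from datetime import datetime, timezone


def timestamp_partition(ts: datetime) -> int:
    """Epoch seconds of the start of the UTC year containing a timezone-aware ts."""
    ts_utc = ts.astimezone(timezone.utc)
    year_start = datetime(ts_utc.year, 1, 1, tzinfo=timezone.utc)
    return int(year_start.timestamp())


def polygon_partition(lon: float, lat: float, cell_deg: float = 5.0) -> str:
    """WKT polygon of the (assumed) regular grid cell containing the point."""
    x0 = math.floor(lon / cell_deg) * cell_deg
    y0 = math.floor(lat / cell_deg) * cell_deg
    x1, y1 = x0 + cell_deg, y0 + cell_deg
    return (
        f"POLYGON (({x0:g} {y0:g}, {x1:g} {y0:g}, "
        f"{x1:g} {y1:g}, {x0:g} {y1:g}, {x0:g} {y0:g}))"
    )


if __name__ == "__main__":
    when = datetime(2023, 6, 15, tzinfo=timezone.utc)
    print(timestamp_partition(when), polygon_partition(147.3, -42.9))
```

Records sharing the same pair of values would land in the same partition, so a query filtered by time and area only needs to read the matching partitions.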
Quick Guide
Installation
Requirements:
- Python >= 3.10.14
- AWS SSO to push files to S3
- An account on Coiled for remote clustering (Optional)
Automatic installation of the latest wheel release
curl -s https://raw.githubusercontent.com/aodn/aodn_cloud_optimised/main/install.sh | bash
Otherwise, go to the release page.
Development
See ReadTheDocs - Dev
Usage
See ReadTheDocs - Usage
Notebooks
A curated list of Jupyter Notebooks ready to be loaded in Google Colab and Binder. Click on the badge above.