The Climate and Hazard Program (P3) of ACS intends to build a data catalogue with the following aims:
The 2020 Bushfire Royal Commission recommended (Recommendation 4.1 – National disaster risk information, and Recommendation 4.2 – Common information platforms and shared technologies) that “Australian, state and territory governments should prioritise the implementation of harmonised data governance and national data standards” and “should create common information platforms and share technologies to enable collaboration in the production, analysis, access, and exchange of information, data and knowledge about climate and disaster risks.”
The catalogue will initially provide data tools and accompanying tutorial documentation for ACS staff working on Australian NCI resources with Level B: 'shared data' (project ia39). It can also serve as a template to be adapted for Level C: 'public data' at NCI and for other future publicly available data sources.
intake-esm sub-catalogues - one for each “product” / “experiment” / “model” data source.
ecgtools
intake-dataframe-catalog
The approach should be informed by the principles that will underpin the ACS data governance framework being developed by the ACS Data Governance Group:
netCDF files corresponding to a specific variable, model, and time period and compute simple seasonal climatology using small resources.
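As a rough illustration of what a "simple seasonal climatology" involves, the sketch below averages monthly values into the four meteorological seasons using plain Python. It is not taken from the demo notebook; the function name `seasonal_climatology` and the temperature values are hypothetical, and a real workflow would more likely operate on xarray datasets loaded through the catalogue.

```python
from collections import defaultdict
from statistics import mean

# Map each calendar month to its meteorological season label.
SEASON_OF_MONTH = {
    12: "DJF", 1: "DJF", 2: "DJF",
    3: "MAM", 4: "MAM", 5: "MAM",
    6: "JJA", 7: "JJA", 8: "JJA",
    9: "SON", 10: "SON", 11: "SON",
}


def seasonal_climatology(records):
    """Average (year, month, value) records into one mean per season.

    All years are pooled, so the result is a climatology rather than
    a per-year seasonal time series.
    """
    by_season = defaultdict(list)
    for _year, month, value in records:
        by_season[SEASON_OF_MONTH[month]].append(value)
    return {season: mean(values) for season, values in by_season.items()}


# Hypothetical monthly-mean temperatures (degrees C), January to December.
year_2000 = [24.0, 23.0, 21.0, 18.0, 15.0, 12.0,
             11.0, 13.0, 16.0, 18.0, 20.0, 22.0]
# A second, uniformly warmer hypothetical year.
year_2001 = [value + 2.0 for value in year_2000]

records = [(2000, month, value) for month, value in enumerate(year_2000, start=1)]
records += [(2001, month, value) for month, value in enumerate(year_2001, start=1)]

climatology = seasonal_climatology(records)
print(climatology)
```

Because only a handful of values per season are averaged at a time, this kind of calculation stays cheap once the catalogue has narrowed the search to the files actually needed.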
[ACS-catalogue-demo.ipynb](https://github.com/AusClimateService/data-catalogue/blob/main/notebooks/ACS-catalogue-demo.ipynb): example workflow that uses nested intake-esm catalogues in a root intake-dataframe-catalog to search across more than 76,000 netCDF file paths and 116 TB of data from 2 CCAM runs. It demonstrates the basics of searching the root catalogue to find the needed data source and filtering that down to the 64 netCDF files required for the variable, type of run, and time period of choice. Using a "small" ARE cluster at NCI (2 CPU & 9 GB RAM), we find and load the selected 1.6 GB of data, then make a few basic calculations and plots.
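The two-stage search described above (root catalogue first, then the chosen sub-catalogue) can be sketched in plain Python. This is a conceptual illustration only: it does not use the intake-esm or intake-dataframe-catalog APIs, and every source name, variable, and path below is a hypothetical placeholder.

```python
# Root catalogue: one row per data source, with searchable metadata.
# (Hypothetical entries, standing in for the two CCAM runs.)
ROOT_CATALOGUE = [
    {"source": "ccam_run_a", "model": "CCAM", "variables": {"tas", "pr"}},
    {"source": "ccam_run_b", "model": "CCAM", "variables": {"tas"}},
]

# Sub-catalogues: one row per file, keyed by source name.
SUB_CATALOGUES = {
    "ccam_run_a": [
        {"path": "/example/run_a/tas_1990.nc", "variable": "tas", "year": 1990},
        {"path": "/example/run_a/tas_1991.nc", "variable": "tas", "year": 1991},
        {"path": "/example/run_a/pr_1990.nc", "variable": "pr", "year": 1990},
    ],
    "ccam_run_b": [
        {"path": "/example/run_b/tas_1990.nc", "variable": "tas", "year": 1990},
    ],
}


def find_sources(root, variable):
    """Stage 1: return the names of sources that provide the variable."""
    return [row["source"] for row in root if variable in row["variables"]]


def find_files(files, variable, first_year, last_year):
    """Stage 2: filter one sub-catalogue down to the files required."""
    return [
        row["path"]
        for row in files
        if row["variable"] == variable and first_year <= row["year"] <= last_year
    ]


sources = find_sources(ROOT_CATALOGUE, "pr")
paths = find_files(SUB_CATALOGUES[sources[0]], "pr", 1990, 1991)
print(sources, paths)
```

Keeping the root table small (one row per source) is what lets the first search stay fast even when the sub-catalogues together index tens of thousands of files.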
intake-esm sub-catalogue
Example workflows that show how individual sub-catalogues could be built and maintained.
Example workflow that shows how a root ACS catalogue that nests individual sub-catalogues could be built and maintained.
An ACS-demo-environment is provided (https://github.com/AusClimateService/data-catalogue/blob/main/ACS-demo-environment.yml). Note the current issue (as of May 2023) that requires pinning netcdf4 to 1.6.0.
Repo: https://github.com/AusClimateService/data-catalogue
intake: https://intake.readthedocs.io/en/latest/
intake-esm: https://intake-esm.readthedocs.io/en/stable/
ecgtools: https://ecgtools.readthedocs.io/en/latest/
intake-dataframe-catalog: https://intake-dataframe-catalog.readthedocs.io/en/latest/
intake?
Taking the pain out of data access and distribution
Intake is a lightweight package for finding, investigating, loading and disseminating data. It will appeal to different groups, but is useful for all and acts as a common platform that everyone can use to smooth the progression of data from developers and providers to users.
intake-esm?
A data cataloging utility built on top of intake, pandas, and xarray
Computer simulations of the Earth’s climate and weather generate huge amounts of data. These data are often persisted on HPC systems or in the cloud across multiple data assets of a variety of formats (netCDF, zarr, etc…). Finding, investigating, loading these data assets into compute-ready data containers costs time and effort. The data user needs to know what data sets are available, the attributes describing each data set, before loading a specific data set and analyzing it.
Finding, investigating, and loading these assets into data array containers such as xarray can be a daunting task due to the large number of files a user may be interested in. Intake-esm aims to address these issues by providing the necessary functionality for searching, discovering, and accessing/loading data.
ecgtools?
ESM Catalog Generation tools
The critical requirement for using [intake-esm](https://github.com/intake/intake-esm) is having a data catalog. The ecgtools package enables you to build data catalogs to be read in by [intake-esm](https://github.com/intake/intake-esm), which enables a user to easily search, discover, and access datasets they are interested in using.
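To make the idea of catalogue generation concrete, here is a minimal plain-Python sketch of the general technique: turn each file path into a row of searchable attributes. It does not use the ecgtools interface itself, and the directory layout `<model>/<experiment>/<variable>_<year>.nc`, the function names, and the example paths are all assumptions made for illustration.

```python
from pathlib import PurePosixPath


def parse_path(path):
    """Turn one file path into a catalogue row.

    Assumes a hypothetical layout ending in
    '<model>/<experiment>/<variable>_<year>.nc'.
    """
    parsed = PurePosixPath(path)
    variable, year = parsed.stem.split("_")
    return {
        "path": path,
        "model": parsed.parts[-3],
        "experiment": parsed.parts[-2],
        "variable": variable,
        "year": int(year),
    }


def build_catalogue(paths):
    """Parse every netCDF path and skip files with other extensions."""
    return [parse_path(path) for path in paths if path.endswith(".nc")]


# Hypothetical file listing, as might come from walking a data directory.
listing = [
    "/example/CCAM/historical/tas_1990.nc",
    "/example/CCAM/historical/pr_1990.nc",
    "/example/CCAM/historical/README.txt",
]

rows = build_catalogue(listing)
for row in rows:
    print(row)
```

The resulting rows are the kind of table a search tool can filter by column, which is why a consistent file-naming convention makes catalogues much easier to build and maintain.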
intake-dataframe-catalog?
A simple intake plugin for a searchable table of intake sources and associated metadata.
Intake already provides the ability to nest sources in a catalog and search across them. However, data discoverability is limited in the case of very large numbers of nested sources, and the search functionality does not readily provide the ability to execute complex searches on nested source metadata. intake-dataframe-catalog aims to provide a very simple catalog of intake sources that emphasises source search and discoverability.