TGSAI / mdio-cpp

A cloud-native, scalable C++ storage engine for various types of energy data.
Apache License 2.0

Welcome to MDIO - a descriptive format for energy data that is intended to reduce storage costs, improve the efficiency of I/O and make energy data and workflows understandable and reproducible.

MDIO schema definitions here.

Required tools

Optional tools (Code quality control)

Optional tools (Integration)

Getting Started

First clone the MDIO v1.0 library:

This project is built with CMake and requires CMake 3.24 or later. The build is configured to fetch and install its third-party dependencies. To build MDIO, clone the repository and create a build directory:

$ mkdir build
$ cd build
# NOTE: "CMake Deprecation Warning at build/_deps/nlohmann_json_schema_validator-src/CMakeLists.txt:1" can safely be ignored
$ cmake ..

Each MDIO target has the prefix "mdio" in its name. To build the tests, run the following commands from the build directory:

$ make -j32 mdio_acceptance_test

The acceptance test validates that MDIO/C++ data can be read by Python's Xarray. For the test to pass, make sure your Python environment has Xarray installed, then run the acceptance test:

$ cd build/mdio/
$ ./mdio_acceptance_test

The dataset and variable components have their own test suites too:

$ make -j32 mdio_variable_test
$ make -j32 mdio_dataset_test

Each MDIO library provides an associated CMake alias, e.g. mdio::mdio, which can be used to link against MDIO in your project.

API Documentation

MDIO API documentation is currently provided with the MDIO library.

$ open mdio/docs/html/index.html

Key Features

Project Vision

Our vision is to provide a tool that not only simplifies the management of energy data but also enhances the quality and depth of energy analysis. By keeping units, dimensions, and other critical metadata with the data, MDIO ensures that every dataset is not just a collection of numbers but a rich, self-explaining narrative of energy insights.

Target Audience

MDIO is built for a wide range of users, including:

Project Roadmap

Phase 1: Adoption, Bug Fixes and Stability

Phase 2: I/O Performance Optimization

Phase 3: Cost Reduction and Efficiency

Phase 4: Feature Completeness and Compliance

Phase 5: Process Optimization

(dependency) Tensorstore

We use the Tensorstore library to provide a native C/C++ interface to Zarr. If you're familiar with the Python Dask library, Tensorstore has very similar semantics for manipulating data and executing operations asynchronously.
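
As a rough illustration of the asynchronous style described above, here is a minimal sketch that uses only the C++ standard library. It is an analogy, not the Tensorstore API: `load_chunk` and `sum_two_chunks` are hypothetical names, and `std::async` with `std::future` stands in for the library's own future type.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a slow read: returns `size` consecutive
// values starting at `first`.
std::vector<int> load_chunk(int first, int size) {
  std::vector<int> chunk(size);
  std::iota(chunk.begin(), chunk.end(), first);
  return chunk;
}

// Start two reads concurrently and block only when the results are needed.
long sum_two_chunks() {
  std::future<std::vector<int>> a =
      std::async(std::launch::async, load_chunk, 0, 4);
  std::future<std::vector<int>> b =
      std::async(std::launch::async, load_chunk, 4, 4);

  long total = 0;
  for (int v : a.get()) total += v;  // waits for the first read
  for (int v : b.get()) total += v;  // waits for the second read
  return total;
}
```

Both reads are in flight before either `get()` is called, which is the same "issue work first, wait later" pattern that Dask users will recognise.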

Tensorstore is used under an Apache 2.0 license.

Relevant features of the Tensorstore library are:

  1. Read/write Zarr data in memory, on disk, and in GCS buckets (Google Cloud Storage).
  2. Encode/decode data with basic compression codecs: Blosc, zlib, lz4, zstd and jpeg.
  3. Concurrency; multi-threaded ACID reads/writes.
  4. Objects designed with async futures/promises architecture.
  5. Logical array slicing operations.
  6. Basic iterators.
  7. Chunk-aligned iterators.
  8. Informative error messages and exception handling.
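
To make the idea of chunk-aligned iteration from the list above concrete, here is a minimal sketch along a single dimension. It is not Tensorstore's iterator API: `Range` and `chunk_aligned_ranges` are hypothetical names, and the code only shows how a requested index range can be split so that no piece crosses a chunk boundary.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Half-open index range [first, second) along one dimension.
using Range = std::pair<std::int64_t, std::int64_t>;

// Split [start, stop) into pieces that never cross a chunk boundary.
// Assumes chunk_size > 0 and 0 <= start <= stop.
std::vector<Range> chunk_aligned_ranges(std::int64_t start, std::int64_t stop,
                                        std::int64_t chunk_size) {
  std::vector<Range> ranges;
  std::int64_t begin = start;
  while (begin < stop) {
    // First index of the chunk that follows the one containing `begin`.
    std::int64_t chunk_end = (begin / chunk_size + 1) * chunk_size;
    std::int64_t end = std::min(chunk_end, stop);
    ranges.emplace_back(begin, end);
    begin = end;
  }
  return ranges;
}
```

For example, `chunk_aligned_ranges(3, 10, 4)` yields the ranges [3, 4), [4, 8) and [8, 10), so each piece touches exactly one chunk of size 4. Reading or writing one chunk at a time in this way avoids fetching or rewriting the same chunk more than once.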

Nice-to-have features of Tensorstore:

  1. A companion Python library.
  2. Transactions, used to stage groups of modifications.
  3. Caching.
  4. Progress monitoring.
  5. Abstraction over the Tensorstore "driver", so that generic array data can be read from buckets.

(dependency) Patrick Boettcher's JSON schema validator

We use the json-schema-validator library to validate MDIO schemas against the schema definitions.

This library is used under the MIT license.

Authors