CliMA / ClimaLand.jl

Clima's Land Model
Apache License 2.0

SDI: Workflow and technology to manage input data (OKR 2.3.7, 2.3.9, 3.4.1) #467

Open Sbozzolo opened 8 months ago

Sbozzolo commented 8 months ago

The Climate Modeling Alliance: Workflow and technology to manage input data (OKR 2.3.7, 2.3.9, 3.4.1)

Software Design Issue 📜

Purpose

This SDI describes the challenges of managing input data and proposes a solution. It expands the policy document drafted by @simonbyrne addressing the issues with large datasets and parallel runs.

Science and technical goals

The technical solution

Background

The Julia language supports artifacts, pieces of non-source-code data that have to be distributed with a package. In Julia, artifacts are primarily used to distribute binary components that are needed to run a package (e.g., compiled libraries). Using the "official" data-distribution stream for Julia comes with several benefits that automatically check some of the boxes above, such as integrity verification and on-demand download.

Julia artifacts are tarballs declaratively defined in a package's Artifacts.toml file. They are accessed via the @artifact_str macro, which ensures that the artifact is present when it needs to be used. This might mean downloading the artifact upon instantiation of the package (for eager artifacts) or on first use (for lazy artifacts). Users do not have to worry about whether artifacts are present or where they live, as everything is handled behind the scenes.

The Artifacts.toml can contain additional fields that can be queried through the Pkg.Artifacts API. For example, we can add an acknowledgment field to automatically generate a report of what has to be acknowledged for the datasets used in a given simulation. We do not need to worry about this too much now, but it is good to know that we are working with an extensible framework.
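To illustrate, the (real) `Pkg.Artifacts.artifact_meta` function already returns an artifact's full TOML entry, including custom fields, so extracting such metadata could look like this (a minimal sketch; the `acknowledgment` field name is our convention, not part of Pkg):

```julia
using Pkg.Artifacts

"Return the `acknowledgment` string for artifact `name`, or `nothing` if absent."
function acknowledgment_of(name::AbstractString, artifacts_toml::AbstractString)
    isfile(artifacts_toml) || return nothing
    # `artifact_meta` returns the Dict for the entry, or `nothing` if not listed
    meta = artifact_meta(name, artifacts_toml)
    return meta === nothing ? nothing : get(meta, "acknowledgment", nothing)
end
```

A simulation could loop over the artifacts it used and print the collected acknowledgments at the end of a run.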

Building upon Artifacts

While Julia artifacts do not have all the capabilities we need, they provide a good starting point to build our solution on. The three main shortcomings of Julia artifacts are that they do not handle large files, they do not handle MPI runs, and they cannot record additional runtime information (e.g., whether they are accessed). Pkg.Artifacts is modular enough that we can implement the missing features.

This SDI proposes to develop a new (small) package ClimaArtifacts. ClimaArtifacts mainly provides two functionalities:

Large datasets (and compute clusters/CI)

Large datasets always require special treatment (where "large" is yet to be quantified). For instance, we cannot assume that any given folder has room for 1 TB of data. Instead, we need to let the user take care of obtaining and storing that data where they see fit. Luckily, this can be handled directly with Pkg.Artifacts.

The Julia artifact system provides a reasonably good (albeit a little clunky to set up) way to connect external data with the artifact system. Julia artifacts were developed to distribute system libraries/binaries, but sometimes it is best to use the native library/binary on the system. To support this, Julia artifacts can be overridden with an Overrides.toml file that specifies the path of a given artifact.

Our packages will keep track of the required artifacts in the Artifacts.toml, but those that are too large will be marked as "not-downloadable" (by removing the .download section).

Let us look at an example

Artifacts.toml

```toml
[rrtmgp]
git-tree-sha1 = "43563e7631a7eafae1f9f8d9d332e3de44ad7239"

    [[rrtmgp.download]]
    url = "https://data.caltech.edu/records/ppv8a-4q131/files/rrtmgp-lookup-data.tar.gz"
    sha256 = "e65d2f13f2085f2c279830e863292312a72930fee5ba3c792b14c33ce5c5cc58"

[era5]
git-tree-sha1 = "43563e7631a7eafae1f9f8d9d332e3de44ad7210"
```

The era5 artifact here cannot be downloaded because it does not have a .download section; trying to use that dataset will yield an error. To use era5, users will have to add an Overrides.toml to their Julia depot:

```toml
43563e7631a7eafae1f9f8d9d332e3de44ad7210 = "/groups/esm/data/era5"
```

The clunkiness in setting this up comes from the fact that everything is done with hashes. The benefit of using hashes is that multiple packages that use the same dataset will share the same files instead of requiring copies. One related question and potential problem has not been investigated yet: Pkg.Artifacts verifies the integrity of artifacts by computing and matching a cryptographic hash, and it remains to be seen how this behaves when the artifact is a 1 TB folder. We suspect that the override system will automatically take care of this.

The Caltech cluster

On the Caltech cluster, several repositories use shared depots, so Overrides files are shared across users. Once set up, everything should just work.

At this point, we can use /groups/esm to store the data we need for runs. It has 30 TB of total space (24 TB currently used) and it is fast enough for our current needs. /groups/esm does not have purge policies.

Distributed computing is difficult

While handling large data files can be dealt with relatively easily, artifacts for parallel runs are much more complex. When we run our codes with MPI, multiple copies of the same executable run at the same time. This means that if a piece of data is not available and has to be downloaded, multiple processes will try to fetch the data and store it in the same location. This is problematic for at least two reasons: (1) we don't want hundreds of identical network requests at the same time, and (2) we don't want hundreds of processes trying to write the same file in the same place. Avoiding (1) is good etiquette: it might otherwise degrade performance for everyone on the system (and, for slow servers such as those that often host scientific data, for everyone using the server). On a good day, (2) also leads to performance degradation, but more generally it leads to race conditions and kills the run.

The ideal solution to this problem is that only the root process downloads the data. In turn, this requires extending the artifact system to be MPI-aware. This is the key reason why we need a new module.
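The intended pattern can be sketched with MPI.jl (a minimal sketch; `download_artifact_if_missing` is a hypothetical stand-in for the Pkg.Artifacts download machinery):

```julia
using MPI

# Only the root rank touches the network; all other ranks wait at the
# barrier until the data is guaranteed to be on the shared filesystem.
function mpi_safe_download(artifact_name::AbstractString)
    comm = MPI.COMM_WORLD
    if MPI.Comm_rank(comm) == 0
        download_artifact_if_missing(artifact_name)  # hypothetical helper
    end
    MPI.Barrier(comm)
end
```

The barrier is essential: without it, non-root ranks could race ahead and try to read data that is still being downloaded.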

Some of the implementation details

The core logic of how artifacts are handled in Julia can be found here. The key is that @artifact_str does some basic processing and calls the (private) _artifact_str function for most of the heavy lifting. ClimaArtifacts will provide a new @artifact_str that does additional processing before invoking _artifact_str. @clima_artifact_str will be called with a small additional payload that contains the ClimaComms context (plus any additional structs that might be needed for tracking). When the context is not MPI, @clima_artifact_str will just call _artifact_str. When the context is MPI, @clima_artifact_str will first have the root process download the required artifact, and then call _artifact_str, ensuring that no additional download is performed.
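In function form, the wrapper might look like the following (a conceptual sketch, not the final API: the real implementation has to be a macro so Julia can locate the caller's Artifacts.toml, and `ensure_artifact_downloaded` is a hypothetical helper):

```julia
using ClimaComms

function clima_artifact(name, context)
    if context isa ClimaComms.MPICommsContext
        # Root fetches the tarball; the other ranks wait until it is on disk.
        ClimaComms.iamroot(context) && ensure_artifact_downloaded(name)
        ClimaComms.barrier(context)
    end
    # By now the artifact is local, so delegating to Pkg's private
    # _artifact_str (signature elided) cannot trigger a second download.
end
```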

Public and archival data storage

CaltechDATA offers free archival storage for datasets below 500 GB (and paid storage for larger datasets). We already have a community there. CaltechDATA seems to use Amazon S3 as its backend, and my first benchmarks indicate reasonably fast download speeds.

Those that wish to upload their data to CaltechDATA need to:

  1. Curate their artifact
  2. Make their profile visible on CaltechDATA from the settings
  3. Get in touch with me or @szy21 to be added to the community
  4. Accept the invitation to be part of the clima community

Process to add a new artifact

This section will be updated as the details become clearer

  1. Create a project and script for creating the data in the repository
    • e.g. artifacts/<artifact-name>/create_artifact.jl
    • Cleaning/preprocessing into a convenient format
    • Pkg.Artifacts provides some helpful functions (ClimaArtifacts could automate some of this work)
  2. Check in the Project.toml and Manifest.toml
  3. Add a README.md
    • What the artifact contains (files, links to sources)
    • Who created it, relevant citations
    • Any preprocessing performed
    • Include this in the archive
  4. Upload and update the Artifacts.toml file.
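Step 1 could be sketched with the stock Pkg.Artifacts API along these lines (a sketch of a hypothetical artifacts/<artifact-name>/create_artifact.jl; the dataset name and file contents are placeholders for the actual cleaning/preprocessing):

```julia
using Pkg.Artifacts

artifacts_toml = joinpath(@__DIR__, "Artifacts.toml")  # in a package, the root Artifacts.toml

# Populate a content-addressed directory; `create_artifact` returns its tree hash.
hash = create_artifact() do dir
    # A real script would download the raw data and preprocess it into `dir`.
    write(joinpath(dir, "data.csv"), "lat,lon,value\n0.0,0.0,1.0\n")
end

# Record the entry in Artifacts.toml. For datasets small enough to host,
# `download_info` (URL + sha256 tuples) would also be passed here.
bind_artifact!(artifacts_toml, "my_dataset", hash; force = true)
```

After uploading the tarball, the resulting Artifacts.toml entry is what gets checked into the package.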

Cost/Benefits/Risks

The main risk associated with the proposed solution is that we are adding a layer between @artifact_str and _artifact_str. Future versions of Julia might change how this is implemented, which would require some maintenance work.

The key benefit provided is that datasets will be handled in a unified (and for the most part transparent) way. In the process, we will document and list all the datasets that are being used. This will also fix issues such as https://github.com/CliMA/ClimaLSM.jl/issues/423.

With this infrastructure, we will also be able to provide precomputed artifacts (e.g., topo maps already regridded to standard resolutions) that will speed up startup time.

People and Personnel

Task Breakdown And Schedule

Next actions:

CC

@tapios @simonbyrne @cmbengue

tapios commented 8 months ago

Thank you for the thorough and detailed plan! It looks good.

Could you please add some rough time estimates for the tasks (especially the first bullet)?

Sbozzolo commented 8 months ago

> Thank you for the thorough and detailed plan! It looks good.
>
> Could you please add some rough time estimates for the tasks (especially the first bullet)?

If no unexpected problems come up, the first bullet point will take one day of work for a draft implementation. We might have something by the end of this week.

What will take much longer is collecting the census of all the artifacts alongside their metadata (and the script needed to generate the datasets). However, this can be done in a perfectly incremental and non-disruptive way.

tapios commented 8 months ago

Sounds great!

Sbozzolo commented 6 months ago

This is implemented in ClimaArtifacts and ClimaUtilities.

As a test, I uploaded the fluxnet2015 data (41 GB) to the cluster. Any user that uses climacommon can access the data with

```julia
using Artifacts
fluxnet_data_folder = artifact"fluxnet2015"
```

as long as fluxnet2015 is defined in the Artifacts.toml of their package. Accessing artifacts this way has zero runtime cost. However, the preferred way to access artifacts is

```julia
using ClimaUtilities.ClimaArtifacts
fluxnet_data_folder = @clima_artifact("fluxnet2015"[, context])
```

When provided with the ClimaComms context, this is MPI-safe. It has nearly zero runtime cost, and it also keeps track of what artifacts are accessed in a given simulation.

Among other things, ClimaArtifacts:

A user/developer guide is provided in the README. ClimaArtifacts also provides an interactive guided procedure to create new artifacts.

The next step is to have someone other than me try to add and use a new artifact following the provided documentation.