CliMA / ClimaLand.jl

Clima's Land Model
Apache License 2.0

SDI: Workflow and technology to manage input data (OKR 2.3.7, 2.3.9, 3.4.1) #467

Open Sbozzolo opened 8 months ago

Sbozzolo commented 8 months ago

The Climate Modeling Alliance: Workflow and technology to manage input data (OKR 2.3.7, 2.3.9, 3.4.1)

Software Design Issue 📜

Purpose

This SDI describes the challenges of managing input data and proposes a solution. It expands the policy document drafted by @simonbyrne addressing the issues with large datasets and parallel runs.

Science and technical goals

The technical solution

Background

The Julia language supports artifacts, pieces of non-source-code data that have to be distributed with a package. In Julia, artifacts are primarily used to distribute binary components that are needed to run a package (e.g., compiled libraries). Using the "official" data-distribution stream for Julia comes with several benefits that automatically check some of the boxes above, such as integrity verification and on-demand download.

Julia artifacts are tarballs declaratively defined in a package's Artifacts.toml file. They are accessed via the @artifact_str macro, which ensures that the artifact is present when it needs to be used. This might mean downloading the artifact upon instantiation of the package (for eager artifacts) or on first use (for lazy artifacts). Users do not have to worry about whether artifacts are present or where they live, as everything is handled behind the scenes.

The Artifacts.toml can contain additional fields that can be queried through the Pkg.Artifacts API. For example, we can add an acknowledgment field to automatically generate a report of what has to be acknowledged for the datasets used in a given simulation. We do not need to worry about this too much now, but it is good to know that we are working with an extensible framework.
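To illustrate, the (real) `Pkg.Artifacts.artifact_meta` function already returns an artifact's full TOML entry, including custom fields, so extracting such metadata could look like this (a minimal sketch; the `acknowledgment` field name is our convention, not part of Pkg):

```julia
using Pkg.Artifacts

"Return the `acknowledgment` string for artifact `name`, or `nothing` if absent."
function acknowledgment_of(name::AbstractString, artifacts_toml::AbstractString)
    isfile(artifacts_toml) || return nothing
    # `artifact_meta` returns the Dict for the entry, or `nothing` if not listed
    meta = artifact_meta(name, artifacts_toml)
    return meta === nothing ? nothing : get(meta, "acknowledgment", nothing)
end
```

A simulation could loop over the artifacts it used and print the collected acknowledgments at the end of a run.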

Building upon Artifacts

While Julia artifacts do not have all the capabilities we need, they provide a good starting point to build our solution on. The three main shortcomings of Julia artifacts are that they do not handle large files, they do not handle MPI runs, and they cannot record additional runtime information (e.g., whether they are accessed). Pkg.Artifacts is modular enough that we can implement the missing features.

This SDI proposes to develop a new (small) package ClimaArtifacts. ClimaArtifacts mainly provides two functionalities:

Large datasets (and compute clusters/CI)

Large datasets always require special treatment (where "large" is yet to be quantified). For instance, we cannot assume that any given folder has room for 1 TB of data. Instead, we need to let the user take care of obtaining and storing that data where they see fit. Luckily, this can be handled directly with Pkg.Artifacts.

The Julia artifact system provides a reasonably good (albeit a little clunky to set up) way to connect external data with the artifact system. Julia artifacts were developed to distribute system libraries/binaries, but sometimes it is best to use the native library/binary on the system. To support this, Julia artifacts can be overridden with an Overrides.toml file that specifies the path of a given artifact.

Our packages will keep track of the required artifacts in the Artifacts.toml, but those that are too large will be marked as "not-downloadable" (by removing the .download section).

Let us look at an example

Artifacts.toml

```toml
[rrtmgp]
git-tree-sha1 = "43563e7631a7eafae1f9f8d9d332e3de44ad7239"

    [[rrtmgp.download]]
    url = "https://data.caltech.edu/records/ppv8a-4q131/files/rrtmgp-lookup-data.tar.gz"
    sha256 = "e65d2f13f2085f2c279830e863292312a72930fee5ba3c792b14c33ce5c5cc58"

[era5]
git-tree-sha1 = "43563e7631a7eafae1f9f8d9d332e3de44ad7210"
```

The era5 artifact here cannot be downloaded because it does not have a .download section; trying to use that dataset will yield an error. To use era5, users will have to add an Overrides.toml to their Julia depot:

```toml
43563e7631a7eafae1f9f8d9d332e3de44ad7210 = "/groups/esm/data/era5"
```

The clunkiness in setting this up comes from the fact that everything is done with hashes. The benefit of using hashes is that multiple packages that use the same dataset will share the same files instead of requiring copies. One related question and potential problem has not been investigated yet: Pkg.Artifacts verifies the integrity of artifacts by computing and matching a cryptographic hash, and it remains to be seen how this behaves when the artifact is a 1 TB folder. We suspect that the override system will automatically take care of this.

The Caltech cluster

On the Caltech cluster, several repositories use shared depots, so Overrides files are shared across users. Once set up, everything should just work.

At this point, we can use /groups/esm to store the data we need for runs. It has 30 TB of total space (24 TB currently used) and it is fast enough for our current needs. /groups/esm does not have purge policies.

Distributed computing is difficult

While handling large data files can be dealt with relatively easily, artifacts for parallel runs are much more complex. When we run our codes with MPI, multiple copies of the same executable run at the same time. This means that if a piece of data is not available and has to be downloaded, multiple processes will try to fetch the data and store it in the same location. This is problematic for at least two reasons: (1) we don't want hundreds of identical network requests at the same time, and (2) we don't want hundreds of processes trying to write the same file in the same place. Avoiding (1) is good etiquette: it might otherwise degrade performance for everyone on the system (and, for slow servers such as those that often host scientific data, for everyone using the server). On a good day, (2) also leads to performance degradation, but more generally it leads to race conditions and kills the run.

The ideal solution to this problem is that only the root process downloads the data. In turn, this requires extending the artifact system to be MPI-aware. This is the key reason why we need a new module.
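The intended pattern can be sketched with MPI.jl (a minimal sketch; `download_artifact_if_missing` is a hypothetical stand-in for the Pkg.Artifacts download machinery):

```julia
using MPI

# Only the root rank touches the network; all other ranks wait at the
# barrier until the data is guaranteed to be on the shared filesystem.
function mpi_safe_download(artifact_name::AbstractString)
    comm = MPI.COMM_WORLD
    if MPI.Comm_rank(comm) == 0
        download_artifact_if_missing(artifact_name)  # hypothetical helper
    end
    MPI.Barrier(comm)
end
```

The barrier is essential: without it, non-root ranks could race ahead and try to read data that is still being downloaded.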

Some of the implementation details

The core logic of how artifacts are handled in Julia can be found here. The key is that @artifact_str does some basic processing and calls the (private) _artifact_str function for most of the heavy lifting. ClimaArtifacts will provide a new @artifact_str that does additional processing before invoking _artifact_str. @clima_artifact_str will be called with a small additional payload that contains the ClimaComms context (plus any additional structs that might be needed for tracking). When the context is not MPI, @clima_artifact_str will just call _artifact_str. When the context is MPI, @clima_artifact_str will first have the root process download the required artifact, and then call _artifact_str, ensuring that no additional download is performed.
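In function form, the wrapper might look like the following (a conceptual sketch, not the final API: the real implementation has to be a macro so Julia can locate the caller's Artifacts.toml, and `ensure_artifact_downloaded` is a hypothetical helper):

```julia
using ClimaComms

function clima_artifact(name, context)
    if context isa ClimaComms.MPICommsContext
        # Root fetches the tarball; the other ranks wait until it is on disk.
        ClimaComms.iamroot(context) && ensure_artifact_downloaded(name)
        ClimaComms.barrier(context)
    end
    # By now the artifact is local, so delegating to Pkg's private
    # _artifact_str (signature elided) cannot trigger a second download.
end
```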

Public and archival data storage

CaltechDATA offers free archival storage for datasets below 500 GB (and paid storage for larger datasets). We already have a community there. CaltechDATA seems to use Amazon S3 as its backend, and my first benchmarks indicate reasonably fast download speeds.

Those that wish to upload their data to CaltechDATA need to:

  1. Curate their artifact
  2. Make their profile visible on CaltechDATA from the settings
  3. Get in touch with me or @szy21 to be added to the community
  4. Accept the invitation to be part of the clima community

Process to add a new artifact

This section will be updated as the details become clearer

  1. Create a project and script for creating the data in the repository
    • e.g. artifacts/<artifact-name>/create_artifact.jl
    • Cleaning/preprocessing into a convenient format
    • Pkg.Artifacts provides some helpful functions (ClimaArtifacts could automate some of this work)
  2. Check in the Project.toml and Manifest.toml
  3. Add a README.md
    • What the artifact contains (files, links to sources)
    • Who created it, relevant citations
    • Any preprocessing performed
    • Include this in the archive
  4. Upload and update the Artifacts.toml file.
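Step 1 could be sketched with the stock Pkg.Artifacts API along these lines (a sketch of a hypothetical artifacts/<artifact-name>/create_artifact.jl; the dataset name and file contents are placeholders for the actual cleaning/preprocessing):

```julia
using Pkg.Artifacts

artifacts_toml = joinpath(@__DIR__, "Artifacts.toml")  # in a package, the root Artifacts.toml

# Populate a content-addressed directory; `create_artifact` returns its tree hash.
hash = create_artifact() do dir
    # A real script would download the raw data and preprocess it into `dir`.
    write(joinpath(dir, "data.csv"), "lat,lon,value\n0.0,0.0,1.0\n")
end

# Record the entry in Artifacts.toml. For datasets small enough to host,
# `download_info` (URL + sha256 tuples) would also be passed here.
bind_artifact!(artifacts_toml, "my_dataset", hash; force = true)
```

After uploading the tarball, the resulting Artifacts.toml entry is what gets checked into the package.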

Cost/Benefits/Risks

The main risk associated with the proposed solution is that we are adding a layer between @artifact_str and _artifact_str. Future versions of Julia might change how this is implemented, which would require some maintenance work.

The key benefit provided is that datasets will be handled in a unified (and for the most part transparent) way. In the process, we will document and list all the datasets that are being used. This will also fix issues such as https://github.com/CliMA/ClimaLSM.jl/issues/423.

With this infrastructure, we will also be able to provide precomputed artifacts (e.g., topo maps already regridded to standard resolutions) that will speed up startup time.

People and Personnel

Task Breakdown And Schedule

Next actions:

CC

@tapios @simonbyrne @cmbengue

tapios commented 8 months ago

Thank you for the thorough and detailed plan! It looks good.

Could you please add some rough time estimates for the tasks (especially the first bullet)?

Sbozzolo commented 8 months ago

> Thank you for the thorough and detailed plan! It looks good.
>
> Could you please add some rough time estimates for the tasks (especially the first bullet)?

If no unexpected problems come up, the first bullet point will take one day of work for a draft implementation. We might have something by the end of this week.

What will take much longer is collecting the census of all the artifacts alongside their metadata (and the script needed to generate the datasets). However, this can be done in a perfectly incremental and non-disruptive way.

tapios commented 8 months ago

Sounds great!

Sbozzolo commented 6 months ago

This is implemented in ClimaArtifacts and ClimaUtilities.

As a test, I uploaded the fluxnet2015 data (41 GB) to the cluster. Any user that uses climacommon can access the data with

```julia
using Artifacts
fluxnet_data_folder = artifact"fluxnet2015"
```

as long as fluxnet2015 is defined in the Artifacts.toml of their package. Accessing artifacts this way has zero runtime cost. However, the preferred way to access artifacts is

```julia
using ClimaUtilities.ClimaArtifacts
fluxnet_data_folder = @clima_artifact("fluxnet2015"[, context])
```

When provided with the ClimaComms context, this is MPI-safe. It has nearly zero runtime cost, and it also keeps track of what artifacts are accessed in a given simulation.

Among other things, ClimaArtifacts:

A user/developer guide is provided in the README. ClimaArtifacts also provides an interactive guided procedure to create new artifacts.

The next step is to have someone other than me try to add and use a new artifact following the provided documentation.