Sbozzolo opened 9 months ago
Thank you for the thorough and detailed plan! It looks good.
Could you please add some rough time estimates for the tasks (especially the first bullet)?
If no unexpected problems come up, the first bullet point will take one day of work for a draft implementation. We might have something by the end of this week.
What will take much longer is compiling the census of all the artifacts alongside their metadata (and the scripts needed to generate the datasets). However, this can be done in a perfectly incremental and non-disruptive way.
Sounds great!
This is implemented in ClimaArtifacts and ClimaUtilities.
As a test, I uploaded the fluxnet2015 data (41 GB) to the cluster. Any user that uses climacommon can access the data with

```julia
using Artifacts
fluxnet_data_folder = artifact"fluxnet2015"
```

as long as fluxnet2015 is defined in the `Artifacts.toml` of their package. Accessing artifacts this way has zero runtime cost. However, the preferred way to access artifacts is

```julia
using ClimaUtilities.ClimaArtifacts
fluxnet_data_folder = @clima_artifact("fluxnet2015"[, context])
```

When provided with the `ClimaComms` context, this is MPI-safe. This has nearly zero runtime cost, and it also keeps track of which artifacts are being accessed in a given simulation.
Among other things, `ClimaArtifacts` supersedes `ArtifactWrappers`, addressing its known issues (including non-determinism and corruption). A user/developer guide is provided in the README. `ClimaArtifacts` also provides an interactive guided procedure to create new artifacts.
The next step is to have someone other than me try to add and use a new artifact following the provided documentation.
The Climate Modeling Alliance: Workflow and technology to manage input data (OKR 2.3.7, 2.3.9, 3.4.1)
Software Design Issue 📜
Purpose
This SDI describes the challenges of managing input data and proposes a solution. It expands the policy document drafted by @simonbyrne, addressing the issues with large datasets and parallel runs.
Science and technical goals
The technical solution
Background
The Julia language supports artifacts: pieces of non-source-code data that have to be distributed with a package. In Julia, artifacts are primarily used to distribute binary components that are needed to run a package (e.g., compiled libraries). Using the "official" data-distribution stream for Julia comes with several benefits that automatically check some of the boxes above, such as integrity verification and on-demand downloads.
Julia artifacts are tarballs declaratively defined in a package's `Artifacts.toml` file. They are accessed via the `@artifact_str` macro, which ensures that the artifact is present when it needs to be used. This might require downloading the artifact upon instantiation of the package (for greedy artifacts), or downloading it when needed (for lazy artifacts). Users do not have to worry about the artifacts being present or their location, as everything is handled behind the scenes.

The `Artifacts.toml` can contain additional fields that can be queried through the `Pkg.Artifacts` APIs. For example, we can add an `acknowledgment` field to automatically generate a report on what has to be acknowledged for the datasets used in a given simulation. We do not need to worry about this too much now, but it is good to know that we are working with an extensible framework.

Building upon Artifacts
While Julia artifacts do not have all the capabilities we need, they provide a good starting point on which to build our solution. The three main shortcomings of Julia artifacts are that they do not handle large files, they are not aware of MPI runs, and they cannot process additional runtime information (e.g., whether they are accessed).
`Pkg.Artifacts` is modular enough that we can implement the missing features.

This SDI proposes to develop a new (small) package, `ClimaArtifacts`. `ClimaArtifacts` mainly provides two functionalities: `@clima_artifact_str`, which fills the gaps in `@artifact_str` by handling parallel runs, artifact tagging, and all the complexity that we need but is not captured by `Pkg.Artifacts`; and tooling to create new artifacts.
Large datasets (and compute clusters/CI)
Large datasets always require special treatment (where "large" is yet to be quantified). For instance, we cannot assume that any one folder is large enough to contain 1 TB of data. Instead, we need to let the user take care of obtaining and storing that data where they see fit. Luckily, this can be handled directly with `Pkg.Artifacts`.

The Julia artifact system provides a reasonably good way to connect external data with the artifact system (albeit a little clunky to set up). Julia artifacts were developed to distribute system libraries/binaries, but sometimes it is best to use the native library/binary on the system. To achieve this, Julia artifacts can be overridden with an `Overrides.toml` file that specifies the path of a given artifact.

Our packages will keep track of the required artifacts in the `Artifacts.toml`, but those that are too large will be marked as "not-downloadable" (by removing the `download` section).

Let us look at an example `Artifacts.toml`:
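For illustration, a sketch of such a file (all names, hashes, and URLs below are placeholders, not real values):

```toml
# Placeholder values for illustration only.

# A regular artifact: it has a download section, so Julia can fetch it.
[socrates]
git-tree-sha1 = "1111111111111111111111111111111111111111"
lazy = true

    [[socrates.download]]
    url = "https://example.org/socrates.tar.gz"
    sha256 = "2222222222222222222222222222222222222222222222222222222222222222"

# A large artifact: no download section, so it cannot be fetched automatically.
[era5]
git-tree-sha1 = "3333333333333333333333333333333333333333"
```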
The `era5` artifact here cannot be downloaded because it does not have a `download` section. Trying to use that dataset will yield an error. To use `era5`, users will have to add an `Overrides.toml` in their Julia depot.

The clunkiness in setting this up comes from the fact that everything is done with hashes. The benefit of using hashes is that multiple packages that use the same dataset will share the same files instead of requiring copies. A question and potential problem that has not been investigated yet is related to this.
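As a sketch of the override mechanism (the hash and path below are placeholders), an `Overrides.toml` in the depot's `artifacts/` directory maps an artifact's content hash to a local directory:

```toml
# ~/.julia/artifacts/Overrides.toml (placeholder values)
# Maps an artifact's git-tree-sha1 to a local folder that already contains the data.
3333333333333333333333333333333333333333 = "/groups/esm/era5"
```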
`Pkg.Artifacts` verifies the integrity of artifacts by computing and matching a cryptographic hash. It remains to be seen how this works when the artifact is a 1 TB folder. We suspect that the override system will automatically take care of this.

The Caltech cluster
On the Caltech cluster, several repositories use shared depots, so `Overrides` files are shared across users. Once set up, everything should just work.

At this point, we can use `/groups/esm` to store the data we need for runs. It has 30 TB of total space (24 TB currently used) and it is fast enough for our current needs. `/groups/esm` does not have purge policies.

Distributed computing is difficult
While large data files can be handled with relative ease, artifacts for parallel runs are much more complex. When we run our code with MPI, multiple copies of the same executable run at the same time. This means that if a piece of data is not available and has to be downloaded, multiple processes will try to fetch the data and store it in the same location. This is problematic for at least two reasons: (1) we don't want hundreds of identical network requests at the same time, and (2) we don't want hundreds of processes trying to write the same file in the same place. Number (1) is good etiquette; violating it might degrade performance for everyone on the system (and, for slow servers such as those that host scientific data, for everyone using the server). On a good day, number (2) also leads to performance degradation, but more generally it leads to race conditions and kills the run.
The ideal solution to this problem is that only the root process downloads the data. In turn, this requires extending the artifact system to be MPI-aware. This is the key reason why we need a new module.
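The root-only download could be sketched as follows (a rough sketch, not the actual implementation; the `ClimaComms` function names are assumptions):

```julia
# Rough sketch of MPI-aware artifact access.
# `ClimaComms.iamroot` and `ClimaComms.barrier` are assumed API names.
using ClimaComms

function download_artifact_mpi_safe(name, context, download_fn)
    if ClimaComms.iamroot(context)
        download_fn(name)        # only the root process fetches the data
    end
    ClimaComms.barrier(context)  # other processes wait until the data exists
    return nothing
end
```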
Some of the implementation details
The core logic of how artifacts are handled in Julia can be found here. The key is that `@artifact_str` does some basic processing and calls the (private) `_artifact_str` function for most of the heavy lifting. `ClimaArtifacts` will provide a new `@artifact_str` that does additional processing before invoking `_artifact_str`.

`@clima_artifact_str` will be called with a small additional payload that contains the `ClimaComms` context (plus any additional structs that might be needed for tracking). When the context is not MPI, `@clima_artifact_str` will just call `_artifact_str`. When the context is MPI, `@clima_artifact_str` will first have the root process download the required artifact, and then call `_artifact_str`, ensuring that no additional download is performed.

Public and archival data storage
CaltechDATA offers free archival storage for datasets below 500 GB (and paid storage for larger datasets). We already have a community there. CaltechDATA seems to use Amazon S3 as a backend, and my first benchmarks indicate reasonably fast download speeds.
Those that wish to upload their data to CaltechDATA need to generate a token for CaltechDATA from the settings.

Process to add a new artifact
This section will be updated as the details become more clear.

1. Write a script `artifacts/<artifact-name>/create_artifact.jl` that generates the dataset (`ClimaArtifacts` could automate some of this work).
2. Package and upload the data with `create_artifact`/`archive_artifact`.
3. Add the new entry to the `Artifacts.toml` file.

Cost/Benefits/Risks
The main risk associated with the proposed solution is that we are adding a layer between `@artifact_str` and `_artifact_str`. Future versions of Julia might change how this is implemented, requiring some maintenance work.

The key benefit is that datasets will be handled in a unified (and for the most part transparent) way. In the process, we will document and list all the datasets that are being used. This will also fix issues such as https://github.com/CliMA/ClimaLSM.jl/issues/423.
With this infrastructure, we will also be able to provide precomputed artifacts (e.g., topo maps already regridded to standard resolutions) that will speed up startup time.
People and Personnel
Task Breakdown And Schedule
Next actions:
CC
@tapios @simonbyrne @cmbengue