This repository contains a computational containerized environment that automatically creates a workflow execution's record trail and invisibly attaches it to the workflow's output, enabling data traceability and results explainability.
Motivation: To trust findings in computational science, scientists need workflows that trace the data provenance and support results explainability. As workflows become more complex, tracing data provenance and explaining results become harder to achieve. Our solution transforms existing container technology, includes tools for automatically annotating provenance metadata, and allows effective movement of data and metadata across the workflow execution.
Environment overview: Our environment includes two stages: the execution of the workflow while automatically collecting metadata, and the analysis of the metadata. For the first stage, our environment decouples data and applications of traditionally tightly coupled workflows and encapsulates them into individual fine-grained containers. We augment both data and application containers to expose provenance metadata and to move data across the containerized workflow effectively. For the second stage, we provide an interface for visualizing and studying the metadata that scientists can use to understand the data lineage and the computational methods.
Use case: We demonstrate the capabilities of our environment by studying an earth science workflow. Using a suite of machine learning modeling techniques, this workflow downscales soil moisture values from 27 km resolution satellite data to the fine-grained 10 m resolution needed for practical use in policymaking and precision agriculture. By running the workflow in our environment, the end user can identify why the accuracy of predicted soil moisture values differs across resolutions of the input data, and can link different results to the machine learning methods used during the soil moisture downscaling, all without needing to know how the workflow is designed and implemented.
There are two main components to install to run our environment: 1) Apptainer and the plugins for the execution of the workflow and automatic collection of the metadata; and 2) a Jupyter Notebook for the interface that enables the visualization and study of the metadata to understand the data lineage and the computational methods.
Install Apptainer by following these instructions. A version greater than 3.5 is required to enable the plugins and zero-copy data transfer between containers.
Clone our repository
git clone --recurse-submodules https://github.com/TauferLab/ContainerizedEnv
Install the metadata plugin
apptainer plugin compile plugin/.
sudo apptainer plugin install plugin/plugin.sif
The interface is a Jupyter notebook that has the following required dependencies:
We use Anaconda to install the software stack. If you do not have Anaconda installed, you can follow the instructions here to install it.
Once you have Anaconda, you can create the environment. To this end, make sure to change the prefix in install/env_conda.yml to the location of Anaconda on your local machine (e.g., /opt/anaconda3/, /home/opt/anaconda3/). You can use whereis conda to check the path.
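The prefix change described above can also be scripted. The following is a minimal sketch, assuming that install/env_conda.yml contains a single top-level line starting with prefix: (the usual layout of an exported conda environment file). The helper name set_conda_prefix is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of the repository): rewrite the "prefix:"
# line of a conda environment file.
#   $1: path to the environment file
#   $2: new prefix, e.g. /opt/anaconda3/
# Assumes the prefix contains no "|" character, since "|" is used as the
# sed delimiter. A backup of the original file is kept as <file>.bak.
set_conda_prefix() {
    sed -i.bak "s|^prefix:.*|prefix: $2|" "$1"
}

# Usage (commented out so that sourcing this snippet has no side effects):
# set_conda_prefix install/env_conda.yml "$(conda info --base)"
```

Here conda info --base prints the root of the active conda installation, which can serve as an alternative to whereis conda. Review the resulting file before creating the environment.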
Run the following commands on your local machine:
conda env create -f install/env_conda.yml
conda activate tric
Once you have your environment installed, you can run jupyter notebook and open interface/metadata-interface.ipynb.
Pick a sample workflow
cd sample_workflows/knn
cd sample_workflows/sbm
cd sample_workflows/rf
Initialize the workflow containers
apptainer workflow --create knn_workflow.json
apptainer workflow --create sbm_workflow.json
apptainer workflow --create rf_workflow.json
apptainer workflow --create your_workflow_name.json
Run the workflow
apptainer workflow --run knn_workflow.json
apptainer workflow --run sbm_workflow.json
apptainer workflow --run rf_workflow.json
apptainer workflow --run your_workflow_name.json
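The create and run steps above can be combined for a sample workflow. The following is a minimal sketch, assuming it is called from the repository root and that the metadata plugin is already installed. The function name run_sample_workflow and the DRY_RUN variable are hypothetical conveniences for illustration; only the apptainer workflow commands come from the steps above.

```shell
# Hypothetical wrapper (not part of the repository): create and then run one
# of the sample workflows (knn, sbm, or rf) from the repository root.
# With DRY_RUN=1 the commands are printed instead of executed.
# Note: name, dir, and spec are plain (global) shell variables.
run_sample_workflow() {
    name="$1"
    dir="sample_workflows/${name}"
    spec="${name}_workflow.json"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "cd ${dir}"
        echo "apptainer workflow --create ${spec}"
        echo "apptainer workflow --run ${spec}"
    else
        # Run in a subshell so the caller's working directory is unchanged;
        # the run step is skipped if the create step fails.
        ( cd "${dir}" \
            && apptainer workflow --create "${spec}" \
            && apptainer workflow --run "${spec}" )
    fi
}

# Usage (commented out so that sourcing this snippet has no side effects):
# for wf in knn sbm rf; do run_sample_workflow "$wf" || break; done
```

Setting DRY_RUN=1 first is a cheap way to check the paths and file names before any containers are created.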
Explore the metadata using the metadata interface
Navigate to your desired metadata directory
cd sample_workflows/knn/metadata
cd sample_workflows/sbm/metadata
cd sample_workflows/rf/metadata
Use the metadata interface
Run jupyter notebook and select the interface/metadata-interface.ipynb notebook.
Src_ContainerizedEnv/sample_workflows/knn/metadata/
We acknowledge the support of Sandia National Laboratories; the National Science Foundation through the awards 1841758, 2103845, 2138811, and 1941443; and IBM through a Shared University Research Award. The authors acknowledge the Singularity team, especially Cedric Clerget and Ian Kaneshiro, for their support. This work was partially developed and tested using Jetstream XSEDE computing resources.
Developers:
Project Advisors:
Copyright (c) 2022, Global Computing Lab