Cloud computing provides dynamic, on-demand, and scalable computing resources that enable the processing of complex workflow-based experiments.
Cloud infrastructure behaves like a black box: resources are allocated on demand.
The goal is a mechanism that captures the provenance of a workflow execution on the cloud so that the workflow can later be re-created.
Capturing hardware dependencies becomes a challenging issue in the cloud context, where resources can be created or destroyed at runtime.
ReCAP proposes mapping approaches to 1) capture cloud-aware provenance information under various cloud usage scenarios and 2) re-provision execution resources on the cloud with a similar configuration.
In similar works, the collected provenance data provides information about file access but not about the resource configuration.
ReCAP does not rely on annotations; rather, it directly interacts with the cloud middleware at runtime to acquire resource configuration information and then establishes a mapping between workflow jobs and cloud resources.
Workflow execution scenario on the cloud
A virtual environment is created on top of the cloud resources and workflows are executed in it (fig. 2).
Pegasus is used as the WMS along with a Condor cluster on the cloud infrastructure to execute workflow jobs; each VM runs a Condor instance to execute the user's jobs (fig. 2).
The scientist uses a workflow tool, or reuses an existing workflow from a database, and submits it through Pegasus.
Pegasus schedules the workflow jobs to VMs and then retrieves provenance information such as job logs, job arguments and host information, which is not sufficient for re-provisioning resources.
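To make the gap concrete, here is a minimal sketch (all field names are assumptions, not the paper's schema) contrasting the job-level provenance a WMS like Pegasus records with the cloud-aware fields ReCAP adds; only the enriched record carries enough to re-provision a VM:

```python
from dataclasses import dataclass

@dataclass
class WMSJobProvenance:
    """What the WMS typically records per job (hypothetical shape)."""
    job_id: str
    arguments: str
    hostname: str        # host name/IP only -- not enough to re-create the VM

@dataclass
class CloudAwareProvenance:
    """WMS provenance enriched with cloud resource configuration (assumed fields)."""
    wms: WMSJobProvenance
    flavor: str          # vCPU/RAM/disk profile of the VM
    image_id: str        # base OS image the VM was booted from

def can_reprovision(p) -> bool:
    """A similar VM can only be re-created if flavor and image are known."""
    return isinstance(p, CloudAwareProvenance) and bool(p.flavor and p.image_id)

wms_only = WMSJobProvenance("job-1", "--in data.fits", "10.0.0.5")
enriched = CloudAwareProvenance(wms_only, "m1.large", "ubuntu-16.04")
```

With only `wms_only`, `can_reprovision` fails; the enriched record passes.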
ReCAP architecture
A plugin-based mechanism that includes:
WMS Wrapper service: processes a workflow submission requested by the user, which includes information about the WMS, storage, environment variables and workflow containers.
WS client: interacts with the wrapper service and submits the user's files.
WMS Layer: loads the appropriate plugins during the mapping process and includes monitor and parser components. The monitor plugin interacts with the database to retrieve workflow and job states and provenance, and then interacts with the parser to parse job outputs.
Cloud Layer: interacts with the cloud middleware and captures two types of information: 1) about virtual machines (CRM) and 2) about the workflow's input and output files (CSM).
Aggregator: Performs the mapping between the workflow job information collected from the Workflow Provenance component and the cloud resource information collected from the Cloud Layer Provenance component.
WF-Repeat: is designed to re-execute a workflow on the cloud.
Comparator: evaluates the reproducibility of a workflow.
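The plugin-based design above can be sketched roughly as follows (a toy registry; the class and plugin names are illustrative, not ReCAP's actual API): the WMS Layer selects a monitor/parser plugin per workflow system, so new systems can be supported without changing the core:

```python
class PluginRegistry:
    """Toy plugin registry: maps (kind, system) to a plugin class."""
    def __init__(self):
        self._plugins = {}

    def register(self, kind, name, cls):
        self._plugins[(kind, name)] = cls

    def load(self, kind, name):
        try:
            return self._plugins[(kind, name)]()   # instantiate on demand
        except KeyError:
            raise ValueError(f"no {kind} plugin for {name!r}")

class PegasusMonitor:
    """Hypothetical monitor plugin; in reality it would query the Pegasus DB."""
    def job_states(self):
        return {"job-1": "DONE"}

registry = PluginRegistry()
registry.register("monitor", "pegasus", PegasusMonitor)
monitor = registry.load("monitor", "pegasus")
```

A Cloud Layer plugin for a given middleware (e.g. OpenStack) would be registered and loaded the same way.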
Job-to-cloud resource mapping
Resource mapping approaches provide two pieces of information: hardware and software configurations.
Most WMSs maintain either a unique IP or host name to access the provenance information. There are two different resource usage scenarios on the cloud:
Static environment in the cloud: the resource remains accessible even after the workflow's execution has finished.
Dynamic environment on the cloud: resources are provisioned on demand and released when they are no longer required. The Eager and Lazy approaches target this case.
Static approach: fig. 9 shows the static mapping between the list of jobs of a given workflow (from the Pegasus database) and the list of VMs in the cloud. The mapping is established by matching IP addresses, which is why it is not possible in dynamic environments: the resources are no longer available after the run.
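The static mapping amounts to a join on IP address. A minimal sketch (record shapes are assumed, not the paper's schema):

```python
# Jobs as recorded in the WMS provenance DB, VMs as listed by the cloud
# middleware; in the static scenario both are still queryable after the run.
jobs = [
    {"job_id": "mProject_1", "host_ip": "10.0.0.5"},
    {"job_id": "mDiff_1",    "host_ip": "10.0.0.6"},
]
vms = [
    {"ip": "10.0.0.5", "flavor": "m1.small", "image": "img-montage"},
    {"ip": "10.0.0.6", "flavor": "m1.large", "image": "img-montage"},
]

def static_map(jobs, vms):
    """Join jobs to VM configurations by matching IP addresses."""
    by_ip = {vm["ip"]: vm for vm in vms}
    # In a dynamic environment the VM may already be gone, so the lookup
    # yields None -- which is exactly why this approach fails there.
    return {j["job_id"]: by_ip.get(j["host_ip"]) for j in jobs}

mapping = static_map(jobs, vms)
```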
Eager approach: establishes the job-to-cloud mapping in two phases: 1) a temporary mapping between the job and the cloud resource is established while the jobs are still running; 2) the final job-to-cloud resource mapping is made by retrieving job information from the workflow provenance captured by the WMS.
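The two phases can be sketched like this (a simplification with assumed record shapes, not ReCAP's implementation): phase 1 records the pairing while the job runs, so the VM configuration survives even if the VM is destroyed before phase 2:

```python
temporary = {}  # job_id -> VM record, captured at runtime (phase 1)

def phase1_observe(job_id, vm):
    """While the job runs, copy the VM's configuration before it can vanish."""
    temporary[job_id] = vm

def phase2_finalize(wms_provenance):
    """After the workflow finishes, confirm each job seen in the WMS
    provenance (e.g. rows from the Pegasus DB) against the temporary map."""
    final = {}
    for job in wms_provenance:
        if job["job_id"] in temporary:
            final[job["job_id"]] = temporary[job["job_id"]]
    return final

phase1_observe("mProject_1", {"ip": "10.0.0.7", "flavor": "m1.small"})
final = phase2_finalize([{"job_id": "mProject_1"}])
```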
Lazy approach: the eager approach relies on Condor; to remove this dependency, the Lazy approach is devised for dynamic environments. The algorithm does not maintain a temporary relation between a job and the virtual machine; instead, it periodically monitors the current status of the VMs running on the cloud infrastructure and retrieves their metadata.
This approach is not efficient in terms of discovery time, but it works in all scenarios, precisely because it needs no runtime information about which job runs on which machine.
However, it cannot determine a mapping between a job and a VM if no host information is available in the provenance.
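A rough sketch of the lazy polling idea (shapes assumed): snapshot the VMs the middleware reports, keyed by host, and map jobs afterwards from the host recorded in their provenance. The final lookup also shows the stated limitation: no host, no mapping:

```python
snapshots = {}  # hostname/IP -> last-seen VM metadata

def poll_cloud(list_vms):
    """Periodic pass: record metadata of every VM the middleware reports.
    `list_vms` stands in for a real cloud middleware API call."""
    for vm in list_vms():
        snapshots[vm["host"]] = vm

def map_job(job):
    """Map a job to a VM via its recorded host; returns None when the WMS
    provenance carries no host information (the limitation noted above)."""
    host = job.get("host_ip")
    return snapshots.get(host) if host else None

poll_cloud(lambda: [{"host": "10.0.0.9", "flavor": "m1.medium"}])
found = map_job({"job_id": "recon_1", "host_ip": "10.0.0.9"})
missing = map_job({"job_id": "recon_2"})
```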
Result
In order to evaluate the proposed work, three different workflows – Montage, ReconAll, Wordcount – were executed using ReCAP, and their captured provenance was found to be consistent (tables 3-6).
Questions:
Can we verify the reproducibility or irreproducibility of a workflow by comparing just the allocated execution resources?
The execution time of the ReconAll workflow decreased when the same workflow was re-executed on a resource with a better configuration. Given the different definitions of reproducibility, can we evaluate the reproducibility of a workflow based on its execution time?
Note: I think capturing the resource configurations is necessary but not sufficient for evaluating reproducibility.
URL: https://www.sciencedirect.com/science/article/pii/S0167739X17314917