IU-CACR / SWIP

Scientific Workflow Integrity with Pegasus (SWIP) issue tracking (no source code).

Chaos Jungle #2

Open von opened 6 years ago

von commented 6 years ago

RENCI has completed the Chaos Jungle implementation and launched it on ExoGENI. The next step is to benchmark the CJ implementation to measure the overhead of the different packet-mangling modes. If you mangle the TCP payload of an SSH/SCP stream, the session just exits. HTCondor would be a good test target since it doesn't do integrity checking on data payloads.
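To illustrate why payload integrity checking matters here, below is a minimal sketch of an application-level check that would catch a single mangled bit. It is not Chaos Jungle or HTCondor code: `flip_bit` only simulates the corruption in memory, and the helper names `sha256_hex`, `flip_bit` and `transfer_is_intact` are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the payload as a hex string."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Simulate in-flight corruption by flipping one bit (hypothetical helper)."""
    mangled = bytearray(data)
    mangled[byte_index] ^= 1 << bit
    return bytes(mangled)


def transfer_is_intact(expected_digest: str, received: bytes) -> bool:
    """Compare the digest recorded by the sender with the received payload."""
    return sha256_hex(received) == expected_digest


# Sender side: record the digest before the transfer.
payload = b"example workflow input data"
digest = sha256_hex(payload)

# Receiver side: one copy arrives clean, one arrives with a flipped bit.
received = flip_bit(payload, byte_index=3)

print(transfer_is_intact(digest, payload))   # clean copy
print(transfer_is_intact(digest, received))  # mangled copy
```

A transfer path without a check like this would accept the mangled copy, since its length and framing are unchanged.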

anriban commented 6 years ago
  1. Evaluating tools for measuring the overhead of chaos jungle: experimented with qperf, nuttcp and sar. Ongoing work.

  2. RENCI will set up a slice with the Fedora 25 image and provide access to ISI. ISI will install HTCondor, Pegasus and an example workflow. RENCI will then snapshot the image.

rynge commented 6 years ago

Expectations for when we have a working workflow in chaos jungle: Pegasus does not currently classify errors or provide a summary of what types of errors happened during execution (issue #1), so we will not get a good view of how the errors introduced by chaos jungle were handled.

Currently: some errors will be handled by job retries, which might result in a successful workflow. In this case the user would not know there had been an error unless the user goes looking. If an error takes place in a non-recoverable place, the workflow will end with a failure.

Goal: Pegasus would be able to automatically handle all the errors, using methods like job retries and transfer retries. At the end, some kind of report could be provided on the errors encountered.
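The retry-then-report idea can be sketched roughly as follows. This is not Pegasus's implementation or API: `run_with_retries`, `make_flaky_job` and `IntegrityError` are hypothetical names, and the failures are simulated rather than taken from a real run.

```python
from collections import Counter
from typing import Callable, List, Tuple


class IntegrityError(Exception):
    """Hypothetical error type standing in for a detected corrupt transfer."""


def run_with_retries(job: Callable[[], None],
                     max_retries: int = 3) -> Tuple[bool, List[str]]:
    """Run a job, retrying on failure, and record every error that occurred."""
    errors: List[str] = []
    for _attempt in range(max_retries + 1):  # first try plus the retries
        try:
            job()
            return True, errors
        except Exception as exc:
            errors.append(type(exc).__name__)
    return False, errors


def make_flaky_job(failures: int) -> Callable[[], None]:
    """Build a simulated job that fails a fixed number of times, then succeeds."""
    remaining = {"n": failures}

    def job() -> None:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise IntegrityError("simulated checksum mismatch")

    return job


# The job recovers after two failures, but the errors are still recorded.
succeeded, errors = run_with_retries(make_flaky_job(2), max_retries=3)
report = Counter(errors)
print(f"succeeded={succeeded}, errors by type={dict(report)}")
```

The point of keeping the `errors` list even on success is that a workflow rescued by retries still leaves a record the user can see without going looking.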

anriban commented 6 years ago

Update: 09/25/17

  1. Working on data collection to measure the overhead of chaos jungle. Identified "nuttcp" and "sar" as the most relevant tools for collecting CPU utilization on the RX and TX sides. Initial experiments point to little or no overhead from chaos jungle on data transfers. Will run more experiments and evaluate in the next couple of weeks.

  2. Encountered some issues while snapshotting the Fedora image with chaos jungle. Working on getting them fixed with our ExoGENI sysadmin. We think we know what is happening, but need some more testing to straighten this out. Hopefully, it will be done by next week. Will send slice details to ISI after that.

anriban commented 6 years ago

Update: 10/09/17

The image issues, which were primarily related to image support for new kernels on the ExoGENI testbed, were fixed with help from the ExoGENI sysadmins. I have set up a slice with a Fedora image with chaos jungle. The slice topology and access details have been provided to Mats and Karan so that they can set up HTCondor+Pegasus and an example workflow on the slice. We will snapshot the image when ISI is done. I have the slice up until 10/20/17, but I can extend it as required.