ActiveBrainAtlas / MouseBrainAtlas_dev


Parallelized conversion of CZI files #6

Open jsiddhant opened 4 years ago

jsiddhant commented 4 years ago

Requirements

CZI files need to be converted to TIFs as part of the workflow. This conversion is computation-intensive and hence needs to be parallelized on EC2 instances. For this, a script needs to be created that spins up a set number of instances, picks up the CZI files to be processed from a directory, and processes them in parallel.

Steps Overview

Steps Details

Uploading to temporary S3 directory

The script needs to take as input a user-specified directory where the CZI files are stored, and upload those files to a temporary S3 folder in the dedicated CZI-process bucket.
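As a rough illustration of this step, the sketch below collects the `.czi` files in a directory and uploads each one under a temporary prefix. It assumes the caller passes in an object that behaves like `boto3.client("s3")`; the function names, the bucket name, and the prefix are hypothetical and not taken from the existing pipeline.

```python
from pathlib import Path


def plan_czi_uploads(local_dir, prefix):
    """Return (local_path, s3_key) pairs for every .czi file directly inside local_dir."""
    prefix = prefix.strip("/")
    files = sorted(
        p for p in Path(local_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".czi"
    )
    return [(str(p), f"{prefix}/{p.name}") for p in files]


def upload_czi_files(s3_client, local_dir, bucket, prefix):
    """Upload each planned file and return the S3 keys that were written.

    s3_client is assumed to expose upload_file(Filename, Bucket, Key),
    as the boto3 S3 client does.
    """
    keys = []
    for local_path, key in plan_czi_uploads(local_dir, prefix):
        s3_client.upload_file(local_path, bucket, key)
        keys.append(key)
    return keys
```

The returned key list can double as the list of files to process in the queuing step, so the coordinator does not need to list the bucket again.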

Spinning up EC2 Instances to process files

The script needs to spin up the number of instances specified by the user. Because these instances will process the CZI files, they need to have the libraries required for the conversion. For this, we will use launch templates (https://docs.aws.amazon.com/autoscaling/ec2/userguide/create-launch-template.html). A launch template will let us provide the libraries through an AMI and set the hardware configuration of the instances being spun up. Launch templates are also versioned, with a description per version, so we can easily update instance configurations without changing the script.
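A minimal sketch of launching the instances from a launch template could look like the following. It assumes `ec2_client` behaves like `boto3.client("ec2")`; the function name and the template name used in the usage note are hypothetical.

```python
def launch_workers(ec2_client, template_name, count, version="$Latest"):
    """Start `count` instances from a launch template and return their instance IDs.

    MinCount and MaxCount are both set to `count`, so EC2 either launches
    all requested instances or none of them.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    response = ec2_client.run_instances(
        LaunchTemplate={"LaunchTemplateName": template_name, "Version": version},
        MinCount=count,
        MaxCount=count,
    )
    return [instance["InstanceId"] for instance in response["Instances"]]
```

With real credentials this might be called as `launch_workers(boto3.client("ec2"), "czi-converter", 4)`. Pinning `version` to a specific template version number instead of `$Latest` would make runs reproducible when the template changes.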

Queuing Files for processing on EC2 instances

To run the processing, the script will take as input the temporary S3 directory to which the CZI files were uploaded (Step 1). All CZI files in this directory form the list of files to be processed. The multi-threaded script running locally will act as the coordinator. Each thread of the local script will issue a command to an EC2 instance to download a specified file, process it, and upload the result to S3 once processing is complete. The threads will run in parallel, and a new file will be queued as soon as a previous one has finished processing.
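One way the coordinator could be structured is sketched below with a thread pool and a queue of idle instances. How a command actually reaches an instance is left open in this issue, so `process_one(worker, key)` is a hypothetical callable supplied by the caller; it is assumed to make the given instance download the file, convert it, and upload the result.

```python
import queue
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_queue(keys, workers, process_one):
    """Process every key, keeping each worker busy with one file at a time.

    Returns (results, failures): results maps key -> return value of
    process_one, failures maps key -> the exception that was raised.
    """
    if not workers:
        raise ValueError("at least one worker is required")

    idle = queue.Queue()
    for worker in workers:
        idle.put(worker)

    def job(key):
        worker = idle.get()          # take a free instance
        try:
            return process_one(worker, key)
        finally:
            idle.put(worker)         # release it for the next queued file

    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = {pool.submit(job, key): key for key in keys}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:  # record the failure and keep going
                failures[key] = exc
    return results, failures
```

Because the pool has one thread per instance, a finished file immediately frees both a thread and an instance, and the next key in the list starts without waiting for the rest of the batch. Failed files are collected rather than aborting the run, so they can be retried afterwards.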

Downloading processed files

The processed files will then need to be downloaded from S3.
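The download step could be sketched as follows, again assuming an object that behaves like `boto3.client("s3")`. The function name and the output prefix are hypothetical, and the sketch flattens keys to their file names, which assumes that file names under the prefix are unique.

```python
from pathlib import Path


def download_prefix(s3_client, bucket, prefix, dest_dir):
    """Download every object under `prefix` into dest_dir and return the local paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    downloaded = []
    # list_objects_v2 returns at most 1000 keys per call, so use a paginator.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip "folder" placeholder objects
            target = dest / Path(key).name
            s3_client.download_file(bucket, key, str(target))
            downloaded.append(str(target))
    return downloaded
```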

Deliverables

  1. Script that can execute the steps outlined above.

  2. Technical Documentation explaining implementation.

  3. User Guide

jsiddhant commented 4 years ago

Open points:

  1. How will it be integrated with the existing pipeline?