Automation of NGS Project data pre-processing

Background

Cloud-Pipeline has a flexible object metadata model along with project templates. This model already covers many common tasks, such as grouping and managing a set of data, metadata, and pipelines together. However, a lot of pre-processing work still needs to be done on this set of artifacts before a data pipeline can be started and a final result obtained.

Automating the pre-processing of NGS data (parsing and registering the samplesheet for a machine run, generating a samplesheet, etc.) could noticeably improve the user experience and reduce the overall time of the data processing cycle.

Approach

The following approach is proposed as a starting point for the implementation:

NGS Project pre-processing API

A new branch of the API that will, for now, contain the following methods:

Registration of a samplesheet file for a specific project and a specific machine run in this project
- The following checks are done before registering the samplesheet:
  - The project has the right type
  - The data folder exists
  - The machine run metadata entity exists
  - The folder for this machine run also exists
- For each record in the samplesheet it will:
  - Create a metadataEntity and link it to the Machine Run metadataEntity.
  - Save the content of the samplesheet to the machine run data directory
Deletion of a samplesheet and all metadata information related to this samplesheet
- For each MetadataEntity linked to the provided MachineRun it will:
  - Remove that metadataEntity.
  - Remove the samplesheet file from the machine run data directory
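
The registration and deletion steps above can be sketched roughly as follows. This is only a minimal in-memory illustration, not the actual API: the `Project` class, the function names, the `SampleSheet.csv` file name, and the Illumina-style `[Data]` section with a `Sample_ID` column are all assumptions made for the example. The sketch also stores the samplesheet file once per registration rather than once per record.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Project:
    """Hypothetical in-memory stand-in for a project and its metadata."""
    project_type: str
    data_folder_exists: bool = True
    # machine run id -> ids of sample entities linked to that run
    machine_runs: dict = field(default_factory=dict)
    # machine run id -> {file name: content} for that run's data folder
    run_folders: dict = field(default_factory=dict)


def parse_samples(samplesheet_text):
    """Return rows of the [Data] section as dicts (assumed Illumina-style layout)."""
    lines = samplesheet_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("[Data]"):
            return list(csv.DictReader(io.StringIO("\n".join(lines[i + 1:]))))
    return []


def register_samplesheet(project, run_id, samplesheet_text,
                         expected_type="ngs-processing"):
    # Pre-registration checks, in the order listed above.
    if project.project_type != expected_type:
        raise ValueError("project has the wrong type")
    if not project.data_folder_exists:
        raise ValueError("data folder does not exist")
    if run_id not in project.machine_runs:
        raise ValueError("machine run metadata entity does not exist")
    if run_id not in project.run_folders:
        raise ValueError("machine run folder does not exist")

    # One sample entity per record, linked to the machine run.
    samples = [row["Sample_ID"] for row in parse_samples(samplesheet_text)
               if row.get("Sample_ID")]
    project.machine_runs[run_id] = samples
    # Keep a copy of the samplesheet in the machine run data folder.
    project.run_folders[run_id]["SampleSheet.csv"] = samplesheet_text
    return samples


def delete_samplesheet(project, run_id):
    # Drop every sample entity linked to the run, then the stored file.
    project.machine_runs[run_id] = []
    project.run_folders[run_id].pop("SampleSheet.csv", None)
```

Running all checks before any entity is created means a failed registration leaves the project state untouched.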

ngs-project-data-sync-service

A new service that will synchronize the state of the raw data with the state of the project in the Cloud-Pipeline system in real time:

Automation of registration/updating of newly uploaded samplesheets
Starting a workflow for newly updated projects/machine runs

Workflow for ngs-project-data-sync-service:

At the time of the next service cycle:
- All folders that should be synchronized are marked as type: project; project-type: ngs-processing (configurable through a preference); ngs-data: <...>
- The ngs-data folder already contains data (one folder per machine run)
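
As an illustration of the folder selection described above, here is a minimal sketch that assumes each folder is available as a plain dictionary with a `metadata` mapping. The dictionary shape and the helper names are hypothetical and only mirror the tags listed in this section; the real folder model may differ.

```python
# Tags a folder must carry to be synchronized. In the proposal the
# project-type value is configurable through a preference.
REQUIRED_TAGS = {"type": "project", "project-type": "ngs-processing"}


def filter_folders_by_type(folders, required_tags=REQUIRED_TAGS):
    """Keep only folders whose metadata carries every required tag value."""
    return [
        folder for folder in folders
        if all(folder.get("metadata", {}).get(key) == value
               for key, value in required_tags.items())
    ]


def get_ngs_data_path(folder, key="ngs-data"):
    """Return the ngs-data path from folder metadata, or None if missing or empty."""
    return folder.get("metadata", {}).get(key) or None
```

Keeping the `ngs-data` lookup separate from the type filter lets the service log a warning for a correctly typed project that has no data path, instead of silently skipping it.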

Here is pseudo-code for the service loop:

**Start of the next service cycle**
# type: project; project-type: ngs-processing (configurable through a preference)
foldersToProcess = filterFoldersByType()
for folder in foldersToProcess:
    ngsDataPath = getNgsDataPath(folder)
    if ngsDataPath is None:
        log.warn("ngs-data path is not set for the folder, skipping")
        continue

    machineRunFolders = listFolder(ngsDataPath)
    for mrFolder in machineRunFolders:
        machineRunEntity = getMetadataEntity(mrFolder)
        samplesheet = findSampleSheet(mrFolder)

        if machineRunEntity is None:
            machineRunEntity = createMachineRunEntity(mrFolder)

        if samplesheet is None or getTimestamp(samplesheet) <= machineRunEntity.timestamp:
            log.info("no new or updated samplesheet for the machine run, skipping")
            continue

        api.createSampleSheet(mrFolder, machineRunEntity, samplesheet)
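
The samplesheet lookup and the timestamp check used in the loop could look roughly like the sketch below. It assumes the machine run folders live on a local or mounted filesystem, that the samplesheet is named `SampleSheet.csv`, and that the machine run entity timestamp is a POSIX timestamp in seconds. None of this is fixed by the proposal, and the function names are hypothetical.

```python
from pathlib import Path


def find_sample_sheet(run_folder, name="SampleSheet.csv"):
    """Return the samplesheet path inside a machine run folder, or None if absent."""
    candidate = Path(run_folder) / name
    return candidate if candidate.is_file() else None


def needs_registration(samplesheet, entity_timestamp):
    """True only when a samplesheet exists and is newer than the entity timestamp."""
    if samplesheet is None:
        return False
    # st_mtime is the last modification time in seconds since the epoch.
    return samplesheet.stat().st_mtime > entity_timestamp
```

With these helpers, the skip condition in the loop corresponds to `not needs_registration(samplesheet, machineRunEntity.timestamp)`.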
