epam / cloud-pipeline

Cloud agnostic genomics analysis, scientific computation and storage platform
https://cloud-pipeline.com
Apache License 2.0

Automation of NGS Project data pre-processing #2494

Open SilinPavel opened 2 years ago

SilinPavel commented 2 years ago

Background

Cloud-Pipeline has a flexible object metadata model along with project templates. This model supports many common tasks, such as grouping and managing a set of data, metadata, and pipelines together. However, a lot of pre-processing work still needs to be done on these artifacts before a data pipeline can start and produce a final result.

Automating the pre-processing of NGS data (parsing and registering the sample sheet for a machine run, generating a sample sheet, etc.) could significantly improve the user experience and reduce the overall time of the data processing cycle.

Approach

The following approach is proposed as a starting point for the implementation:

NGS Project pre-processing API

A new branch of the API that will, for now, contain the following methods:

ngs-project-data-sync-service

A new service that will synchronize the state of the raw data with the state of the project in the Cloud-Pipeline system in real time:

Workflow for ngs-project-data-sync-service:

Here is pseudo-code for the service loop:

# Start of the next service cycle
# type: project; project-type: ngs-processing (configurable through a preference)
foldersToProcess = filterFoldersByType()
for folder in foldersToProcess:
    ngsDataPath = getNgsDataPath(folder)
    if ngsDataPath is None:
        log.warn
        continue

    machineRunFolders = listFolder(ngsDataPath)
    for mrFolder in machineRunFolders:
        machineRunEntity = getMetadataEntity(mrFolder)
        samplesheet = findSampleSheet(mrFolder)

        if machineRunEntity is None:
            machineRunEntity = createMachineRunEntity(mrFolder)

        if samplesheet is None or getTimestamp(samplesheet) <= machineRunEntity.timestamp:
            log.info
            continue

        api.createSampleSheet(mrFolder, machineRunEntity, samplesheet)
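
The loop above could be sketched in runnable Python roughly as follows. This is only an illustration under stated assumptions, not the actual service: the `SyncBackend` protocol, the `MachineRunEntity` dataclass, the `InMemoryBackend` stand-in, and all method names are hypothetical and simply mirror the helpers named in the pseudo-code. It also assumes that registering a sample sheet updates the entity timestamp, which is what lets later cycles skip unchanged sheets.

```python
import logging
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol, Tuple

log = logging.getLogger("ngs-project-data-sync")


@dataclass
class MachineRunEntity:
    """Hypothetical metadata entity for one machine run folder."""
    folder: str
    timestamp: float = 0.0  # assumption: time of the last registered sample sheet


class SyncBackend(Protocol):
    """Hypothetical interface mirroring the helpers used in the pseudo-code."""
    def filter_folders_by_type(self) -> List[str]: ...
    def get_ngs_data_path(self, folder: str) -> Optional[str]: ...
    def list_folder(self, path: str) -> List[str]: ...
    def get_metadata_entity(self, mr_folder: str) -> Optional[MachineRunEntity]: ...
    def create_machine_run_entity(self, mr_folder: str) -> MachineRunEntity: ...
    def find_sample_sheet(self, mr_folder: str) -> Optional[str]: ...
    def get_timestamp(self, path: str) -> float: ...
    def create_sample_sheet(self, mr_folder: str, entity: MachineRunEntity,
                            sample_sheet: str) -> None: ...


def run_sync_cycle(backend: SyncBackend) -> List[str]:
    """Run one service cycle; return machine run folders whose sheet was registered."""
    registered: List[str] = []
    for folder in backend.filter_folders_by_type():
        ngs_data_path = backend.get_ngs_data_path(folder)
        if ngs_data_path is None:
            log.warning("Project folder %s has no NGS data path, skipping", folder)
            continue
        for mr_folder in backend.list_folder(ngs_data_path):
            entity = backend.get_metadata_entity(mr_folder)
            sample_sheet = backend.find_sample_sheet(mr_folder)
            if entity is None:
                entity = backend.create_machine_run_entity(mr_folder)
            # Skip runs without a sample sheet or with an already registered one.
            if sample_sheet is None or backend.get_timestamp(sample_sheet) <= entity.timestamp:
                log.info("Nothing new to register for %s", mr_folder)
                continue
            backend.create_sample_sheet(mr_folder, entity, sample_sheet)
            registered.append(mr_folder)
    return registered


class InMemoryBackend:
    """Toy stand-in for the API and the data storage (illustration only)."""

    def __init__(self, projects: Dict[str, Optional[str]],
                 runs: Dict[str, List[str]],
                 sheets: Dict[str, Tuple[str, float]]) -> None:
        self.projects = projects  # project folder -> NGS data path (or None)
        self.runs = runs          # NGS data path -> machine run folders
        self.sheets = sheets      # machine run folder -> (sheet path, modification time)
        self.entities: Dict[str, MachineRunEntity] = {}

    def filter_folders_by_type(self) -> List[str]:
        return list(self.projects)

    def get_ngs_data_path(self, folder: str) -> Optional[str]:
        return self.projects[folder]

    def list_folder(self, path: str) -> List[str]:
        return self.runs.get(path, [])

    def get_metadata_entity(self, mr_folder: str) -> Optional[MachineRunEntity]:
        return self.entities.get(mr_folder)

    def create_machine_run_entity(self, mr_folder: str) -> MachineRunEntity:
        entity = MachineRunEntity(folder=mr_folder)
        self.entities[mr_folder] = entity
        return entity

    def find_sample_sheet(self, mr_folder: str) -> Optional[str]:
        sheet = self.sheets.get(mr_folder)
        return sheet[0] if sheet else None

    def get_timestamp(self, path: str) -> float:
        return next(mtime for p, mtime in self.sheets.values() if p == path)

    def create_sample_sheet(self, mr_folder: str, entity: MachineRunEntity,
                            sample_sheet: str) -> None:
        # Assumption: registration records the sheet's timestamp on the entity.
        entity.timestamp = self.get_timestamp(sample_sheet)
```

Calling `run_sync_cycle` repeatedly against the same backend should register a sample sheet only when it is newer than the timestamp stored on the machine run entity, so an unchanged sheet is picked up once and then skipped on later cycles. Keeping the storage and API calls behind one interface is a design choice of this sketch that makes the loop easy to test without a running Cloud-Pipeline deployment.
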

mzueva commented 2 years ago

@SilinPavel @sidoruka TODO for SampleSheet parsing: