Cloud-Pipeline has a flexible model of object metadata along with project templates.
With this model many common tasks can be accomplished, such as grouping and managing together a set of data, metadata, and pipelines.
However, a lot of pre-processing work needs to be done with this set of artifacts before a data pipeline can be started and a final result obtained.
Automating the pre-processing of NGS data (parsing and registration of a sample sheet for a machine run, generation of a samplesheet, etc.) could noticeably improve the user experience and reduce the overall time of the data processing cycle.
Approach
To start with the implementation, the following approach is proposed:
NGS Project pre-processing API
A new branch of the API that will initially contain the following methods:
Registration of a samplesheet file for a specific project and a specific machine run in this project (a client-side sketch follows the steps below)
The following checks are performed before the samplesheet is registered:
The project has the correct type
The data folder exists
The machine run metadata entity exists
The folder for this machine run also exists
For each record in the sample sheet it will:
Create a metadataEntity and link it to the Machine Run metadataEntity
Save the content of the samplesheet to the machine run data directory
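A minimal sketch of how a client could call such a registration method is shown below. The endpoint path, payload layout and authentication scheme are assumptions for illustration only; the actual API contract is to be defined during implementation.

import requests

API_URL = "https://<cloud-pipeline-host>/pipeline/restapi"  # hypothetical base URL

def register_samplesheet(project_id, machine_run_id, samplesheet_path, token):
    # Upload the samplesheet file and register it for the given project / machine run
    with open(samplesheet_path, "rb") as samplesheet_file:
        response = requests.post(
            "%s/ngs/project/%s/machineRun/%s/samplesheet" % (API_URL, project_id, machine_run_id),
            headers={"Authorization": "Bearer %s" % token},
            files={"file": samplesheet_file},
        )
    response.raise_for_status()
    return response.json()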
Deletion of a samplesheet and all metadata information related to this samplesheet (a similar sketch follows the steps below)
For each MetadataEntity linked to the provided MachineRun it will:
Remove this metadataEntity
Remove the samplesheet file from the machine run data directory
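A similar sketch for the deletion method, again with a hypothetical endpoint path:

import requests

API_URL = "https://<cloud-pipeline-host>/pipeline/restapi"  # hypothetical base URL

def delete_samplesheet(project_id, machine_run_id, token):
    # Remove the samplesheet and all metadata entities linked to the machine run
    response = requests.delete(
        "%s/ngs/project/%s/machineRun/%s/samplesheet" % (API_URL, project_id, machine_run_id),
        headers={"Authorization": "Bearer %s" % token},
    )
    response.raise_for_status()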
ngs-project-data-sync-service
A new service that will synchronize the state of the raw data with the state of the project in the Cloud-Pipeline system in real time:
Automation of registration/updating of newly uploaded samplesheets
Starting a workflow for newly updated projects/machine runs
Workflow for ngs-project-data-sync-service:
At the time of the next service cycle it is assumed that:
All folders that should be synchronized are marked with type: project; project-type: ngs-processing (configurable through a preference); ngs-data: <...>
The ngs-data folder already contains data (one folder per machine run)
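A minimal sketch of how the service could decide whether a folder is subject to synchronization, assuming folder metadata is available as a plain key/value dictionary (the attribute names mirror the markers listed above; the function name is illustrative):

EXPECTED_TYPE = "project"
EXPECTED_PROJECT_TYPE = "ngs-processing"  # configurable through a preference
NGS_DATA_KEY = "ngs-data"

def should_sync(folder_attributes):
    # A folder is synchronized only if it carries all three markers
    return (
        folder_attributes.get("type") == EXPECTED_TYPE
        and folder_attributes.get("project-type") == EXPECTED_PROJECT_TYPE
        and NGS_DATA_KEY in folder_attributes
    )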
Here is pseudo-code for the service loop:
**Start of the next service cycle**

# Select folders marked with type: project; project-type: ngs-processing (configurable through a preference)
foldersToProcess = filterFoldersByType()
for folder in foldersToProcess:
    ngsDataPath = getNgsDataPath(folder)
    if ngsDataPath is None:
        log.warn("Folder '%s' has no ngs-data attribute, skipping it" % folder)
        continue
    machineRunFolders = listFolder(ngsDataPath)
    for mrFolder in machineRunFolders:
        machineRunEntity = getMetadataEntity(mrFolder)
        samplesheet = findSampleSheet(mrFolder)
        # Register the machine run metadata entity if it is not registered yet
        if machineRunEntity is None:
            machineRunEntity = createMachineRunEntity(mrFolder)
        # Skip machine runs without a samplesheet or with an already processed one
        if samplesheet is None or getTimestamp(samplesheet) <= machineRunEntity.timestamp:
            log.info("No new samplesheet found for machine run '%s', skipping it" % mrFolder)
            continue
        # Register the new/updated samplesheet through the pre-processing API
        api.createSampleSheet(mrFolder, machineRunEntity, samplesheet)
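The loop above could be driven as a simple periodic service. The sketch below assumes a process_next_cycle() entry point wrapping that loop and a configurable sync interval; both names and the default interval are illustrative.

import logging
import time

SYNC_INTERVAL_SECONDS = 300  # hypothetical default, expected to be configurable

def run_service():
    while True:
        try:
            process_next_cycle()  # the service cycle from the pseudo-code above
        except Exception:
            logging.exception("Sync cycle failed, will retry on the next cycle")
        time.sleep(SYNC_INTERVAL_SECONDS)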