Closed seallard closed 1 month ago
Don't forget this one @diitaz93, @ChrOertlin, @islean, @Vince-janv
Microbial analyses are started every 6th hour, so that could be sped up a lot.
Possible state transition 1: start analyses after post processing
When post processing completes for a flow cell, new data becomes available which potentially can be used to start analyses.
From: post processing completed for flowcell. To: analyses started for cases.
For example, the MIP-DNA workflow.
Current trigger: crontab which runs `workflow-mip-dna-start.sh` (`cg workflow mip-dna start-available`).
Frequency: every hour.
Possible topic names
* `FlowcellPostProcessing`
* `DemultiplexingPostProcessing`
* `DemuxPostProcessing`
Possible event names
* `PostProcessingCompleted`
Event content
The flow cell ID.
How does the message impact turn-around-time, resilience and decoupling?
Up to 1 hour reduction for the state transition and TAT. No real decoupling. Resilience: we can have retries in the consumer that tries to start the analyses, for example if a database, Slurm, or something else goes down temporarily.
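The retry behaviour mentioned above could look something like the sketch below. The Kafka plumbing is omitted, and `start_analyses`, the event shape, and the flow cell ID `HVKJCDRXX` are illustrative assumptions, not the real cg code:

```python
import time

def with_retries(action, max_attempts=5, delay_seconds=0.0):
    """Call `action`, retrying on failure so a transient outage
    (database, Slurm, ...) does not drop the event."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # give up; the event can be redelivered or dead-lettered
            time.sleep(delay_seconds * attempt)  # linear backoff between attempts

# A flaky stand-in for "start analyses for the flow cell": fails twice, then works.
attempts = {"n": 0}

def start_analyses(flow_cell_id):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("slurm temporarily unavailable")
    return f"started analyses for {flow_cell_id}"

result = with_retries(lambda: start_analyses("HVKJCDRXX"))
```

With this pattern a temporary Slurm or database outage just delays the state transition instead of losing it until the next timer tick.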
Are cases always run on just one flow cell?
Maybe a nice place to start would be to make the reading of workflow QC data into arnold event-driven. Currently, this project is on hold since we have some reworking to do in trailblazer. However, it might be nice to create a producer for finished workflows and let cg consume this to then trigger parsing of the QC data.
Name: `AnalysisRun`
Name: `AnalysisCompleted`
Content: `case_internal_id`
As I see it now, this would resolve an immediate issue where this flow is coupled to the trailblazer flow; once a Slurm job finishes successfully, it could produce the event. It does not affect TAT, but it offers other tangible benefits by storing data with valuable information for business logic.
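The event proposed above could be modelled as a small explicit schema. The class and field names mirror the proposal (`AnalysisCompleted` carrying `case_internal_id`); the JSON wire format and the case ID `vitalmouse` are assumptions for illustration:

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AnalysisCompleted:
    """Proposed event: published when a workflow finishes successfully."""
    case_internal_id: str

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "AnalysisCompleted":
        return cls(**json.loads(payload))

# Round-trip the event the way it would travel through a topic.
event = AnalysisCompleted(case_internal_id="vitalmouse")
wire = event.to_json()
roundtrip = AnalysisCompleted.from_json(wire)
```

An explicit schema like this also gives the "valuable information for business logic" a stable shape that other consumers can rely on.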
When a case has been run, the analysis is to be uploaded. This could be done by setting up a listener on the analysis table in cg with a `completed_at` trigger.
From: `completed_at` has been set in the Analysis table.
To: the upload command is run.
Example: Tomte
Current trigger: systemd which runs `cg upload auto --workflow tomte`
Frequency: every tenth minute.
Possible topic names
* `AnalysisStatus`
* `AnalysisUpdate`
Possible event names
* `AnalysisCompleted`
Event content
The case internal ID.
Up to 10 minutes' reduction for the state transition and TAT. No real decoupling. Resilience might go down a bit: there are some more filters that need to be considered, and I do not know if we need to set up more triggers to cover cases where completed analyses are not to be uploaded.
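The filtering concern above could live in the consumer as a single guard predicate. The specific conditions here (already uploaded, failed QC) are illustrative assumptions, not the real cg business rules:

```python
# Consumer-side guard: even after an analysis-completed event arrives, some
# completed analyses should not be uploaded. Centralising the rules in one
# predicate keeps the event handler itself simple.

def should_upload(analysis: dict) -> bool:
    if analysis.get("uploaded_at") is not None:
        return False  # already uploaded; the event was redelivered or duplicated
    if analysis.get("qc_status") == "failed":
        return False  # hypothetical rule: failed QC is never auto-uploaded
    return analysis.get("completed_at") is not None

ok = should_upload({"completed_at": "2024-01-01", "uploaded_at": None, "qc_status": "passed"})
dup = should_upload({"completed_at": "2024-01-01", "uploaded_at": "2024-01-02"})
```

Keeping the filters in the consumer means no extra triggers are needed: every completion event is published, and the consumer decides what to act on.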
Do we have to use a listener? It would be nice to publish the event from the upstream process.
No, we don't have to listen to database tables. We will explicitly publish events from the upstream process. There are many reasons why integrating Kafka directly against our databases is a bad idea at this stage in our organization.
Mainly, I think it is a bad pattern because we have one monolith with one giant database that a lot of code writes to. What if other components write to the column? What if a one-time script is run that writes data?
It is much better if we are explicit about when an event is supposed to be published instead of introducing another layer of indirection.
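The explicit alternative argued for above could look like this sketch: the code path that sets `completed_at` also publishes the event, with the publisher injected so nothing watches the database. The class, function, topic name, and case ID are illustrative assumptions:

```python
from datetime import datetime

class RecordingPublisher:
    """Stand-in for a real Kafka producer; records what would be published."""
    def __init__(self):
        self.published = []

    def publish(self, topic: str, event: dict) -> None:
        self.published.append((topic, event))

def complete_analysis(analysis: dict, publisher: RecordingPublisher) -> None:
    # The write and the publish live in the same code path, so it is always
    # explicit when an event goes out -- no trigger or table listener involved.
    analysis["completed_at"] = datetime.utcnow().isoformat()
    publisher.publish(
        "AnalysisStatus",
        {"name": "AnalysisCompleted", "case_internal_id": analysis["case_internal_id"]},
    )

publisher = RecordingPublisher()
analysis = {"case_internal_id": "vitalmouse", "completed_at": None}
complete_analysis(analysis, publisher)
```

A one-time script or another component writing to the column then simply does not publish, which is exactly the behaviour a table listener cannot distinguish.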
Are cases always run on just one flow cell?
Nope! So the listener would still identify which cases are ready to be started.
Short rundown: we could set up a publisher in trailblazer which posts an event whenever an analysis is completed, containing the case's internal ID. Then we set up a consumer in cg which triggers the store functionality. This would add some decoupling between cg and trailblazer.
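The cg side of this rundown could be a small dispatch table: events from trailblazer are routed by name, and the analysis-completed event triggers the store step. The handler registry, event shape, and case ID are assumptions for illustration:

```python
# Minimal sketch of the cg-side consumer: unknown events are ignored rather
# than crashing the consumer, and the completed-analysis handler stands in for
# the real store functionality (e.g. `cg workflow ... store-available`).

stored_cases = []

def store_analysis(event: dict) -> None:
    stored_cases.append(event["case_internal_id"])  # stand-in for the store step

HANDLERS = {"AnalysisCompleted": store_analysis}

def consume(event: dict) -> None:
    handler = HANDLERS.get(event["name"])
    if handler is None:
        return  # not an event cg cares about; skip it
    handler(event)

consume({"name": "AnalysisCompleted", "case_internal_id": "vitalmouse"})
consume({"name": "SomethingElse"})
```

New state transitions then become new entries in the registry rather than new timers.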
When flow cells are transferred from the NAS or from PDC.
From: copy completed for flow cell. To: demultiplexing started.
Current trigger: crontab which runs `start-demux.sh` (`cg demultiplex all`).
Frequency: every 10 min.
Demultiplexing
FlowCellCopyCompleted
The flow cell ID / full name.
Up to 10 minutes' reduction for the state transition and TAT. Some buffer time in the DRAGEN (so we do not start demultiplexing all flow cells at the same time). Avoid flow cells being taken twice for demultiplexing by the automation (maybe?).
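The "taken twice" concern is essentially an idempotency problem, which the consumer could solve by remembering which flow cells it has already sent to demultiplexing. An in-memory set and the flow cell ID `HVKJCDRXX` stand in here for whatever durable store and real data cg would use:

```python
# Idempotent handler for a flow-cell-copy-completed event: duplicate or
# redelivered events for the same flow cell are skipped instead of starting
# demultiplexing a second time.

demux_started = set()

def handle_flow_cell_copy_completed(flow_cell_id: str) -> bool:
    """Start demultiplexing once per flow cell; return True if started now."""
    if flow_cell_id in demux_started:
        return False  # already handled: do nothing on the duplicate
    demux_started.add(flow_cell_id)
    # ... here the real consumer would run demultiplexing for this flow cell
    return True

first = handle_flow_cell_copy_completed("HVKJCDRXX")
second = handle_flow_cell_copy_completed("HVKJCDRXX")
```

In production the set would have to be persistent (e.g. a status column in the database), since a restarted consumer must not forget what it already started.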
Here is a summary of all our crontab timers:

| Frequency | Description |
|---|---|
| Every Saturday at 02:05 | Microbial DB Backup |
| Daily at 00:00 | Upload processed cases to mutacc database |
| Daily at 03:10 | AWS DB Backup |
| Daily at 20:00 | Process solved cases with mutacc |
| Daily at 21:45 | Count Housekeeper files |
| Every 6th day-of-week at 20:00 | Delete files and empty dirs in customer inbox |
| Every 8th hour | Store completed microbial analyses in Housekeeper |
| Every 6th hour | Start all new microbial analyses |
| Every 4th hour | Check for received samples |
| | Check for prepared samples |
| | Check for delivered samples |
| | Check for delivered pools |
| | Check for received pools |
| Every 3rd hour | Upload results for MIP-DNA |
| | Upload results for BALSAMIC |
| | Upload results for BALSAMIC-UMI |
| Every 2nd hour | Store completed MIP-DNA analyses in Housekeeper |
| Every hour | Start analyses for BALSAMIC |
| | Store BALSAMIC analyses in Housekeeper |
| | Fetch ONE requested flowcell from PDC |
| | Start analyses for MIP-DNA |
| | Store completed MIP-RNA analyses |
| | Start analyses for MIP-RNA |
| | Start available analyses for mutant |
| | Store available analyses for mutant |
| Every 10 minutes | Create Novaseq demux sample sheet |
| | Start demultiplexing of all flow cells |
| | Start available analyses for Fluffy |
| | Store available analyses for Fluffy |
| Every 5 minutes | Scan for analyses |
Our systemd timers:

| Frequency | Description |
|---|---|
| Every 10 minutes | cg-archive-update-job-statuses.service |
| Every 10 minutes | cg-demultiplex-finish-all.service |
| Every 10 minutes | cg-demultiplex-create-illumina-manifest-files.service |
| Every 10 minutes | cg-demultiplex-create-nanopore-manifest-files.service |
| Every 10 minutes | cg-demultiplex-confirm-flow-cell-sync.service |
| Every 10 minutes | cg-demultiplex-copy-completed-flow-cell.service |
| Every 3rd hour | cg-upload-mip-rna.service |
| Every 3rd hour | cg-upload-microsalt.service |
| Daily at 00:00 | sql-backup-remove.service |
| Daily at 00:00 | cg-compress-fastq.service |
| Daily at 00:00 | mongo-backup.service |
| Daily at 00:00 | mongo-backup-remove.service |
| Daily at 00:00 | sql-clean-binlog.service |
| Daily at 00:00 | sql-backup.service |
| Daily at 00:00 | cg-clean-analysis-balsamic-qc.service |
| Daily at 00:00 | cg-clean-analysis-balsamic-umi.service |
| Daily at 00:00 | cg-clean-analysis-microsalt.service |
| Daily at 00:00 | cg-clean-analysis-mip-rna.service |
| Daily at 00:00 | cg-clean-analysis-mutant.service |
| Daily at 00:00 | cg-clean-rsync-dirs.service |
| Daily at 00:00 | cg-clean-analysis-balsamic.service |
| Daily at 00:00 | cg-clean-analysis-fluffy.service |
| Daily at 00:00 | cg-clean-analysis-rnafusion.service |
| Daily at 00:00 | cg-clean-analysis-mip-dna.service |
| Daily at 01:00 | cg-clean-retrieved-spring-files.service |
| Daily at 01:00 | cg-clean-scout-finished.service |
| Daily at 01:00 | cg-compress-clean-fastq.service |
| Daily at 01:00 | cg-upload-all-fastq.service |
| Daily at 01:00 | cg-upload-nipt-all.service |
| Daily at 01:00 | cg-upload-rnafusion.service |
| Daily at 01:00 | cg-workflow-rnafusion-start-available.service |
| Daily at 01:00 | cg-workflow-rnafusion-store-available.service |
| Daily at 01:00 | cg-workflow-tomte-start-available.service |
| Daily at 01:00 | cg-workflow-tomte-store-available.service |
| Daily at 09:00 | log-storage.service |
| Daily at 18:00 | cg-backup-encrypt-flow-cells.service |
| Daily at 18:15 | cg-backup-flow-cells.service |
| Daily at 19:00 | scout-load-research.service |
| Every Friday at 04:00 | singularity-mutant.service |
| Every Saturday at 09:00 | cg-backup-archive-spring-files.service |
| Every Saturday at 20:15 | clean-stage-analysis-dirs.service |
| Every Sunday at 08:00 | cg-clean-flow-cells.service |
| Every Sunday at 08:00 | cg-clean-hk-case-bundle-files.service |
| Every Monday at 01:00 | cg-clean-retrieved-spring-files.service |
Done 🥳
Think about the following questions and prepare brief answers for a presentation and discussion.