Clinical-Genomics / cg

Glue between Clinical Genomics apps

Calculation for sample reads does not account for low q30 score #3679

Closed: beatrizsavinhas closed this issue 1 month ago

beatrizsavinhas commented 1 month ago

Reads from lanes that have low q30 values should not be included when calculating sample reads, so that only reads of good quality are used for running analyses.

Acceptance Criteria

Notes

The current logic for updating Sample.reads sums all the reads in each lane, regardless of q30 values: https://github.com/Clinical-Genomics/cg/blob/7b1c4d1c81da518d56bc414c8e845fb340721556/cg/store/crud/update.py#L61-L68

When storing fastq files in housekeeper, though, only fastq files from lanes that pass the q30 threshold are stored: https://github.com/Clinical-Genomics/cg/blob/33e9f5e5ad0b2a1a6f0889407f6b259ca062cc5d/cg/services/illumina/post_processing/housekeeper_storage.py#L24-L52

Essentially, the count in Sample.reads does not correspond to the reads in the fastq files that we actually use for the analyses.
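To illustrate the mismatch, here is a minimal sketch, not the actual cg code. The `LaneMetrics` class, its field names, the function names, and the threshold value of 75 are all hypothetical and chosen only for the example; the real models and threshold live in the files linked above.

```python
from dataclasses import dataclass

# Hypothetical threshold, for illustration only; the real value is defined in cg.
Q30_THRESHOLD = 75.0


@dataclass
class LaneMetrics:
    """Hypothetical stand-in for the per-lane sequencing metrics of one sample."""

    lane: int
    reads: int
    q30_percent: float


def sum_all_reads(lanes: list[LaneMetrics]) -> int:
    """Sum reads over every lane, ignoring q30 (the behaviour described in this issue)."""
    return sum(lane.reads for lane in lanes)


def sum_passing_reads(lanes: list[LaneMetrics], threshold: float = Q30_THRESHOLD) -> int:
    """Sum reads only over lanes whose q30 meets the threshold (the requested behaviour)."""
    return sum(lane.reads for lane in lanes if lane.q30_percent >= threshold)


lanes = [
    LaneMetrics(lane=1, reads=1_000_000, q30_percent=92.1),
    LaneMetrics(lane=2, reads=800_000, q30_percent=60.3),  # fails the q30 threshold
    LaneMetrics(lane=3, reads=1_200_000, q30_percent=88.7),
]

print(sum_all_reads(lanes))      # counts lane 2 as well
print(sum_passing_reads(lanes))  # counts only lanes 1 and 3
```

With these made-up numbers, the unfiltered sum includes the 800,000 reads from lane 2, even though that lane's fastq files would not be stored in housekeeper.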

The previous logic, however, did take low q30 into account; see https://github.com/Clinical-Genomics/cg/issues/2197.

Implementation plan

diitaz93 commented 1 month ago

Previously (in b8ef794ab), we had:

https://github.com/Clinical-Genomics/cg/blob/b8ef794ab1250c337101f7634d82468b803be009/cg/store/crud/read.py#L394-L408

https://github.com/Clinical-Genomics/cg/blob/b8ef794ab1250c337101f7634d82468b803be009/cg/meta/demultiplex/status_db_storage_functions.py#L127-L138

The filters were not updated when changing to the new database models.