fls-bioinformatics-core / auto_process_ngs

Scripts and utilities for automatic processing & management of Illumina NGS sequencing data.

Fix QC pipeline to correctly handle single-ended SRA Fastq data #865

Closed pjbriggs closed 1 year ago

pjbriggs commented 1 year ago

Updates the QC pipeline to correctly handle sets of single-ended SRA Fastq files. Unlike paired-end SRA Fastqs, these have no explicit read number in the Fastq name (e.g. single-end SRR123456.fastq.gz versus paired-end R1 SRR123456_1.fastq.gz).

Without an explicit read number, the AnalysisFastq class (in the analysis module) returned the read number as None. This caused problems in QC pipeline tasks that filter out Fastqs unless their read number matches one of those assigned in the QC protocol (for example, the GetBamFiles task).
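The failure mode can be illustrated with a minimal sketch. This is not the pipeline's actual code: `filter_by_read_number` and the (name, read number) tuples are hypothetical stand-ins for how a task might select Fastqs against the reads listed in a QC protocol.

```python
# Hypothetical illustration only: not the real GetBamFiles task or
# AnalysisFastq class from auto_process_ngs.

def filter_by_read_number(fastqs, reads):
    """Keep only Fastq names whose read number is one of `reads`.

    `fastqs` is a list of (name, read_number) tuples.
    """
    return [name for name, read_number in fastqs if read_number in reads]

fastqs = [
    ("SRR123456.fastq.gz", None),   # single-end: no read number in the name
    ("SRR123457_1.fastq.gz", 1),    # paired-end R1
    ("SRR123457_2.fastq.gz", 2),    # paired-end R2
]

# A protocol that assigns read 1 silently drops the single-end file,
# because None never matches an assigned read number
kept = filter_by_read_number(fastqs, reads=(1,))
```

Here `kept` contains only the paired-end R1 file, so the single-end Fastq never reaches the downstream steps.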

An update to the AnalysisFastq class now sets the read number to 1 rather than None for single-ended SRA Fastqs, and sets a new flag implicit_read_number to indicate that this has been assumed.
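As a rough sketch of the behaviour described here (not the actual AnalysisFastq implementation), the function below parses an SRA-style Fastq name and falls back to read number 1 when none is present. The regular expression, the `SraFastqName` dataclass and `parse_sra_fastq_name` are assumptions made for illustration; only the `implicit_read_number` flag name comes from the description above.

```python
import re
from dataclasses import dataclass

# Assumed naming pattern: an SRR/ERR/DRR accession, an optional _1 or _2
# read suffix, then a .fastq or .fq extension with optional .gz
SRA_FASTQ = re.compile(
    r"^(?P<accession>[SED]RR\d+)(?:_(?P<read>[12]))?\.(?:fastq|fq)(?:\.gz)?$"
)

@dataclass
class SraFastqName:
    accession: str
    read_number: int
    implicit_read_number: bool  # True if the read number was assumed

def parse_sra_fastq_name(name):
    """Extract the accession and read number from an SRA Fastq name."""
    match = SRA_FASTQ.match(name)
    if match is None:
        raise ValueError(f"Not an SRA-style Fastq name: {name}")
    read = match.group("read")
    if read is None:
        # Single-ended: assume read 1 and record that it was implicit
        return SraFastqName(match.group("accession"), 1, True)
    return SraFastqName(match.group("accession"), int(read), False)

single = parse_sra_fastq_name("SRR123456.fastq.gz")
paired = parse_sra_fastq_name("SRR123456_1.fastq.gz")
```

With this approach `single.read_number` and `paired.read_number` are both 1, and callers that need to tell the two cases apart can check `implicit_read_number`.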

The update includes a pair of unit tests for the QC pipeline to check that both single-ended and paired-ended SRA Fastqs are now handled correctly.