PapenfussLab / gridss

GRIDSS: the Genomic Rearrangement IDentification Software Suite
Other
257 stars 71 forks source link

Empty base quality strings in fastq created during VIRUSBreakend for old bam files #603

Open toddajohnson opened 2 years ago

toddajohnson commented 2 years ago

I ran GRIDSS against some old paired normal-tumor HCC samples that we wanted to re-analyze, and for a couple of the sets, the quality score was noted by Fastqc as in Illumina 1.5 encoding. For those files, I passed them through seqkit convert to output modified fastq with Sanger encoding. Those files have been analyzed by bwa mem and biobambam2 bamsort & markdups, GRIDSS 2.13.2, and variant calling by SAGE 3.2 without any complaints. However, when I ran VIRUSBreakend, those two sample sets crash at the "Determining viral references to use" point due to "Exception in thread "main" htsjdk.samtools.SAMException: Missing Quality Line at line 5 in fastq /dev/fd/63". I checked the output of SAMPLE.bam.extract_reads.sh (gridsstools extractFragmentsToFastq), and the fastq files that should be input into the viral reference determination are generally missing the base quality string. Probably, this relates to the generally poor quality of the old data and long strings of "B" Illumina 1.5 quality codes that have been converted to strings of "!" in the modified fastq.

However, I have looked through the GRIDSS code that related to these steps and looked around HTSJDK SAMUtils, but I don't see where it might be expected to completely remove a base quality string because of runs of "!".

Here is an example: Reads exported using samtools:

@HWI-ST672_0075:6:25:9176:134054#0 TAGGGTTAGGGTTAGGGTTAGGGTTAGGGGTAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGGGACAGATATCGAATGCCGTCCTCTGCGTTAAAAAAGA + D@ED@DBDD?=.?3B4=+:3B@B/A=CA@+A:BB9<=DDD93<:'>?!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! @HWI-ST672_0075:5:41:11107:176617#0 GCCCAGGAACGTCCCAGTGTCAGACCCAAAGTCCTGGCACACCCCTCTGGCATCAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATACCGTATG + ABBB<@CA@ECECEFEFFFFFD@FFF8FFEBEE87@<<EB;EEBB?DD6A@EEEBEECAA6C?ADDDAEEBB>A??>@>>?DC?E?EE!!!!!!!!!!!

Reads from VIRUSBreakend fq files or from manually running SAMPLE.bam.extract_reads.sh:

@HWI-ST672_0075:6:25:9176:134054#0 TAGGGTTAGGGTTAGGGTTAGGGTTAGGGGTAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGGGACAGATATCGAATGCCGTCCTCTGCGTTAAAAAAGA +

@HWI-ST672_0075:5:41:11107:176617#0 GCCCAGGAACGTCCCAGTGTCAGACCCAAAGTCCTGGCACACCCCTCTGGCATCAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATACCGTATG +

Any ideas if there is a solution to this?

YingYa commented 1 year ago

The same issue as #577

When we got the issue 'Exception in thread "main" htsjdk.samtools.SAMException: Missing Quality Line at line XXXX in fastq /dev/fd/63', we could manually extract Fragments from BAM/CRAM To Fastq by other tools, for example:

cd /path/to/tumor.virusbreakend.vcf.gz.virusbreakend.working samtools view -h -u -@ 8 -N tumor.virusbreakend.vcf.gz.readnames.txt --reference /path/to/hg38.fasta /path/to/tumor.cram | samtools fastq -@ 8 -1 tumor.R1.fq -2 tumor.R2.fq -0 tumor.unpaired.fq -

Then rerun the virusbreakend shell. virusbreakend will start from the last breakpoint and use the manually extracted tumor.R1.fq, tumor.R2.fq and tumor.unpaired.fq.

Hope this could be helping.

jaesvi commented 8 months ago

I am finding the same error as @YingYa and @toddajohnson. The conflicting read is :

@A00537:899:HHGL2DSX7:4:1523:14796:1000
NGAGAGGAGTCTACCAGGAATCTTCTGCTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGCTAAAATCGCGGGTGGCGGATTGGGGGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
!FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF:,,:,,,,,,,,,:,,,,,,,,,,,::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

and it is extracted correctly using samtools fastq as @YingYa mentioned above. However from the gridss extract_reads command the resulting read is:

@A00537:899:HHGL2DSX7:4:1523:14796:1000
NGAGAGGAGTCTACCAGGAATCTTCTGCTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGCTAAAATCGCGGGTGGCGGATTGGGGGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+

My best bet is on the ! character on the quality line, since I can't find it on reads extracted by gridsstools:

> grep -c "!" READS_GRIDSS.viral.*.fq 
READS_GRIDSS.viral.R1.fq:0
READS_GRIDSS.viral.R2.fq:0
READS_GRIDSS.viral.unpaired.fq:0

and 7 times on the ones extracted by samtools fastq:

> grep -c "!" READS_SAMTOOLS.viral.*.fq 
READS_SAMTOOLS.viral.R1.fq:3
READS_SAMTOOLS.viral.R2.fq:4
READS_SAMTOOLS.bam.viral.unpaired.fq:0

Do you think you will have time to patch it @d-cameron ? Thanks!

toddajohnson commented 5 months ago

Started analyzing more old data and refound this problem.

It seems to come down to where VBE uses gridsstools' extractFragmentsToFastq or unmappedSequencesToFastq commands, both of which write the extracted sequences to the fastq file using src/main/c/gridsstools/fastq.c. I tried a bam that was giving normal output using samtools fastq but missing base quality score lines using grids stools extractFragmentsToFastq

The culprit seemed to be in: Line 33-46 of src/main/c/gridsstools/fastq.c:

char *qual = bam_get_qual(record);
if (qual && *qual != '\xff') {
    size_t qual_len = strnlen(qual, record->core.l_qseq);
    if (end <= qual_len) {
        if (bam_is_rev(record)) {
            for (i = end - 1; i >= start; i--) {
                kputc(FASTQ_PHRED_QUAL_OFFSET + qual[i], linebuf);
            }
        } else {
            for (i = start; i < end; i++) {
                kputc(FASTQ_PHRED_QUAL_OFFSET + qual[i], linebuf);
            }
        }
    }
}

On the third line above, the length of qual_len seems to be 0 when the qual string contains bases with 0 quality (Phred score "!"=33), since before converting (int value +33), those characters in the string are NUL values, which are not considered a valid string by strnlen. I did not debug far enough to see if that really is "0" in any positions or just returns 0 length when leading or trailing positions had the NUL value, but changing the third line to: size_t qual_len = record->core.l_qseq; resulted in the quality score lines being output and VBE running to completion. I did not see in the calling code from gridsstools any case where the qual variable returned from bam_get_qual would be different than that returned by record->core.l_qseq, but perhaps that line needs a modified test and not just my cludge to grab the length of the qseq.