No SR in FORMAT field for all SV calls

tgong1 commented 6 years ago

Hi,

I used Manta for somatic SV calling with normalBam and tumorBam. In the somatic.vcf file, the FORMAT field only has PR for all SV calls. Here is one example: chr1 3872266 MantaDEL:3:0:1:0:0:0 A <DEL> . PASS END=3882238;SVTYPE=DEL;SVLEN=-9972;IMPRECISE;CIPOS=-349,350;CIEND=-354,354;SOMATIC;SOMATICSCORE=333 PR 111,0 0,69

I also ran my normalBam and tumorBam separately and checked the diploidSV.vcf file. The FORMAT field does not have SR for tumorBam. Only 8 BND and 2 DEL calls have SR in the VCF file. Here is one example: chr1 1392432 MantaDEL:0:0:0:0:0:0 A <DEL> 216 PASS END=1392932;SVTYPE=DEL;SVLEN=-500;IMPRECISE;CIPOS=-336,337;CIEND=-339,339 GT:FT:GQ:PL:PR 1/1:PASS:40:269,43,0:0,15
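
A quick way to confirm this across a whole file, rather than by eye, is to look at the FORMAT column of each record. The following is a minimal Python sketch, assuming the standard VCF layout in which FORMAT is the ninth column. The helper names `format_keys` and `has_split_read_support` are made up for illustration, and the ALT value `<DEL>` in the sample record is an assumption.

```python
def format_keys(vcf_line):
    """Return the FORMAT keys (e.g. ['PR', 'SR']) of one VCF record."""
    # Real VCF files are tab-delimited; splitting on any whitespace also
    # handles records pasted with spaces, as in this thread.
    fields = vcf_line.split()
    if len(fields) < 9:
        return []  # no FORMAT column (sites-only record)
    return fields[8].split(":")


def has_split_read_support(vcf_line):
    """True if the record reports an SR entry in its FORMAT column."""
    return "SR" in format_keys(vcf_line)


# Record taken from this thread (ALT assumed to be <DEL>).
somatic = (
    "chr1 3872266 MantaDEL:3:0:1:0:0:0 A <DEL> . PASS "
    "END=3882238;SVTYPE=DEL;SVLEN=-9972;IMPRECISE;CIPOS=-349,350;"
    "CIEND=-354,354;SOMATIC;SOMATICSCORE=333 PR 111,0 0,69"
)
print(format_keys(somatic), has_split_read_support(somatic))
```

To scan a file, the same check could be applied to every line that does not start with `#`.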

This is a simulated sample. I have run Manta on both clinical samples and a simulated sample before, and all of those VCF files gave me both PR and SR. I'm wondering if this is due to some filtering parameters.

I also checked the stats/svLocusGraphStats.tsv file: EvidenceType_split_align is non-zero for both tumor and normal samples, while NotFilteredAndSemiAligned and EvidenceType_semalign are zero.

I hope you can give me some ideas on how to solve the problem.

Thank you very much, Tingting

x-chen commented 6 years ago

The calls you showed are imprecise calls, which don't have split read (SR) support, usually because the alternative allele contig failed to assemble. There is a more detailed explanation in the User Guide: https://github.com/Illumina/manta/tree/master/docs/userGuide

For each structural variant and indel, Manta attempts to assemble the breakends to basepair resolution and report the left-shifted breakend coordinate (per the VCF 4.1 SV reporting guidelines), together with any breakend homology sequence and/or inserted sequence between the breakends. It is often the case that the assembly will fail to provide a confident explanation of the data -- in such cases the variant will be reported as IMPRECISE, and scored according to the paired-end read evidence only.

tgong1 commented 6 years ago

Thank you for the reply. I checked using IGV and can see those split reads (with MAPQ>30). Previously, I ran Manta on similar simulated data with a lower depth of coverage and got SR reported and fewer IMPRECISE SV calls.

Are there any other reasons why Manta would report no SR support for SVs whose split reads can be seen in IGV?

Thank you very much, Tingting

x-chen commented 6 years ago

Which versions of Manta did you use before and now? What's the coverage difference before vs. now? You may consider using the --generateEvidenceBam option to generate evidence bams and check if the split reads you identified in IGV are included in the evidence bams.

tgong1 commented 6 years ago

Thank you very much for the quick reply. I used Manta 1.3.0 for all other samples. For this sample, I used both Manta 1.3.0 and 1.4.0, with the same results (no SR). The coverage is 60x tumor and 60x normal for this simulated matched tumor and normal sample. The previous one was 15x tumor and 15x normal. My other clinical samples are around 80x tumor and 40x matched blood, and those reported SR in the VCF. I also tried lower coverage by downsampling the same tumor bam and saw a few more SVs with SR reported (not many, only 4 or 7 SVs). I'm not sure if these are related.

Thank you again and I will try the option you suggested.

Thanks, Tingting

tgong1 commented 6 years ago

Hi, I checked the evidence bam in IGV, comparing it with the tumor bam. It contains only read-pair signatures, although in some discordant read pairs one of the reads is also a split read. The other split reads (or soft-clipped reads) were not in the evidence bam.

I checked those split reads. They have MAPQ>30 and are not supplementary/secondary alignments. Are there any other reasons why Manta did not use them?

Thanks, Tingting

x-chen commented 6 years ago

You may consider running Manta 1.4.0, with --generateEvidenceBam option, on both the old data set and the new data set.
Then you would be able to identify split reads in the evidence bam from the old data set.
By comparing those split reads with those you identified from IGV but not in the evidence bam from the new data set, you may be able to get some clue.

After you get such split reads, feel free to post a couple of examples from each data set if possible.

tgong1 commented 6 years ago

Hi, Thank you for the suggestions.

I ran Manta 1.4.0 on both the old data set (SR in vcf) and the new data set (no SR in vcf). I then checked some split-read records that are in both the evidence bam and the original bam from the old data set. I also checked some split-read records that are in the original bam but not in the evidence bam from the new data set. Some examples are in the attached file. I can't see significant differences in the alignment quality of the split reads that were not considered as evidence. But I did see a difference in the basecall quality scores of the records in the bam file. I'd like to mention that the base error rate is higher in the new data set (0.02) than in the old data set (0.01). I'm not sure if that is related. Could you again give me some ideas about this?

Manta_SR_problem.txt

x-chen commented 6 years ago

The basecall quality score matters. In the examples you showed, the reads with softclipped bases would be considered as split read evidence for an SV candidate. However, Manta requires those bases to have a basecall quality score >= 20. In the old examples, the basecall quality score was 20; but it was 17 in the new examples, which is why those reads were filtered out. The rationale is that if a basecall is of low confidence, then the evidence of poor alignment should be of low confidence as well.
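
To make the filter concrete, here is a rough Python sketch of checking soft-clipped bases against the threshold of 20 quoted above. It is not Manta's implementation: the helper names are hypothetical, the example reads are invented, and exactly how Manta aggregates the per-base check is not stated in this thread, so the sketch only extracts the per-base scores and counts how many pass.

```python
import re

MIN_BASE_QUALITY = 20  # threshold quoted in this thread


def softclip_qualities(cigar, qual):
    """Return (leading, trailing) phred scores of the soft-clipped bases."""
    scores = [ord(ch) - 33 for ch in qual]  # SAM QUAL is phred+33
    # Hard-clipped bases (H) are absent from SEQ/QUAL, so drop those ops.
    ops = [(int(n), op)
           for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
           if op != "H"]
    leading = scores[:ops[0][0]] if ops and ops[0][1] == "S" else []
    trailing = scores[-ops[-1][0]:] if len(ops) > 1 and ops[-1][1] == "S" else []
    return leading, trailing


def count_passing(scores, threshold=MIN_BASE_QUALITY):
    """Count bases whose phred score meets the threshold."""
    return sum(1 for q in scores if q >= threshold)


# Invented reads: 5 soft-clipped bases followed by 10 aligned bases.
old_lead, _ = softclip_qualities("5S10M", "5" * 5 + "I" * 10)  # '5' encodes Q20
new_lead, _ = softclip_qualities("5S10M", "2" * 5 + "I" * 10)  # '2' encodes Q17
print(count_passing(old_lead), count_passing(new_lead))  # prints: 5 0
```

With a uniform Q20 every soft-clipped base passes, while with a uniform Q17 none do, which is consistent with the old reads being kept and the new ones being dropped.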

tgong1 commented 6 years ago

Hi, Thank you very much for all the help. How do you interpret the basecall quality score from the read records in the BAM? I will check the settings of the short read simulator to see if the basecall quality score in the BAM is a true reflection of the base quality. Thank you again, Tingting

x-chen commented 6 years ago

Please refer to the SAM format document: https://samtools.github.io/hts-specs/SAMv1.pdf

QUAL: ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format). A base quality is the phred-scaled base error probability which equals −10 log10 Pr{base is wrong}. This field can be a ‘*’ when quality is not stored. If not a ‘*’, SEQ must not be a ‘*’ and the length of the quality string ought to equal the length of SEQ.
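
As a worked example of that definition, the sketch below decodes a QUAL string and converts between phred scores and error probabilities in plain Python (the function names are illustrative only). Assuming the simulator writes a uniform quality derived from its base error rate, 0.01 corresponds to Q20 and 0.02 to about Q17, which would match the scores seen in the old and new examples.

```python
import math


def decode_qual(qual):
    """Decode a SAM QUAL string (phred+33) into a list of phred scores."""
    if qual == "*":
        return []  # '*' means quality is not stored
    return [ord(ch) - 33 for ch in qual]


def phred_to_error_prob(q):
    """Pr{base is wrong} for a phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p: -10 * log10(p)."""
    return -10 * math.log10(p)


print(decode_qual("25I"))  # prints: [17, 20, 40]
print(round(error_prob_to_phred(0.01)), round(error_prob_to_phred(0.02)))  # prints: 20 17
```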

No SR in FORMAT field for all SV calls #160