Griffan / VerifyBamID

VerifyBamID2: A robust tool for DNA contamination estimation from sequence reads using ancestry-agnostic method.
http://griffan.github.io/VerifyBamID/
92 stars 15 forks source link

VBID2 not seeing read evidence even though it is present. #23

Closed yfarjoun closed 3 years ago

yfarjoun commented 3 years ago

I'm running VBID on single-ended data with a depth of about 16x (considering only high mapping and quality bases) but it's resulting in estimated contamination of 50% and reporting zero coverage and zero liklihoods (both LK0 and LK1)

SEQ_ID RG CHIP_ID #SNPS #READS AVG_DP FREEMIX FREELK1 FREELK0 FREE_RH FREE_RA CHIPMIX CHIPLK1 CHIPLK0 CHIP_RH CHIP_RA DPREF RDPHET RDPALT

sm1 NA NA 99976 0 0 0.5 -0 -0 NA NA NA NA NA NA NA NA NA NA

This is clearly not true as I see good coverage over the contamination sites, so I'm wondering if VBID filters the reads for some other reason. for example, I know that these reads are from short inserts and so they all underwent soft clipping during alignment, but I don't know if this would affect VBID (the 16x coverage I reference is not counting soft-clipped bases, of course.

here's an example of a contamination site with coverage in IGV: image

Any help understanding this would be greatly appreciated!!!

hyunminkang commented 3 years ago

Does samtools view work well? I wonder what happens if you run samtools view in that region.

Another possible reason could be that chromosome names were codified differently between VCF and BAM/CRAM?

Hyun.

Hyun Min Kang, Ph.D. Associate Professor of Biostatistics University of Michigan, Ann Arbor Email : hmkang@umich.edu

On Fri, Nov 20, 2020 at 3:56 PM Yossi Farjoun notifications@github.com wrote:

I'm running VBID on single-ended data with a depth of about 16x (considering only high mapping and quality bases) but it's given me a "low coverage" error:

Found zero likelihoods. Bam is either very-very shallow, or aligned to the wrong reference (relative to the vcf)

This is clearly not true as I see good coverage over the contamination sites, so I'm wondering if VBID filters the reads for some other reason. for example, I know that these reads are from short inserts and so they all underwent soft clipping during alignment, but I don't know if this would affect VBID (the 16x coverage I reference is not counting soft-clipped bases, of course.

here's an example of a contamination site with coverage in IGV: [image: image] https://user-images.githubusercontent.com/2745554/99849005-c6548680-2b48-11eb-82de-9410b7440b49.png

Any help understanding this would be greatly appreciated!!!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Griffan/VerifyBamID/issues/23, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPY5OJUBV5JW5Q53OACKULSQ3JWJANCNFSM4T5H2HAQ .

yfarjoun commented 3 years ago

Thanks for getting back so quickly @hyunminkang

here's a snippet of reads over one of the snps:

0074087028  16  chr1    173652816   50  176S104M22S *   0   0   GTATTGCTGCTATTGTCTGCTATTGATTGTGATATGTCTGTCTGATACTGCTATAGTACTGCTATGTGTCCCTATTCCCCCCTGTGTTGCCCTTGGGCCCAGTCTTCCAGGCTCTCTATGGGGCCAGTCGGTGATCAGACGTGTGCTCTTCCGATCTTGTCTCGTTGATCTGTCTATAGTCTTGGGAGGATGTATGTGTCCAGGAATTTATCCATTTATTCTAGATTTTCTAGTTTATTTGTGTAGAGGTGTTTATAGTATTGACTGATGGTAGTTTGTAGATCGGAAGAGCGTCGTGTAGG  *'(1(-.%'.'/)**)+(-/*+++1.2;)3/+0.*0/195I205/2.557:I5I4II2IIIII:IIIII-,-4+22&.>>.&%.&+55"(3(==:::/7/)$1I//33I;;7I4/*II95;;511II9#@@.2I9(III,III68IIIII))7IIIIIIII/IIIIIIIIIIIIIIIIIIIIIFIFI==IIIIIIIIII,,IAAIIIIIIIBBIIIIIIIIIIII.II.IIII9I9ICICIIIII;I<<IIDIDIIIIIIIIIIII9I/IIIII2I2IIIIIII22IIII$2I,II:I6588  RG:Z:BC94   AS:i:104
0038515860  16  chr1    173652836   60  180S75M1D47M23S *   0   0   GTGCCTTGGCAGTCTCAGCTCTCTATGGGCAGTCGTGATCAGACGTGTGGCTCTTCGATCTACCAGAAATTATAAACTACAAGAATCATAGGCAAAATCTTCATGACATTGGTCTTGGTGATGATTTTTTTTTCAGACTCCAAAGGCACAGGTACAAAAGCACAGTTAGACAAATGGCATTGTCCAGGAATTTATCCATTTATTCTAGATTTTCTAGTTTATTTGTGTAGAGGTGTTTATAGTATTGACTGATGGAGTTTGTATTTCTGTGGGATCGGTGGTGACATCCCCTTTATCATTTTAGATCGGAAGAGCGTCGTGTAGG   *33..7766,*II5/3I8IIIIIIII595&I.I;")7I2'I9+&7III++I*"::$5I#+8---.I.4.11:"%0%III)88I&&;I;II1183773"I::IIIII%IEEAA7III..I4IIII*-/III/-*5IIIII,,2I2==0-IIII6$I5775IIIIII==I,4I9C9I..IIIIII44I22;;>D>II>>I8=8I++III6I9DD9IIII25249I9I4IIIII;;II9C9I5II7IIIII54IIIIIII9I9III.I.IIII5I5IIIII9IIIII;II0550*I*II6I*II*III2IIIIIII(II#9III&455   SA:Z:chr3,91329203,+,144S27M1D21M1D48M1D9M1D14M62S,18,5;    RG:Z:BC94   AS:i:115
0109640289  16  chr1    173652900   60  52S44M1D147M96S *   0   0   CTCTCAGCTCTCTATGGGCAGTCGGTGACTCAGACGTGTGCTCTTCCGATCTTTGACTGATGGTAGTTTGTATTTCTGTGGGATCGGTGGTGACATCCCTTTATCATTTTTTATTGTGTCTATTTGATTCTTCTCTCTTTTCTTCTTTATTAGTCTTGTTAGCAGTCTATCAATTTTGTGGATCTTTGCAAAAAAACAAGTCCTGTATTCATTGATTTTTTTGAAGGGTTTTTTGTGTCTCTACACCTCTACACAAATAAACTAGAAAATCTAGAATAAATGGATAAATTCCTGGACACATACATCCTCCCAAGACAGATCGGAAGAGCGTCGTGTAGG ,039II$9'9;;;7I9<92I+2:88,IIII*8+:3I2III*55//118I&I/I/IIIIII3CC2II<I<III797IIII:>:I4I??I<<IIIII:%4%6I6I.II-4774-I==IIIIIIIFIFII88IIIII3II1II1I33ICICI//IIIIIIIIIIIIIIIIIIIIII8II8IIEEIIIIIIII*/4I4/*I>>IIIIIIIIEE:IIIII+-4I4-+I336I603II30IIIIIIIIIIIHH+IIIIIIIIIIIIIIIII=II=IIIIIAAIIIIIIIIIGIGAAEEIIIIII9IIIIIIIII*I*IIII7IIIIIEEHHII+II,6III-877 SA:Z:chr1,173652818,+,23S74M242S,12,0;  RG:Z:BC94   AS:i:184
0140761466  16  chr1    173652901   60  154S137M1I19M24S    *   0   0   GTGCCTTGGCAGTCTCAGCTCTCTATGGGCAGTCGGTGATCAGACGTGTGCTCTCCGATCTTGAAGAGATTGTCCTTTCCCAATATACGTCCTAGCTACCTTTGTCAAAAAAACAGTTCGCTGTAGATGTATAGGTTTATTTCTGGGTTCTCCATGACTGATGGTAGTTTGTATTTCTGTGGGATCGGTGGTGACATCCCCTTTATCATTTTTTATTGTGTCTATTTGATTCTTCTCTCTTTTCTTCTTTATTAGTCTTGTTAGCAGTCTATCAATTTTGTGGATCTTTGCAAAAAAAACAAGTCCTGTATAGATCGGAAGAGCGTCGTGTAGGA &1433IICC-IIIIIII6IIIIIIII9I93IIII99III;III.IIIIIIIII'44II2IIIIIIIIII--I,66;I;'C'IIIIII+I,//IIIIII99IIIII8(,9I9,(8II88IIIII8IIIII3IIIII)I)IGIGII:I:::IIEEIIIIIIIIIII1IIIIIIII@I@IIIIDIDIIIIII;;IIIIII2222=I=IIII09II90IIIIIIIIIIIIIII77IIIIIIII;II;IIII9I9IIIIIIIIIIIIIIIIIIIIIIIII6II6II4497I?I?II-/IIII/-+==IIIIII8IIIIIII??IIII#II,IIIII*112 SA:Z:chr1,173776665,-,60S18M1D11M1D7M1I58M180S,60,3;    RG:Z:BC94   AS:i:149

Here's the command-line that was used (there's no VCF...I'm not sure what to make of your vcf comment...but perhaps you are referring to all the bed files)

/usr/gitc/VerifyBamID \
--Verbose \
--NumPC 4 \
--Output out_file \
--BamFile input_file.aligned.sorted.duplicates_marked.bam \
--Reference /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta \
--UDPath /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.contam.UD \
--MeanPath /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.contam.mu \
--BedPath /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.contam.bed \
1>/dev/null

The resources files are from gs://gcp-public-data--broad-references and are on the correct reference...this exact pipeline works for many samples....just crashes on this one input file.

hyunminkang commented 3 years ago

It seems that the code ignores all reads not properly paired.. and there is no option to disable it.

@Griffan - could you confirm? Is there a way to fix without modifying the source?

Hyun.

Hyun Min Kang, Ph.D. Associate Professor of Biostatistics University of Michigan, Ann Arbor Email : hmkang@umich.edu

On Fri, Nov 20, 2020 at 5:45 PM Yossi Farjoun notifications@github.com wrote:

Thanks for getting back so quickly @hyunminkang https://github.com/hyunminkang

here's a snippet of reads over one of the snps:

0074087028 16 chr1 173652816 50 176S104M22S 0 0 GTATTGCTGCTATTGTCTGCTATTGATTGTGATATGTCTGTCTGATACTGCTATAGTACTGCTATGTGTCCCTATTCCCCCCTGTGTTGCCCTTGGGCCCAGTCTTCCAGGCTCTCTATGGGGCCAGTCGGTGATCAGACGTGTGCTCTTCCGATCTTGTCTCGTTGATCTGTCTATAGTCTTGGGAGGATGTATGTGTCCAGGAATTTATCCATTTATTCTAGATTTTCTAGTTTATTTGTGTAGAGGTGTTTATAGTATTGACTGATGGTAGTTTGTAGATCGGAAGAGCGTCGTGTAGG '(1(-.%'.'/)*)+(-/+++1.2;)3/+0.0/195I205/2.557:I5I4II2IIIII:IIIII-,-4+22&.>>.&%.&+55"(3(==:::/7/)$1I//33I;;7I4/II95;;511II9#@@.2I9(III,III68IIIII))7IIIIIIII/IIIIIIIIIIIIIIIIIIIIIFIFI==IIIIIIIIII,,IAAIIIIIIIBBIIIIIIIIIIII.II.IIII9I9ICICIIIII;I<<IIDIDIIIIIIIIIIII9I/IIIII2I2IIIIIII22IIII$2I,II:I6588 RG:Z:BC94 AS:i:104 0038515860 16 chr1 173652836 60 180S75M1D47M23S 0 0 GTGCCTTGGCAGTCTCAGCTCTCTATGGGCAGTCGTGATCAGACGTGTGGCTCTTCGATCTACCAGAAATTATAAACTACAAGAATCATAGGCAAAATCTTCATGACATTGGTCTTGGTGATGATTTTTTTTTCAGACTCCAAAGGCACAGGTACAAAAGCACAGTTAGACAAATGGCATTGTCCAGGAATTTATCCATTTATTCTAGATTTTCTAGTTTATTTGTGTAGAGGTGTTTATAGTATTGACTGATGGAGTTTGTATTTCTGTGGGATCGGTGGTGACATCCCCTTTATCATTTTAGATCGGAAGAGCGTCGTGTAGG 33..7766,II5/3I8IIIIIIII595&I.I;")7I2'I9+&7III++I"::$5I#+8---.I.4.11:"%0%III)88I&&;I;II1183773"I::IIIII%IEEAA7III..I4IIII-/III/-5IIIII,,2I2==0-IIII6$I5775IIIIII==I,4I9C9I..IIIIII44I22;;>D>II>>I8=8I++III6I9DD9IIII25249I9I4IIIII;;II9C9I5II7IIIII54IIIIIII9I9III.I.IIII5I5IIIII9IIIII;II0550III6IIIIII2IIIIIII(II#9III&455 SA:Z:chr3,91329203,+,144S27M1D21M1D48M1D9M1D14M62S,18,5; RG:Z:BC94 AS:i:115 0109640289 16 chr1 173652900 60 52S44M1D147M96S 0 0 CTCTCAGCTCTCTATGGGCAGTCGGTGACTCAGACGTGTGCTCTTCCGATCTTTGACTGATGGTAGTTTGTATTTCTGTGGGATCGGTGGTGACATCCCTTTATCATTTTTTATTGTGTCTATTTGATTCTTCTCTCTTTTCTTCTTTATTAGTCTTGTTAGCAGTCTATCAATTTTGTGGATCTTTGCAAAAAAACAAGTCCTGTATTCATTGATTTTTTTGAAGGGTTTTTTGTGTCTCTACACCTCTACACAAATAAACTAGAAAATCTAGAATAAATGGATAAATTCCTGGACACATACATCCTCCCAAGACAGATCGGAAGAGCGTCGTGTAGG ,039II$9'9;;;7I9<92I+2:88,IIII8+:3I2III55//118I&I/I/IIIIII3CC2II<I:I4I??I<<IIIII:%4%6I6I.II-4774-I==IIIIIIIFIFII88IIIII3II1II1I33ICICI//IIIIIIIIIIIIIIIIIIIIII8II8IIEEIIIIIIII/4I4/I>>IIIIIIIIEE:IIIII+-4I4-+I336I603II30IIIIIIIIIIIHH+IIIIIIIIIIIIIIIII=II=IIIIIAAIIIIIIIIIGIGAAEEIIIIII9IIIIIIIIIIIIII7IIIIIEEHHII+II,6III-877 SA:Z:chr1,173652818,+,23S74M242S,12,0; RG:Z:BC94 AS:i:184 0140761466 16 chr1 173652901 60 154S137M1I19M24S 0 0 GTGCCTTGGCAGTCTCAGCTCTCTATGGGCAGTCGGTGATCAGACGTGTGCTCTCCGATCTTGAAGAGATTGTCCTTTCCCAATATACGTCCTAGCTACCTTTGTCAAAAAAACAGTTCGCTGTAGATGTATAGGTTTATTTCTGGGTTCTCCATGACTGATGGTAGTTTGTATTTCTGTGGGATCGGTGGTGACATCCCCTTTATCATTTTTTATTGTGTCTATTTGATTCTTCTCTCTTTTCTTCTTTATTAGTCTTGTTAGCAGTCTATCAATTTTGTGGATCTTTGCAAAAAAAACAAGTCCTGTATAGATCGGAAGAGCGTCGTGTAGGA &1433IICC-IIIIIII6IIIIIIII9I93IIII99III;III.IIIIIIIII'44II2IIIIIIIIII--I,66;I;'C'IIIIII+I,//IIIIII99IIIII8(,9I9,(8II88IIIII8IIIII3IIIII)I)IGIGII:I:::IIEEIIIIIIIIIII1IIIIIIII@I@IIIIDIDIIIIII;;IIIIII2222=I=IIII09II90IIIIIIIIIIIIIII77IIIIIIII;II;IIII9I9IIIIIIIIIIIIIIIIIIIIIIIII6II6II4497I?I?II-/IIII/-+==IIIIII8IIIIIII??IIII#II,IIIII*112 SA:Z:chr1,173776665,-,60S18M1D11M1D7M1I58M180S,60,3; RG:Z:BC94 AS:i:149

Here's the command-line that was used (there's no VCF...I'm not sure what to make of your vcf comment...but perhaps you are referring to all the bed files)

/usr/gitc/VerifyBamID \ --Verbose \ --NumPC 4 \ --Output out_file \ --BamFile input_file.aligned.sorted.duplicates_marked.bam \ --Reference /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta \ --UDPath /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.contam.UD \ --MeanPath /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.contam.mu \ --BedPath /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.contam.bed \ 1>/dev/null

The resources files are from gs://gcp-public-data--broad-references and are on the correct reference...this exact pipeline works for many samples....just crashes on this one input file.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Griffan/VerifyBamID/issues/23#issuecomment-731445362, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPY5ONM42UXGKSCJXKQO5TSQ3WQZANCNFSM4T5H2HAQ .

Griffan commented 3 years ago

@yfarjoun and @hyunminkang

"It seems that the code ignores all reads not properly paired.. and there is no option to disable it."

From the source code you can see VB2 doesn't filter reads by pair-end status. I tested this on single-end low-coverage bam files from 1000g which gives expected results. (There is a option --OutputPileup for debugging purpose, could you confirm if you can see any bases there?)

As for soft-clip, the CIGAR string are all handled by htslib api, VB2 takes over from returned pileup structures returned by mpileup api, which shouldn't be the problem but I will keep looking.

Also, in VB2 code there is no warning message:

"Found zero likelihoods. Bam is either very-very shallow, or aligned to the wrong reference (relative to the vcf)"

Could you post your executable's version number? Could you also confirm that this is the only sample that failed with this "single-end long reads" setting, among other samples with the same type of data setting?

Thanks!

hyunminkang commented 3 years ago

Thanks @Grffian for the clarification. Sorry for the false alarm.

Hyun Min Kang, Ph.D. Associate Professor of Biostatistics University of Michigan, Ann Arbor Email : hmkang@umich.edu

On Sat, Nov 21, 2020 at 4:47 AM Griffan(Fan Zhang) notifications@github.com wrote:

@yfarjoun https://github.com/yfarjoun and @hyunminkang https://github.com/hyunminkang

"It seems that the code ignores all reads not properly paired.. and there is no option to disable it."

From the source code you can see VB2 doesn't filter reads by pair-end status. I tested this on single-end low-coverage bam files from 1000g which gives expected results. (There is a option --OutputPileup for debugging purpose, could you confirm if you can see any bases there?)

As for soft-clip, the CIGAR string are all handled by htslib api, VB2 takes over from returned pileup structures returned by mpileup api, which shouldn't be the problem but I will keep looking.

Also, in VB2 code there is no warning message:

"Found zero likelihoods. Bam is either very-very shallow, or aligned to the wrong reference (relative to the vcf)"

Could you post your executable's version number? Could you also confirm that this is the only sample that failed with this "single-end long reads" setting, among other samples with the same type of data setting?

Thanks!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Griffan/VerifyBamID/issues/23#issuecomment-731555592, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPY5ON5EGNIGDJTBMEMR43SQ6ECPANCNFSM4T5H2HAQ .

yfarjoun commented 3 years ago

right...sorry about that. the error message I quoted was from the downstream part of the script and I got confused. That message however is triggered by both LK0 and LK1 being equal to zero, which implies, I think, that there is no evidence for anything, however, there are clearly reads present over the SNPs and so I'm still confused as to what is happening.

I updated the title and description of this issue to reflect this.

Griffan commented 3 years ago

You mentioned in your first comment that VB2 crashes on this sample, could you post the log info printed by VB2?

yfarjoun commented 3 years ago

I corrected my post. VBID doesn't crash, but it reports no coverage in the report,

Griffan commented 3 years ago

Hi, @yfarjoun Having looked into this problem further, I found out that it's due to the function of --adjust-MQ option defined by mpileup function.

-C, --adjust-MQ INT

Coefficient for downgrading mapping quality for reads containing excessive mismatches. Given a read with a phred-scaled probability q of being generated from the mapped position, the new mapping quality is about sqrt((INT-q)/INT)*INT. A zero value disables this functionality; if enabled, the recommended value for BWA is 50.

By default, VB2 enabled this function following the recommendation of the use case of BWA, and also I didn't take into account the scenarios where reads have unusually long soft-clips. The formula inside this function penalizes soft-clips heavily to the extent that even if there is no mismatches the adjusted MQ will be larger than the threshold and directly discard the read.

I have pushed a commit to move this option to the cmdline. Now you can specify "--adjust-MQ 0" to disable this functionality.

yfarjoun commented 3 years ago

Thanks! I’ll test this out.

On Thu, Nov 26, 2020 at 1:13 PM Griffan(Fan Zhang) notifications@github.com wrote:

Hi, @yfarjoun https://github.com/yfarjoun Having looked into this problem further, I found out that it's due to the function of --adjust-MQ option defined by mpileup function.

-C, --adjust-MQ INT

Coefficient for downgrading mapping quality for reads containing excessive mismatches. Given a read with a phred-scaled probability q of being generated from the mapped position, the new mapping quality is about sqrt((INT-q)/INT)*INT. A zero value disables this functionality; if enabled, the recommended value for BWA is 50.

By default, VB2 enabled this function following the recommendation of the use case of BWA, and also I didn't take into account the scenarios where reads have unusually long soft-clips. The formula inside this function penalizes soft-clips heavily to the extent that even if there is no mismatches the adjusted MQ will be larger than the threshold and directly discard the read.

I have pushed a commit to move this option to the cmdline. Now you can specify "--adjust-MQ 0" to disable this functionality.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Griffan/VerifyBamID/issues/23#issuecomment-734431862, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAU6JUQCPSON4ZUEPVU6KYDSR2LD5ANCNFSM4T5H2HAQ .