chollenbeck / rad_haplotyper

MIT License
7 stars 5 forks source link

Potentially erroneous "failed" haplotyping calls #31

Open cbird808 opened 5 years ago

cbird808 commented 5 years ago

I’ve been noticing the following issues with radhaplotyper for a while and am finally getting around to asking about it. I’ve listed one example for each of several different undesirable decisions made by radhaplotyper (in my opinion) with the explanation by rad hap.

My settings are:

perl $RADHAP_SCRIPT -v ${Indexing}.vcf -x 1 -e -d ${THRESHOLDa} -mp ${THRESHOLDb} -u ${THRESHOLDc} -ml ${THRESHOLDd} -h ${THRESHOLDe} -z ${THRESHOLDf} -m ${THRESHOLDg} -r ${REF_FILE} -bp ${BAM_PATH} -p ${PopMap}$CutoffCode -o ${VCF_OUT}.Fltr${FILTER_ID}.Haplotyped.vcf #-g ${VCF_OUT}.Fltr${FILTER_ID}.${PopMap##/}.haps.genepop -a ${VCF_OUT}.Fltr${FILTER_ID}.${PopMap##/}.haps.ima

where the THRESHOLD variables are set, in order, according to: 19 rad_haplotyper -d 50
19 rad_haplotyper -mp 10
19 rad_haplotyper -u 30
19 rad_haplotyper -ml 10
19 rad_haplotyper -h 100
19 rad_haplotyper -z 0.2
19 rad_haplotyper -m 0.6

I added the -bp option to radhaplotyper for convenience. I’m also wrapping this into a gnu parallel command that divides the vcf into $NumProc pieces (by row), thus the -x 1 option. The rest should be self-explanatory.

*Note: I did branch this in github then sent a req for it to be merged into the official radhaplotyper, but it was not. branched v 1.1.9

HOMOZYG with 1 alt read REJECTED due to detection of no haps dDocent_Contig_11076: Observed Haps: $VAR1 = { 'CTCTCTCGGCATGATATAT' => 11, 'CTCTCTCGGCAGGATATAT' => 1 }; dDocent_Contig_11076: Unique Observed Haps: $VAR1 = []; dDocent_Contig_11076: Unable to rescue dDocent_Contig_11076 $VAR1 = { 'CTCTCTCGGCAGGATATAT' => 1, 'CTCTCTCGGCATGATATAT' => 11 }; Expected haplotypes: 1 Observed haplotypes: 0 Failed again...

HOMOZYG with 1 alt read REJECTED due to expectation of HETEROZYG: dDocent_Contig_11314: Unable to rescue dDocent_Contig_11314 $VAR1 = { 'ATCGAATCTGTCCAG' => 1, 'ATTGAATCTGTCCGG' => 14 }; Expected haplotypes: 2 Observed haplotypes: 1 Failed, trying to recover...

HOMOZYG with more noise REJECTED due to expectation of HETEROZYG: dDocent_Contig_11383: Unable to rescue dDocent_Contig_11383 $VAR1 = { 'TCATATATAGGTGTTTGA' => 1, 'TCATATAAAGGTGTTTGA' => 1, 'TCATATAGAGGTGGTTGA' => 1, 'TCATATAGCGGTGTTTGA' => 1, 'TCATATAGAGGTGTTTGT' => 2, 'TCATATAGAGGTGTTTGA' => 61 }; Expected haplotypes: 2 Observed haplotypes: 1 Failed again...

HOMOZYG low cov REJECTED due to expectation of HETEROZYG: dDocent_Contig_10963: Observed Haps: $VAR1 = { 'CAGGATGATTCTAGTCCTGCAAGTCCT' => 5 }; dDocent_Contig_10963: Unique Observed Haps: $VAR1 = [ 'CAGGATGATTCTAGTCCTGCAAGTCCT' ]; dDocent_Contig_10963: Unable to rescue dDocent_Contig_10963 $VAR1 = { 'CAGGATGATTCTAGTCCTGCAAGTCCT' => 5 }; Expected haplotypes: 2 Observed haplotypes: 1 Failed again...

HETEROZYG low cov called HOMOZYG: dDocent_Contig_11076: Observed Haps: $VAR1 = { 'CTCCCTCGGCATGATATAT' => 3, 'CCTCCTCGGCATAATATAT' => 3 }; dDocent_Contig_11076: Unique Observed Haps: $VAR1 = [ 'CTCCCTCGGCATGATATAT' ]; dDocent_Contig_11076: Unable to rescue

HETEROZYG low cov w/ noise called TRIPLOID: dDocent_Contig_11371: Observed Haps: $VAR1 = { 'GCGCACAGGGTGGTACGAG' => 4, 'TCGCACAGGGTAGTACGAG' => 4, 'GCGCACAGGGTAGTACGAG' => 1 }; dDocent_Contig_11371: Unique Observed Haps: $VAR1 = [ 'GCGCACAGGGTAGTACGAG', 'TCGCACAGGGTAGTACGAG', 'GCGCACAGGGTGGTACGAG' ]; dDocent_Contig_11371: Problem- trying to fix... dDocent_Contig_11371: haplotype count threshold:1.8 dDocent_Contig_11371: Corrected Unique Observed Haps: $VAR1 = [ 'GCGCACAGGGTGGTACGAG', 'GCGCACAGGGTAGTACGAG', 'TCGCACAGGGTAGTACGAG' ]; dDocent_Contig_11371: Unable to rescue dDocent_Contig_11371 $VAR1 = { 'GCGCACAGGGTGGTACGAG' => 4, 'GCGCACAGGGTAGTACGAG' => 1, 'TCGCACAGGGTAGTACGAG' => 4 }; Expected haplotypes: 2 Observed haplotypes: 3 Failed again...

HETEROZYG w/ noise called TRIPLOID dDocent_Contig_11076: Observed Haps: $VAR1 = { 'CTCCCTCGGCATGATATAT' => 8, 'TCTACTGGGCATGATGCTC' => 9, 'TCTAGTGGGGTTGATGCTC' => 2 }; dDocent_Contig_11076: Unique Observed Haps: $VAR1 = [ 'CTCCCTCGGCATGATATAT', 'TCTACTGGGCATGATGCTC', 'TCTAGTGGGGTTGATGCTC' ]; dDocent_Contig_11076: Problem- trying to fix... dDocent_Contig_11076: haplotype count threshold:3.8 dDocent_Contig_11076: Corrected Unique Observed Haps: $VAR1 = [ 'CTCCCTCGGCATGATATAT', 'TCTAGTGGGGTTGATGCTC', 'TCTACTGGGCATGATGCTC' ]; dDocent_Contig_11076: Unable to rescue dDocent_Contig_11076 $VAR1 = { 'TCTAGTGGGGTTGATGCTC' => 2, 'TCTACTGGGCATGATGCTC' => 9, 'CTCCCTCGGCATGATATAT' => 8 }; Expected haplotypes: 2 Observed haplotypes: 3 Failed again...

HETEROZYG w/ noise called HOMOZYG: dDocent_Contig_11002: Observed Haps: $VAR1 = { 'GAGGGACCGCCGAACAGCTCTTCGTCTCG' => 17, 'GAGGGCCCGCAGAACGACTCTTCGTCGCG' => 1, 'GAGGGCCCGCAGAACGGCTCTTCGTCGCG' => 20, 'GAGGGACCGCCGAACAACTCTTCGTCTCG' => 1, 'GAGGGACCGCCGAACAGCTCTTCGGCTCG' => 1 }; dDocent_Contig_11002: Unique Observed Haps: $VAR1 = [ 'GAGGGCCCGCAGAACGGCTCTTCGTCGCG' ]; dDocent_Contig_11002: Unable to rescue dDocent_Contig_11002 $VAR1 = { 'GAGGGCCCGCAGAACGACTCTTCGTCGCG' => 1, 'GAGGGACCGCCGAACAGCTCTTCGTCTCG' => 17, 'GAGGGACCGCCGAACAACTCTTCGTCTCG' => 1, 'GAGGGCCCGCAGAACGGCTCTTCGTCGCG' => 20, 'GAGGGACCGCCGAACAGCTCTTCGGCTCG' => 1 }; Expected haplotypes: 2 Observed haplotypes: 1 Failed, trying to recover...

HETEROZYG called HOMOZYG: dDocent_Contig_11132: Observed Haps: $VAR1 = { 'TCAAGC' => 14, 'GAGAGA' => 13 }; dDocent_Contig_11132: Unique Observed Haps: $VAR1 = [ 'TCAAGC' ]; dDocent_Contig_11132: Unable to rescue dDocent_Contig_11132 $VAR1 = { 'GAGAGA' => 13, 'TCAAGC' => 14 }; Expected haplotypes: 2 Observed haplotypes: 1 Failed again...

chollenbeck commented 5 years ago

Hi Chris,

Is there any chance you could isolate some of these into a small BAM file? If you could send that along with the VCF, I can take a look at these. It may be the case for some of the examples - especially when Observed Haps and Unique Observed Haps are different- it is failing because some of the haplotypes observed are not possible given the genotypes (rad_haplotyper assumes that the genotypes in the VCF file are correct, and excludes any haplotypes that are not possible based on that assumption). I'd need to take a closer look at the VCF and BAM files to tell, though.