brentp / duphold

don't get DUP'ed or DEL'ed by your putative SVs.
MIT License
101 stars 9 forks source link

DHGT Missing for Manta DEL/DUP and Delly DEL #54

Closed MikeWLloyd closed 8 months ago

MikeWLloyd commented 8 months ago

I am trying to add DHGT to Manta, SMOOVE/Lumpy and Delly calls along with the depth based metrics provided by Duphold. I am able to get DHGT to both Smoove/Lumpy and Delly calls. DHSP is added to Smoove, Lumpy and Manta DEL calls.

I am not seeing DHGT added to Manta DUP calls. Is there something obvious I am missing?

Thanks.


MantaDUP:

9   28671504    MantaDUP:TANDEM:9028:0:1:0:0:0  G   <DUP:TANDEM>    999 PASS    END=28762797;SVTYPE=DUP;SVLEN=91293;GCF=0.394593    GT:FT:GQ:PL:PR:SR:DHFC:DHFFC:DHBFC  0/1:PASS:999:999,0,999:37,19:46,20:2.03448:2.45833:2.03448
9   85566393    MantaDUP:TANDEM:9236:0:1:0:0:0  T   <DUP:TANDEM>    999 PASS    END=85631054;SVTYPE=DUP;SVLEN=64661;GCF=0.400807    GT:FT:GQ:PL:PR:SR:DHFC:DHFFC:DHBFC  0/1:PASS:995:999,0,992:39,20:48,28:2.03448:1.90323:2.03448

MantaDEL:

8   111308271   MantaDEL:8866:0:1:0:0:0 T   <DEL>   999 PASS    END=111311921;SVTYPE=DEL;SVLEN=-3650;GCF=0.425637   GT:FT:GQ:PL:PR:SR:DHFC:DHFFC:DHBFC:DHSP 1/1:PASS:94:999,97,0:0,27:0,20:0:0:0:30
9   11733925    MantaDEL:8967:0:1:0:0:0 C   <DEL>   999 PASS    END=11740933;SVTYPE=DEL;SVLEN=-7008;CIPOS=0,2;CIEND=0,2;HOMLEN=2;HOMSEQ=CA;GCF=0.391497 GT:FT:GQ:PL:PR:SR:DHFC:DHFFC:DHBFC:DHSP 1/1:PASS:117:999,120,0:0,27:0,26:0:0:0:32

Command:

duphold     --threads 12     --output GRCm39_sim_manta.vcf.gz     --vcf GRCm39_sim_manta_sorted.vcf     --bam GRCm39_sim.bam     --fasta Mus_musculus.GRCm39.dna.primary_assembly.fa     --snp GRCm39_sim_SNP_INDEL_filtered_unannotated_final.bcf

Tool version:

  version: 0.2.1

  Usage: duphold [options]

Options:
  -v --vcf <path>           path to sorted SV VCF/BCF
  -b --bam <path>           path to indexed BAM/CRAM
  -f --fasta <path>         indexed fasta reference.
  -s --snp <path>           optional path to snp/indel VCF/BCF with which to annotate SVs. BCF is highly recommended as it's much faster to parse.
  -t --threads <int>        number of decompression threads. [default: 4]
  -o --output <string>      output VCF/BCF (default is VCF to stdout) [default: -]
  -d --drop                 drop all samples from a multi-sample --vcf *except* the sample in --bam. useful for parallelization by sample followed by merge.
  -h --help                 show help
brentp commented 8 months ago

Hi, I don't recall the exact reason, but I have this as a guard in the annotate code:

  if (':' in alt and ('[' in alt or ']' in alt)) or alt == "<INV>": return

I don't think that's causing the problem since it should pass still. It also has:

  if len(snps.starts) == 0: return

but that should also only occur when there are no snps on the entire chromosome of the SV. can you verify that your manta calls (which seem to have no "chr" prefix) match the SNP calls which should also have no "chr" prefix?

MikeWLloyd commented 8 months ago

@brentp thanks for your quick reply.

Here is a sample of calls out of the VCF I am using:

...
1   3184610 .   C   G   547.64  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=-0.916;DP=29;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=19.56;ReadPosRankSum=-0.636;SOR=0.646    GT:AD:DP:GQ:PL  0/1:11,17:28:99:555,0,336
1   3185681 .   T   C   424.64  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=-0.964;DP=31;ExcessHet=0.0000;FS=5.384;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=13.70;ReadPosRankSum=-0.814;SOR=0.690    GT:AD:DP:GQ:PL  0/1:17,14:31:99:432,0,553
1   3186268 .   G   A   385.64  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=0.860;DP=35;ExcessHet=0.0000;FS=1.414;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=11.69;ReadPosRankSum=-1.327;SOR=1.179 GT:AD:DP:GQ:PL  0/1:20,13:33:99:393,0,656
1   3186863 .   C   G   423.64  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=0.255;DP=35;ExcessHet=0.0000;FS=1.419;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=13.24;ReadPosRankSum=-0.190;SOR=0.368 GT:AD:DP:GQ:PL  0/1:18,14:32:99:431,0,578
1   3187588 .   T   C   739.64  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=0.423;DP=36;ExcessHet=0.0000;FS=13.720;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=20.55;ReadPosRankSum=0.095;SOR=1.886 GT:AD:DP:GQ:PL  0/1:17,19:36:99:747,0,657
...

Given that I use the same VCF for all caller annotations, and it works I am not sure that is the culprit.

Here is the Duphold output:

[duphold] finished
I, [2024-02-13T13:34:59] -- duphold: starting to read snps for chrom: 1
I, [2024-02-13T13:35:01] -- duphold: done reading 146555 bi-allelic snps for chrom: 1
I, [2024-02-13T13:35:10] -- duphold: starting to read snps for chrom: 10
I, [2024-02-13T13:35:11] -- duphold: done reading 96453 bi-allelic snps for chrom: 10
I, [2024-02-13T13:35:20] -- duphold: starting to read snps for chrom: 11
I, [2024-02-13T13:35:21] -- duphold: done reading 89973 bi-allelic snps for chrom: 11
I, [2024-02-13T13:35:29] -- duphold: starting to read snps for chrom: 12
I, [2024-02-13T13:35:30] -- duphold: done reading 86783 bi-allelic snps for chrom: 12
I, [2024-02-13T13:35:38] -- duphold: starting to read snps for chrom: 13
I, [2024-02-13T13:35:39] -- duphold: done reading 87591 bi-allelic snps for chrom: 13
I, [2024-02-13T13:35:48] -- duphold: starting to read snps for chrom: 14
I, [2024-02-13T13:35:49] -- duphold: done reading 88234 bi-allelic snps for chrom: 14
I, [2024-02-13T13:35:57] -- duphold: starting to read snps for chrom: 15
I, [2024-02-13T13:35:58] -- duphold: done reading 76587 bi-allelic snps for chrom: 15
I, [2024-02-13T13:36:04] -- duphold: starting to read snps for chrom: 16
I, [2024-02-13T13:36:05] -- duphold: done reading 72114 bi-allelic snps for chrom: 16
I, [2024-02-13T13:36:12] -- duphold: starting to read snps for chrom: 17
I, [2024-02-13T13:36:13] -- duphold: done reading 69169 bi-allelic snps for chrom: 17
I, [2024-02-13T13:36:19] -- duphold: starting to read snps for chrom: 18
I, [2024-02-13T13:36:20] -- duphold: done reading 66871 bi-allelic snps for chrom: 18
I, [2024-02-13T13:36:24] -- duphold: starting to read snps for chrom: 19
I, [2024-02-13T13:36:25] -- duphold: done reading 44727 bi-allelic snps for chrom: 19
I, [2024-02-13T13:36:37] -- duphold: starting to read snps for chrom: 2
I, [2024-02-13T13:36:39] -- duphold: done reading 133578 bi-allelic snps for chrom: 2
I, [2024-02-13T13:36:50] -- duphold: starting to read snps for chrom: 3
I, [2024-02-13T13:36:51] -- duphold: done reading 118319 bi-allelic snps for chrom: 3
I, [2024-02-13T13:37:02] -- duphold: starting to read snps for chrom: 4
I, [2024-02-13T13:37:04] -- duphold: done reading 112797 bi-allelic snps for chrom: 4
I, [2024-02-13T13:37:14] -- duphold: starting to read snps for chrom: 5
I, [2024-02-13T13:37:16] -- duphold: done reading 109705 bi-allelic snps for chrom: 5
I, [2024-02-13T13:37:26] -- duphold: starting to read snps for chrom: 6
I, [2024-02-13T13:37:27] -- duphold: done reading 110427 bi-allelic snps for chrom: 6
I, [2024-02-13T13:37:37] -- duphold: starting to read snps for chrom: 7
I, [2024-02-13T13:37:38] -- duphold: done reading 101950 bi-allelic snps for chrom: 7
I, [2024-02-13T13:37:47] -- duphold: starting to read snps for chrom: 8
I, [2024-02-13T13:37:48] -- duphold: done reading 94650 bi-allelic snps for chrom: 8
I, [2024-02-13T13:37:57] -- duphold: starting to read snps for chrom: 9
I, [2024-02-13T13:37:58] -- duphold: done reading 92288 bi-allelic snps for chrom: 9

So Duphold is reading SNPs. I am wondering if there is something about the format in the manta VCF of the Manta DUP calls specifically. Also, to be honest with you, I was simply curious why it isn't annotating for just the manta dup calls. Reading your prior posts cautioning using the SNP based filter and what you say in the docs, I am not inclined to use DHGT. However, perhaps someone else would do so for Manta.

MikeWLloyd commented 8 months ago

Actually, it may have to do with how samples are being named in the VCF, BAM, and Manta VCF. I suspect this is a 'me' problem, rather than a code problem. I will close for now, and reopen if needed.