arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License
318 stars 120 forks source link

support non-freebayes/GATK callers by preprocessing #409

Open brentp opened 9 years ago

brentp commented 9 years ago

add a script like:

preprocess_vcf -v my.platypus.vcf --dp TC > my.dp.platypus.vcf

and likewise for varscan and other common options.

cboustred commented 9 years ago

Great, that sounds like a good idea, happy to help where I can

Thanks again for all your help on this

Chris

brentp commented 9 years ago

I have started this here: https://gist.github.com/brentp/4db670df147cbd5a2b32 @cboustred let me know if you have any problems with that script.

cboustred commented 9 years ago

Hi Brent, apologies for the slow reply, been on A/L.

This is great, thank you so much for spending time on this. I hope I can learn some neat Python by reading your code!

I've tested out the preprocess.py script on a couple of VarScan and Platypus VCFs and have hit the following errors:

python preprocess.py --depth NR --alt-depth NV platypus_example.vcf ... Traceback (most recent call last): File "preprocess.py", line 132, in main(sys.argv[1:]) File "preprocess.py", line 127, in main preprocess(a.vcf, a.ref_depth, a.alt_depth, a.depth) File "preprocess.py", line 104, in preprocess [set_depths(s, ref_i, alt_i, dp_i, depth != "DP") for s in samples] File "preprocess.py", line 39, in set_depths assert len(set(dp.split(","))) == 1 AssertionError

and for VarScan:

python preprocess.py --ref-depth RD --alt-depth AD varscan_example.vcf ... Traceback (most recent call last): File "preprocess.py", line 132, in main(sys.argv[1:]) File "preprocess.py", line 127, in main preprocess(a.vcf, a.ref_depth, a.alt_depth, a.depth) File "preprocess.py", line 104, in preprocess [set_depths(s, ref_i, alt_i, dp_i, depth != "DP") for s in samples] File "preprocess.py", line 31, in set_depths depths.append(sample[ref_i]) IndexError: list index out of range

For VarScan it looks like it is related when there are no-call genotypes. When this happens the format of the sample genotypes do not match the format style!

For Platypus it looks like its related to the processing of multi-allelic sites...

I will email you with some minimal VCFs that should replicate the above errors as I don't seem to be able to attached them here. I will also spend some time looking over the code and seeing if I can spot a fix

Thanks again for your help with this,

Chris

brentp commented 9 years ago

Thanks for the great test-cases. I fixed the varscan one and updated the gist. I don't understand how to handle the platypus error. Here is the line:

chr1
21016838
.
TG
GG,TGG
37
HapScore;QD
BRF=0.04;FR=0.2942,0.2942;HP=10;HapScore=7;MGOF=24;MMLQ=15;MQ=58.89;NF=25,0;NR=389,21;PP=32,37;QD=0.161283122095;SC=CCTGGGGGGGTGGGAGGGACC;SbPval=1.0;Source=Platypus;TC=935;TCF=395;TCR=540;TR=414,21;WE=21016850;WS=21016828
GT:GL:GOF:GQ:NR:NV
1/0:-1,-1,-1:13:3:80,76:37,1
1/0:-1,-1,-1:24:3:69,65:30,1
1/0:-1,-1,-1:10:3:68,60:26,3
1/0:-1,-1,-1:23:3:72,69:43,0
1/0:-1,-1,-1:21:3:76,70:30,4
1/0:-1,-1,-1:18:3:109,98:46,4
1/0:-1,-1,-1:14:3:85,78:37,2
1/0:-1,-1,-1:11:3:77,71:34,2
1/0:-1,-1,-1:9:3:69,63:35,0
1/0:-1,-1,-1:15:3:79,77:38,1
1/0:-1,-1,-1:23:3:75,69:29,2
1/0:-1,-1,-1:14:3:76,61:29,1

where NR=Number of reads covering variant location in this sample

and NV=Number of reads containing variant in this sample

So, IIUC the NR's are different because they have different length(?). I need to choose a value for the total read coverage at the variant site. As a hack, I can take the average, or the one with length closest to the length of the REF allele, but I don't think there is an obvious way to handle that. Any ideas?

If you have a platypus file with more multiple alts, that'd be great. thanks again.

cboustred commented 9 years ago

Thanks, apologies not sure if I understand...

From my understanding, NR is the 'total read coverage at the variant site' for a specific sample (i.e. total depth).

NV is the number of reads that are supporting the variant allele. Total reads supporting the reference allele therefore = NR - NV

The above example is multi allelic so there are two NRs and two NVs for each allele listed in the alt column.

Does that make sense or have I misunderstood your question?

Thanks again,

Chris

On 10 April 2015 at 20:11, Brent Pedersen - Bioinformatics < notifications@github.com> wrote:

Thanks for the great test-cases. I fixed the varscan one and updated the gist. I don't understand how to handle the platypus error. Here is the line:

chr1 21016838 . TG GG,TGG 37 HapScore;QD BRF=0.04;FR=0.2942,0.2942;HP=10;HapScore=7;MGOF=24;MMLQ=15;MQ=58.89;NF=25,0;NR=389,21;PP=32,37;QD=0.161283122095;SC=CCTGGGGGGGTGGGAGGGACC;SbPval=1.0;Source=Platypus;TC=935;TCF=395;TCR=540;TR=414,21;WE=21016850;WS=21016828 GT:GL:GOF:GQ:NR:NV 1/0:-1,-1,-1:13:3:80,76:37,1 1/0:-1,-1,-1:24:3:69,65:30,1 1/0:-1,-1,-1:10:3:68,60:26,3 1/0:-1,-1,-1:23:3:72,69:43,0 1/0:-1,-1,-1:21:3:76,70:30,4 1/0:-1,-1,-1:18:3:109,98:46,4 1/0:-1,-1,-1:14:3:85,78:37,2 1/0:-1,-1,-1:11:3:77,71:34,2 1/0:-1,-1,-1:9:3:69,63:35,0 1/0:-1,-1,-1:15:3:79,77:38,1 1/0:-1,-1,-1:23:3:75,69:29,2 1/0:-1,-1,-1:14:3:76,61:29,1

where NR=Number of reads covering variant location in this sample

and NV=Number of reads containing variant in this sample

So, IIUC the NR's are different because they have different length(?). I need to choose a value for the total read coverage at the variant site. As a hack, I can take the average, or the one with length closest to the length of the REF allele, but I don't think there is an obvious way to handle that. Any ideas?

If you have a platypus file with more multiple alts, that'd be great. thanks again.

— Reply to this email directly or view it on GitHub https://github.com/arq5x/gemini/issues/409#issuecomment-91655155.