Open brentp opened 9 years ago
Great, that sounds like a good idea, happy to help where I can
Thanks again for all your help on this
Chris
I have started this here: https://gist.github.com/brentp/4db670df147cbd5a2b32 @cboustred let me know if you have any problems with that script.
Hi Brent, apologies for the slow reply, been on A/L.
This is great, thank you so much for spending time on this. I hope I can learn some neat Python by reading your code!
I've tested out the preprocess.py script on a couple of VarScan and Platypus VCFs and have hit the following errors:
python preprocess.py --depth NR --alt-depth NV platypus_example.vcf
...
Traceback (most recent call last):
File "preprocess.py", line 132, in
and for VarScan:
python preprocess.py --ref-depth RD --alt-depth AD varscan_example.vcf
...
Traceback (most recent call last):
File "preprocess.py", line 132, in
For VarScan it looks like it is related when there are no-call genotypes. When this happens the format of the sample genotypes do not match the format style!
For Platypus it looks like its related to the processing of multi-allelic sites...
I will email you with some minimal VCFs that should replicate the above errors as I don't seem to be able to attached them here. I will also spend some time looking over the code and seeing if I can spot a fix
Thanks again for your help with this,
Chris
Thanks for the great test-cases. I fixed the varscan one and updated the gist. I don't understand how to handle the platypus error. Here is the line:
chr1
21016838
.
TG
GG,TGG
37
HapScore;QD
BRF=0.04;FR=0.2942,0.2942;HP=10;HapScore=7;MGOF=24;MMLQ=15;MQ=58.89;NF=25,0;NR=389,21;PP=32,37;QD=0.161283122095;SC=CCTGGGGGGGTGGGAGGGACC;SbPval=1.0;Source=Platypus;TC=935;TCF=395;TCR=540;TR=414,21;WE=21016850;WS=21016828
GT:GL:GOF:GQ:NR:NV
1/0:-1,-1,-1:13:3:80,76:37,1
1/0:-1,-1,-1:24:3:69,65:30,1
1/0:-1,-1,-1:10:3:68,60:26,3
1/0:-1,-1,-1:23:3:72,69:43,0
1/0:-1,-1,-1:21:3:76,70:30,4
1/0:-1,-1,-1:18:3:109,98:46,4
1/0:-1,-1,-1:14:3:85,78:37,2
1/0:-1,-1,-1:11:3:77,71:34,2
1/0:-1,-1,-1:9:3:69,63:35,0
1/0:-1,-1,-1:15:3:79,77:38,1
1/0:-1,-1,-1:23:3:75,69:29,2
1/0:-1,-1,-1:14:3:76,61:29,1
where NR=Number of reads covering variant location in this sample
and NV=Number of reads containing variant in this sample
So, IIUC the NR's are different because they have different length(?). I need to choose a value for the total read coverage at the variant site. As a hack, I can take the average, or the one with length closest to the length of the REF allele, but I don't think there is an obvious way to handle that. Any ideas?
If you have a platypus file with more multiple alts, that'd be great. thanks again.
Thanks, apologies not sure if I understand...
From my understanding, NR is the 'total read coverage at the variant site' for a specific sample (i.e. total depth).
NV is the number of reads that are supporting the variant allele. Total reads supporting the reference allele therefore = NR - NV
The above example is multi allelic so there are two NRs and two NVs for each allele listed in the alt column.
Does that make sense or have I misunderstood your question?
Thanks again,
Chris
On 10 April 2015 at 20:11, Brent Pedersen - Bioinformatics < notifications@github.com> wrote:
Thanks for the great test-cases. I fixed the varscan one and updated the gist. I don't understand how to handle the platypus error. Here is the line:
chr1 21016838 . TG GG,TGG 37 HapScore;QD BRF=0.04;FR=0.2942,0.2942;HP=10;HapScore=7;MGOF=24;MMLQ=15;MQ=58.89;NF=25,0;NR=389,21;PP=32,37;QD=0.161283122095;SC=CCTGGGGGGGTGGGAGGGACC;SbPval=1.0;Source=Platypus;TC=935;TCF=395;TCR=540;TR=414,21;WE=21016850;WS=21016828 GT:GL:GOF:GQ:NR:NV 1/0:-1,-1,-1:13:3:80,76:37,1 1/0:-1,-1,-1:24:3:69,65:30,1 1/0:-1,-1,-1:10:3:68,60:26,3 1/0:-1,-1,-1:23:3:72,69:43,0 1/0:-1,-1,-1:21:3:76,70:30,4 1/0:-1,-1,-1:18:3:109,98:46,4 1/0:-1,-1,-1:14:3:85,78:37,2 1/0:-1,-1,-1:11:3:77,71:34,2 1/0:-1,-1,-1:9:3:69,63:35,0 1/0:-1,-1,-1:15:3:79,77:38,1 1/0:-1,-1,-1:23:3:75,69:29,2 1/0:-1,-1,-1:14:3:76,61:29,1
where NR=Number of reads covering variant location in this sample
and NV=Number of reads containing variant in this sample
So, IIUC the NR's are different because they have different length(?). I need to choose a value for the total read coverage at the variant site. As a hack, I can take the average, or the one with length closest to the length of the REF allele, but I don't think there is an obvious way to handle that. Any ideas?
If you have a platypus file with more multiple alts, that'd be great. thanks again.
— Reply to this email directly or view it on GitHub https://github.com/arq5x/gemini/issues/409#issuecomment-91655155.
add a script like:
and likewise for varscan and other common options.