Closed ODiogoSilva closed 8 years ago
I second this request!
This is definitely on the TODO list. The necessary data is kind of scattered throughout the code, so it's a little complicated. And it might be a few weeks before I get around to it. My thought is to use a large dictionary to store this information that could be pickled between steps.
This would need to match the following:
For each locus: (indexed in s7) For each consens seq in that locus (matched to non-indexed loci in cat.consens_.gz): For each read that was clustered into this consens seq (matched in .u files): The genotype score: (1) Read depth (2) Probability of genotype (3) If the majority rule cons base calling is turned on for reads below a certain depth then this should be indicated also.
Do you think users would want a FULL VCF file? or only for variable sites? Currently pyRAD only outputs the latter, but it could do both, thought that would require saving a lot more data.
What do you mean by full VCF file? Including the invariable sequence data around each variable site?
Exactly. Some of the invariable sites could be incorrectly called, and you may want to know if the base call was made when only one allele was present at that site (10/0, P~100%) or when a minor allele came up once or twice (10/1, P~95%), and what the probability (P) was given your estimated error rate and heterozygosity.
I don't have any analyses planned personally where I would envision using quality score data for invariable sites, so my first inclination would be to exclude these, but if this is something that people want to use, or that some software I'm not familiar with makes use of, then I suppose it should be included.
Personally I have no need for that extra information and I'm also not familiar with programs that use that kind of information. I would greatly appreciate the depth and genotype quality or any other relevant piece of information on variable sites though!
ping
Read depth information is now available in the VCF output in ipyrad (http://ipyrad.readthedocs.io). I encourage users to switch over to the new software.
Example:
##fileformat=VCFv4.0
##fileDate=2016/08/11
##source=ipyrad_v.0.3.25
##reference=pseudo-reference (most common base at site)
##phasing=unphased
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=CATG,Number=1,Type=String,Description="Base Counts (CATG)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1A_0 1B_0 1C_0 1D_0 2E_0 2F_0 2G_0 2H_0 3I_0 3J_0 3K_0 3L_0
0 0 . G . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,0,0,16 0/0:0,0,0,20 0/0:0,0,0,19 0/0:0,0,0,24 0/0:0,0,0,22 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,19 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,21 0/0:0,0,0,21
0 1 . A . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,16,0,0 0/0:0,20,0,0 0/0:0,19,0,0 0/0:0,24,0,0 0/0:0,22,0,0 0/0:0,18,0,0 0/0:0,21,0,0 0/0:0,19,0,0 0/0:0,18,0,0 0/0:0,21,0,0 0/0:0,21,0,0 0/0:0,21,0,0
0 2 . T . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,0,16,0 0/0:0,0,20,0 0/0:0,0,19,0 0/0:0,0,24,0 0/0:0,0,22,0 0/0:0,0,18,0 0/0:0,0,21,0 0/0:0,0,19,0 0/0:0,0,18,0 0/0:0,0,21,0 0/0:0,0,21,0 0/0:0,0,21,0
0 3 . G . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,0,0,16 0/0:0,0,0,20 0/0:0,0,0,19 0/0:0,0,0,24 0/0:0,0,0,22 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,19 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,21 0/0:0,0,0,21
0 4 . G . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,0,0,16 0/0:0,0,0,20 0/0:0,0,0,19 0/0:0,0,0,24 0/0:0,0,0,22 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,19 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,21 0/0:0,0,0,21
0 5 . G . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,0,0,16 0/0:0,0,0,20 0/0:0,0,0,19 0/0:0,0,0,24 0/0:0,1,0,21 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,19 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,21 0/0:0,0,0,21
0 6 . A . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,16,0,0 0/0:0,20,0,0 0/0:0,19,0,0 0/0:0,24,0,0 0/0:0,22,0,0 0/0:0,18,0,0 0/0:0,21,0,0 0/0:0,19,0,0 0/0:0,18,0,0 0/0:0,21,0,0 0/0:0,21,0,0 0/0:0,21,0,0
0 7 . A . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,16,0,0 0/0:0,20,0,0 0/0:0,19,0,0 0/0:0,24,0,0 0/0:0,22,0,0 0/0:0,18,0,0 0/0:0,21,0,0 0/0:0,19,0,0 0/0:0,18,0,0 0/0:0,21,0,0 0/0:0,21,0,0 0/0:0,21,0,0
0 8 . C . 13 PASS NS=12;DP=240 GT:CATG 0/0:16,0,0,0 0/0:20,0,0,0 0/0:19,0,0,0 0/0:24,0,0,0 0/0:22,0,0,0 0/0:18,0,0,0 0/0:21,0,0,0 0/0:19,0,0,0 0/0:18,0,0,0 0/0:21,0,0,0 0/0:21,0,0,0 0/0:21,0,0,0
0 9 . G . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,0,0,16 0/0:0,0,0,20 0/0:0,0,0,19 0/0:0,0,0,24 0/0:0,0,0,22 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,19 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,21 0/0:0,0,0,21
0 10 . G . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,0,0,16 0/0:0,0,0,20 0/0:0,0,0,19 0/0:0,0,0,24 0/0:0,0,0,22 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,19 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,21 0/0:0,0,0,21
0 11 . C . 13 PASS NS=12;DP=240 GT:CATG 0/0:16,0,0,0 0/0:20,0,0,0 0/0:19,0,0,0 0/0:24,0,0,0 0/0:22,0,0,0 0/0:18,0,0,0 0/0:21,0,0,0 0/0:19,0,0,0 0/0:18,0,0,0 0/0:21,0,0,0 0/0:21,0,0,0 0/0:20,0,1,0
0 12 . C . 13 PASS NS=12;DP=240 GT:CATG 0/0:16,0,0,0 0/0:20,0,0,0 0/0:19,0,0,0 0/0:24,0,0,0 0/0:22,0,0,0 0/0:18,0,0,0 0/0:21,0,0,0 0/0:19,0,0,0 0/0:18,0,0,0 0/0:21,0,0,0 0/0:21,0,0,0 0/0:21,0,0,0
0 13 . T . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,0,16,0 0/0:0,0,20,0 0/0:0,0,19,0 0/0:0,0,24,0 0/0:0,0,22,0 0/0:0,0,18,0 0/0:0,0,21,0 0/0:0,0,19,0 0/0:0,0,18,0 0/0:0,0,21,0 0/0:0,0,21,0 0/0:0,0,21,0
0 14 . A . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,16,0,0 0/0:0,20,0,0 0/0:0,19,0,0 0/0:0,24,0,0 0/0:0,22,0,0 0/0:0,18,0,0 0/0:0,21,0,0 0/0:0,19,0,0 0/0:0,18,0,0 0/0:0,21,0,0 0/0:0,21,0,0 0/0:0,21,0,0
0 15 . G . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,0,0,16 0/0:0,0,0,20 0/0:0,0,0,19 0/0:0,0,0,24 0/0:0,0,0,22 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,19 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,21 0/0:0,0,0,21
0 16 . C . 13 PASS NS=12;DP=240 GT:CATG 0/0:16,0,0,0 0/0:20,0,0,0 0/0:19,0,0,0 0/0:24,0,0,0 0/0:22,0,0,0 0/0:18,0,0,0 0/0:21,0,0,0 0/0:19,0,0,0 0/0:18,0,0,0 0/0:21,0,0,0 0/0:21,0,0,0 0/0:21,0,0,0
0 17 . G . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,0,0,16 0/0:0,0,0,20 0/0:0,0,0,19 0/0:0,0,0,24 0/0:0,0,0,22 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,19 0/0:0,0,0,18 0/0:0,0,0,21 0/0:0,0,0,21 0/0:0,0,0,21
0 18 . A . 13 PASS NS=12;DP=240 GT:CATG 0/0:0,16,0,0 0/0:0,20,0,0 0/0:0,19,0,0 0/0:0,24,0,0 0/0:0,22,0,0 0/0:0,18,0,0 0/0:0,21,0,0 0/0:0,19,0,0 0/0:0,18,0,0 0/0:0,21,0,0 0/0:0,21,0,0 0/0:0,21,0,0
Is ipyrad now ready to supersede pyrad? I mean - is it what you currently recommend for RAD-Seq analysis?
@StuntsPT
Yes, we've been beta testing for a few months and have decided it is now publication-ready. It is lacking a few options that are available in pyrad just because we haven't had time to incorporate them yet.
Our major development focus right now includes:
But for most general analyses ipyrad should be good to go, and much faster than pyrad. We are working on a manuscript/preprint as well that we hope to finish this summer.
-Deren
It would be useful to have information regarding genotype quality and read depth for each sample in the VCF file, which is already provided in the Stacks output but not in pyRAD's.