dereneaton / pyrad

Assembly and analysis of RADseq data sets
GNU General Public License v3.0
27 stars 16 forks source link

More info in VCF output #14

Closed ODiogoSilva closed 8 years ago

ODiogoSilva commented 9 years ago

It would be useful to have information regarding genotype quality and read depth for each sample in the VCF file, which is already provided in the Stacks output but not in pyRAD's.

StuntsPT commented 9 years ago

I second this request!

dereneaton commented 9 years ago

This is definitely on the TODO list. The necessary data is kind of scattered throughout the code, so it's a little complicated. And it might be a few weeks before I get around to it. My thought is to use a large dictionary to store this information that could be pickled between steps.

This would need to match the following:

For each locus: (indexed in s7) For each consens seq in that locus (matched to non-indexed loci in cat.consens_.gz): For each read that was clustered into this consens seq (matched in .u files): The genotype score: (1) Read depth (2) Probability of genotype (3) If the majority rule cons base calling is turned on for reads below a certain depth then this should be indicated also.

Do you think users would want a FULL VCF file? or only for variable sites? Currently pyRAD only outputs the latter, but it could do both, thought that would require saving a lot more data.

ODiogoSilva commented 9 years ago

What do you mean by full VCF file? Including the invariable sequence data around each variable site?

dereneaton commented 9 years ago

Exactly. Some of the invariable sites could be incorrectly called, and you may want to know if the base call was made when only one allele was present at that site (10/0, P~100%) or when a minor allele came up once or twice (10/1, P~95%), and what the probability (P) was given your estimated error rate and heterozygosity.

I don't have any analyses planned personally where I would envision using quality score data for invariable sites, so my first inclination would be to exclude these, but if this is something that people want to use, or that some software I'm not familiar with makes use of, then I suppose it should be included.

ODiogoSilva commented 9 years ago

Personally I have no need for that extra information and I'm also not familiar with programs that use that kind of information. I would greatly appreciate the depth and genotype quality or any other relevant piece of information on variable sites though!

ODiogoSilva commented 8 years ago

ping

dereneaton commented 8 years ago

Read depth information is now available in the VCF output in ipyrad (http://ipyrad.readthedocs.io). I encourage users to switch over to the new software.

Example:

##fileformat=VCFv4.0
##fileDate=2016/08/11
##source=ipyrad_v.0.3.25
##reference=pseudo-reference (most common base at site)
##phasing=unphased
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=CATG,Number=1,Type=String,Description="Base Counts (CATG)">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  1A_0    1B_0    1C_0    1D_0    2E_0    2F_0    2G_0    2H_0    3I_0    3J_0    3K_0    3L_0
0       0       .       G       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,0,0,16    0/0:0,0,0,20    0/0:0,0,0,19    0/0:0,0,0,24    0/0:0,0,0,22    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,19    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,21    0/0:0,0,0,21
0       1       .       A       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,16,0,0    0/0:0,20,0,0    0/0:0,19,0,0    0/0:0,24,0,0    0/0:0,22,0,0    0/0:0,18,0,0    0/0:0,21,0,0    0/0:0,19,0,0    0/0:0,18,0,0    0/0:0,21,0,0    0/0:0,21,0,0    0/0:0,21,0,0
0       2       .       T       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,0,16,0    0/0:0,0,20,0    0/0:0,0,19,0    0/0:0,0,24,0    0/0:0,0,22,0    0/0:0,0,18,0    0/0:0,0,21,0    0/0:0,0,19,0    0/0:0,0,18,0    0/0:0,0,21,0    0/0:0,0,21,0    0/0:0,0,21,0
0       3       .       G       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,0,0,16    0/0:0,0,0,20    0/0:0,0,0,19    0/0:0,0,0,24    0/0:0,0,0,22    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,19    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,21    0/0:0,0,0,21
0       4       .       G       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,0,0,16    0/0:0,0,0,20    0/0:0,0,0,19    0/0:0,0,0,24    0/0:0,0,0,22    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,19    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,21    0/0:0,0,0,21
0       5       .       G       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,0,0,16    0/0:0,0,0,20    0/0:0,0,0,19    0/0:0,0,0,24    0/0:0,1,0,21    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,19    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,21    0/0:0,0,0,21
0       6       .       A       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,16,0,0    0/0:0,20,0,0    0/0:0,19,0,0    0/0:0,24,0,0    0/0:0,22,0,0    0/0:0,18,0,0    0/0:0,21,0,0    0/0:0,19,0,0    0/0:0,18,0,0    0/0:0,21,0,0    0/0:0,21,0,0    0/0:0,21,0,0
0       7       .       A       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,16,0,0    0/0:0,20,0,0    0/0:0,19,0,0    0/0:0,24,0,0    0/0:0,22,0,0    0/0:0,18,0,0    0/0:0,21,0,0    0/0:0,19,0,0    0/0:0,18,0,0    0/0:0,21,0,0    0/0:0,21,0,0    0/0:0,21,0,0
0       8       .       C       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:16,0,0,0    0/0:20,0,0,0    0/0:19,0,0,0    0/0:24,0,0,0    0/0:22,0,0,0    0/0:18,0,0,0    0/0:21,0,0,0    0/0:19,0,0,0    0/0:18,0,0,0    0/0:21,0,0,0    0/0:21,0,0,0    0/0:21,0,0,0
0       9       .       G       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,0,0,16    0/0:0,0,0,20    0/0:0,0,0,19    0/0:0,0,0,24    0/0:0,0,0,22    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,19    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,21    0/0:0,0,0,21
0       10      .       G       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,0,0,16    0/0:0,0,0,20    0/0:0,0,0,19    0/0:0,0,0,24    0/0:0,0,0,22    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,19    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,21    0/0:0,0,0,21
0       11      .       C       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:16,0,0,0    0/0:20,0,0,0    0/0:19,0,0,0    0/0:24,0,0,0    0/0:22,0,0,0    0/0:18,0,0,0    0/0:21,0,0,0    0/0:19,0,0,0    0/0:18,0,0,0    0/0:21,0,0,0    0/0:21,0,0,0    0/0:20,0,1,0
0       12      .       C       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:16,0,0,0    0/0:20,0,0,0    0/0:19,0,0,0    0/0:24,0,0,0    0/0:22,0,0,0    0/0:18,0,0,0    0/0:21,0,0,0    0/0:19,0,0,0    0/0:18,0,0,0    0/0:21,0,0,0    0/0:21,0,0,0    0/0:21,0,0,0
0       13      .       T       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,0,16,0    0/0:0,0,20,0    0/0:0,0,19,0    0/0:0,0,24,0    0/0:0,0,22,0    0/0:0,0,18,0    0/0:0,0,21,0    0/0:0,0,19,0    0/0:0,0,18,0    0/0:0,0,21,0    0/0:0,0,21,0    0/0:0,0,21,0
0       14      .       A       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,16,0,0    0/0:0,20,0,0    0/0:0,19,0,0    0/0:0,24,0,0    0/0:0,22,0,0    0/0:0,18,0,0    0/0:0,21,0,0    0/0:0,19,0,0    0/0:0,18,0,0    0/0:0,21,0,0    0/0:0,21,0,0    0/0:0,21,0,0
0       15      .       G       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,0,0,16    0/0:0,0,0,20    0/0:0,0,0,19    0/0:0,0,0,24    0/0:0,0,0,22    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,19    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,21    0/0:0,0,0,21
0       16      .       C       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:16,0,0,0    0/0:20,0,0,0    0/0:19,0,0,0    0/0:24,0,0,0    0/0:22,0,0,0    0/0:18,0,0,0    0/0:21,0,0,0    0/0:19,0,0,0    0/0:18,0,0,0    0/0:21,0,0,0    0/0:21,0,0,0    0/0:21,0,0,0
0       17      .       G       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,0,0,16    0/0:0,0,0,20    0/0:0,0,0,19    0/0:0,0,0,24    0/0:0,0,0,22    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,19    0/0:0,0,0,18    0/0:0,0,0,21    0/0:0,0,0,21    0/0:0,0,0,21
0       18      .       A       .       13      PASS    NS=12;DP=240    GT:CATG 0/0:0,16,0,0    0/0:0,20,0,0    0/0:0,19,0,0    0/0:0,24,0,0    0/0:0,22,0,0    0/0:0,18,0,0    0/0:0,21,0,0    0/0:0,19,0,0    0/0:0,18,0,0    0/0:0,21,0,0    0/0:0,21,0,0    0/0:0,21,0,0
StuntsPT commented 8 years ago

Is ipyrad now ready to supersede pyrad? I mean - is it what you currently recommend for RAD-Seq analysis?

dereneaton commented 8 years ago

@StuntsPT

Yes, we've been beta testing for a few months and have decided it is now publication-ready. It is lacking a few options that are available in pyrad just because we haven't had time to incorporate them yet.

Our major development focus right now includes:

But for most general analyses ipyrad should be good to go, and much faster than pyrad. We are working on a manuscript/preprint as well that we hope to finish this summer.

-Deren