chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
102 stars 37 forks source link

SNPGenie with CRISP output #24

Closed mullerbsf closed 4 years ago

mullerbsf commented 4 years ago

Dear Chase,

I am working on pooled data that and I hope to use SNPGenie to generate the population parameters. I have a question for you regarding how to enter the VCF file I have.

The variant calling was done using CRISP: https://github.com/vibansal/crisp https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881398/pdf/btq214.pdf

After converted to VCF, the output from CRISP look like this:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2 ...

2 321989 SNPID G A 9420 PASS AC=37;AF=0.389474;AN=95;CT=-31.4;DP=545,492,10;EMstats=942.07:-398.73;FLANKSEQ=cgagggagga:G:caaggcaagg;HWEstats=12.4;MQS=31,18,119,1005;NP=38;NS=19;VF=EMpass;VP=19;VT=SNV GT:GQ:DP:ADf:ADr:ADb 0/0/0/1/1:7:53:15,8:7,14:1,0 0/0/0/0/1:4:57:20,3:15,11:0,0

Can SNPGenie use this output? It is not clear to me what VCF input format option to use (--vcfformat=2 or 4), since CRISP output (ADf:ADr:ADb) does not have a regular AD format.

My goal is to generate diversity statistics by sample. When I tried to use --vcfformat=2 option, the SNPGenie considered the whole population. However, with --vcfformat=4 option, I am not sure if the AD format is appropriately recognized.

Your help is much appreciated.

Best regards, Barbara.

singing-scientist commented 4 years ago

Thanks very much for using SNPGenie! Happy to help.

Indeed, I have never encountered CRISP or its VCF format before. Unfortunately, if you use SNPGenie format #1, you get the whole (meta-)population, and not each sample individually. Is it possible for CRISP to output a distinct file for each sample?

If I am correct, in its current form the closest format is SNPGenie's #4. (But I can find no documentation on what the ADb column means in CRISP.) A quick solution for you would be to manually (or write a script to) FIND/REPLACE the ADf:ADr tag with just AD, and replace the corresponding data (20,3:15,11) data with a single string (35,14 -- because 20+15=35 and 3+11=14). Let me know if I have understood the data format correctly.

I am happy to add the new format, but it will be at least a week before I can do so.

Let me know!

mullerbsf commented 4 years ago

Dear Chase,

Thank you for your answer.

If you can add this new CRISP format on SNPGenie will be much appreciated.

However, I contacted the CRISP developer to clarify the output, because the ADf and ADr never match with the total depth (DP field).

I am still confused about the ADf, ADr and ADb meaning.

In the VCF definitions: DP: "Total number of reads (+strand,-strand,bidirectional) across all pools (filtered reads only)". ADf: "Number of reads aligned to the forward strand of the genome supporting reference allele and the alternate alleles in the order listed". ADr: "Number of reads aligned to the reverse strand of the genome supporting reference allele and the alternate alleles in the order listed". ADb: "Number of overlapping paired-end reads (read from both strands) supporting reference allele and the alternate alleles in the order listed”.

ADf is a comma separated list of the read counts for each allele on the forward strand. You can sum up the ADf, ADr and ADb values for each allele to get the allele balance.

FORMAT: GT:GQ:DP:ADf:ADr:ADb

SAMPLE 1: 0/0/0/1/1:7:53:15,8:7,14:1,0 SAMPLE 2: 0/0/0/0/1:4:57:20,3:15,11:0,0

It seems like the total depth is 45 (SAMPLE 1) and 49 (SAMPLE 2), respectively. The sums of the ADf, ADr and ADb fields, however, are not equal to the total depth (53 and 57, as in DP field).

Then, I do not know if I can replace it, as you have suggested.

I appreciate your attention. Best regards,

Barbara.

singing-scientist commented 4 years ago

Thanks, Barbara!

Did you hear back from the CRISP developer(s) about this issue? I am very hesitant to spend any time on this problem unless they can clarify what their output means. To me, it is strange that DP would be greater than the read counts, because they define DP as "filtered reads only" — so if anything I'd expect its value to be less than or equal to the sums obtained from adding the AD columns. As an alternative, you could also consider converting to SNPGenie's VCF format #3 — just know that in this case, each sample would need its own file. Each "sample" file would be identical except for the final column, which would be unique for each sample.

Let me know -- Chase

mullerbsf commented 4 years ago

Dear Chase,

The CRISP developer answered (Vikas Bansal): "The total read depth (DP) includes all reads. Some of these may be filtered out for the calculation of ADf, ADr and ADb which are used for variant calling."

He will update the code to output the correct DP value with filtered reads only (See: https://github.com/vibansal/crisp/issues/14).

Therefore, the ADf, ADr and ADb fields are filtered reads. Are you able to add this format to your SNPGenie? It will be necessary for other future users that want to use SNPGenie with CRISP output.

For now, I created a code to sum all these fields to replace the DP and AD to run SNPGenie in the VCF format #4.

Thank you very much for this help. Best regards,

Barbara.

singing-scientist commented 4 years ago

Fantastic! This makes more sense.

If you could be so kind as to summarize the new format, especially the fields relevant to allele frequencies, and send at least one example (FASTA, GTF, and VCF), I would be happy to incorporate this format. Given my current queue, I would expect to finish this in approximately 1 week, if it would still be useful to you. Please let me know.

Yours, Chase

singing-scientist commented 4 years ago

Dear @mullerbsf: has this issue been resolved?

mullerbsf commented 4 years ago

Yes. Thank you for all your support.