chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
100 stars 37 forks source link

NONREF_NONPOLY and population_summary.txt empty #51

Open nicolo-tellini opened 3 years ago

nicolo-tellini commented 3 years ago

Hello

I run your tool and I find it really useful nevertheless there are some outcomes that I do not completely understand.

One is about NONREF_NONPOLY under the column class in the file site_results.txt. what does NONREF_NONPOLY means ?

The second one is the output of population_summary. Even if I have SNPs this file is always empty whatever chromosome I run :

file sites sites_coding sites_noncoding pi pi_coding pi_noncoding N_sites S_sites piN piS mean_dN_vs_ref mean_dS_vs_ref mean_gdiv_polymorphic mean_N_gdiv mean_S_gdiv mean_gdiv sites_polymorphic mean_gdiv_coding_poly sites_coding_poly mean_gdiv_noncoding_poly sites_noncoding_poly temp_vcf4_AfricanBeer.vcf 0 0 0 0 0 0 0 * 0

The command I run is :

snpgenie.pl --vcfformat=4 --snpreport=./vcf/chrIV_Scc.AfricanBeer.vcf --fastafile=./fasta/chrIV.Scc.fa --gtffile=./gft/chrIV_Scc.Scc.gft --outdir chrIV

so each file is specific for chr 4 (chrIV_Scc) and I modified the vcf as suggested in #35. here an example :

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AfricanBeer

chrIV_Scc 26094 . C A 159.76 . INFO AD:DP 4,4:39 chrIV_Scc 27939 . T C 222 . INFO AD:DP 0,4:39 chrIV_Scc 28094 . C T 222 . INFO AD:DP 0,4:39 chrIV_Scc 28654 . C T 225.01 . INFO AD:DP 0,1:39 chrIV_Scc 28968 . C T 225.01 . INFO AD:DP 0,1:39

few notes: I replaced the field INFO with INFO to make it readable. In contrast to #35 the DP field disagree with the sum of AD values because some positions were missing in the original vcf files.

I am sorry for the format of the text but I am new in git-hub.

Thank you in advance

Best,

Nicolò Tellini

nicolo-tellini commented 3 years ago

Hello,

I also add an other observation. The number of var in site_results.txt differs from the number of SNPs in the vcf although I did not specify any cutoff for the allele freq in the input command. For example, the VCF position below is missing in site_results.txt

chrI_Scc 140142 . A G 222 . AC1=2;AC=18;AF1=1;AN=18;BQB=0.830275;DP4=2,2,793,936;DP=1844;FQ=-281.989;MQ0F=0.0114504;MQ=55;MQB=0.779817;MQSB=0.705766;PV4=0.503694,1,0.269537,1; RPB=0.706422;SF=0,1,2,3,4,5,6,7,8;SGB=-0.693147;VDB=0.475097 AD:DP 0,18:18

Best,

Nicolò

singing-scientist commented 3 years ago

Hello @nicolo-tellini ! Thanks for the great questions.

NONREF_NONPOLY means the major allele at the site does not match the reference sequence, and the site is not polymorphic. Thus, the site is fixed for a non-reference nucleotide.

Regarding the puzzling results, the most likely reason is a mismatch when the SNPGenie VCF format being specified. However, it is very difficult for me to troubleshoot without exact example files. If you could prepare a set of small example files (FASTA / VCF / GTF) and email it to me, I'd be happy to give it a try and see what's happening.

Let me know! Chase