chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
102 stars 37 forks source link

mulifasta genome questions #11

Closed binshuangli closed 5 years ago

binshuangli commented 5 years ago

Hi Chase,

I am working with a genome with multiple scaffolds. I followed the instruction to split the fasta file into individual single sequence fasta file now. I am curious what the next step is. Should I also split the gtf file and vcf file into separate files or they can be left unchanged?

I tried to grep the information from the gtf and vcf file for a specific scaffold. SNPGenie seems to be running OK and then killed displaying "Performing final calculations, noncoding overlap analysis, and output... Killed" for unknown reason. I have the test files attached here: https://drive.google.com/open?id=1HphfXSw7QPJRuXTGf1nO98Cjd_agP6ok

singing-scientist commented 5 years ago

Hello, @binshuangli !

That is exactly right. What is important is that the SITE COORDINATES (1-based) in the GTF and VCF files match the (1-based) positions in the FASTA file. You can think of SNPGenie input as a single unit corresponding to a single DNA sequence. You will need to run SNPGenie once for every sequence/VCF/GTF combination — be it a genome, chromosome, or scaffold. Thus, in your case, you could place the three input files for each scaffold in its own directory, running SNPGenie once in each directory. Sites and differences can then be summed for genome-wide values, if you wish.

SNPGenie worked for me using your example, using the following command:

snpgenie.pl --vcfformat=3 --snpreport=pool_FR_alfalfa.sorted.utg000001l_pilon10.vcf --fastafile=racon4_pilon10_utg000001l_pilon10.fasta --gtffile=utg000001l_pilon10_corrected_0start.gtf

You should obtain πN = 0.011 and πS = 0.

If that does not work, please provide the exact command-line input you used. Let me know...

Yours, Chase

binshuangli commented 5 years ago

Hi Chase,

Thanks for your reply and confirmation! It turned out the process is killed because of the memory restriction of the login node on our linux server. Now I could complete the analysis by splitting all files into individual scaffolds. I also modified your code a bit to store the output into directories with different names based on the input.

Best, Binshuang