chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
102 stars 37 forks source link

Multi-Fasta Reference Issue & Output Nucleotide Diversity for all Sites (Not Only Variant) #17

Closed j3551ca closed 5 years ago

j3551ca commented 5 years ago

Hi there,

I have been experiencing issues getting nucleotide diversity (π) for every site in the whole genome for a segmented virus population. The reference fasta has 10 segments and the vcf file was generated with samtools mpileup and bcftools call -c (both v1.3.1). This returned the error: WARNING: There are multiple sequences in the FASTA file. There must be only one reference genome. SNPGenie terminated. I have worked around this by splitting the gtf, fasta, and vcf file (header needs to be present) and am writing this into a loop. Before proceeding I would like to output the "within-group" π at every site instead of only for variants (even if it is 0). Does anyone know how to do this?

Thanks!

singing-scientist commented 5 years ago

Greetings, @j3551ca ! Thanks for using SNPGenie.

Exactly right — for a segmented viral genome, each segment (and its variants and genes) must be analyzed separately (one chromosome, one analysis), with sites numbered in reference to just the appropriate chromosome/segment (1-based indexing, like VCF). So, for example, in influenza, a separate analysis would be conducted for HA (sites numbered in the VCF and GTF as 1 - length(HA)), for NA (sites numbered in the VCF and GTF as 1 - length(NA)), and so forth.

SNPGenie should output π at all variable sites. If the site has no variants, its π is by definition 0.

Let me know if this helps, or if I have missed the point.

Yours, Chase

j3551ca commented 5 years ago

Hi Chase,

Thanks for your response. I understand the π will be 0 at non-variant positions, but would like to output them anyways for specific downstream analysis.

Cheers!

singing-scientist commented 5 years ago

Now I understand. The reason these are not included is because, especially in very large genomes, this can be very space-inefficient. If you want all sites (even invariant ones with π = 0) listed in the file, I can suggest two approaches:

(1) It seems to me this is essentially a data wrangling problem and could best be solved with a simple join, unless you needed other parts of the output. For example, first create a table with a single column listing all site numbers, 1 - length(segment). Then perform a left join to this table with the SNPGenie output. For example, in R, you could create a data frame with a single column of site numbers; import the SNPGenie output as a second data frame; and then use the dplyr package left_join() function:

new_table <- dplyr::left_join(sites_only_table, SNPGenie_table)

(2) Alternatively, you may wish to modify SNPGenie itself. I can offer very little support, but I think you can achieve what you need by modifying output to the OUTFILE_GENE_DIV file handle. I think it will work if you comment out (i.e., place a '#' at the beginning of) or else remove the following six lines: 7000, 7058, 7061, 7119, 7122, and 7182, corresponding to the 'if' statements that require a site be polymorphic in order for it to be included in output. However, you might only get NA or empty values for most/all columns... Indeed, I went ahead and tried it this and it works. Of course, the coverage and nucleotide counts are not correct because invariant sites are not present in the VCF, but that might be okay (or could itself be solved by further wrangling and joining with other files you have).

Let me know if that helps.

j3551ca commented 5 years ago

Thanks, Chase! That works.