chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0

within-host diversity influenza whole genome #61

Closed Su06690 closed 2 years ago

Su06690 commented 2 years ago

Hi,

I want to calculate the nucleotide diversity of influenza whole-genome sequences (within each sample, within groups, and between groups). I am new to the field and followed the instructions, starting with snpgenie.pl on one VCF file (the virus sample from one patient).

This is my command:

(base) sumyat@Sus-Mac-mini SU_file % snpgenie.pl --vcfformat=4 --snpreport=2339.vcf --fastafile=ref.fasta --gtffile=ref.gtf

I got the messages below. I have attached the files for review.

I am very new to the field, so apologies if the question is not relevant.

Calculating and storing protein-coding genome and variant data (that dogma stuff)...

WARNING: SNPGenie only considers MNV's up to 5nt in length.

WARNING: Conflicting coverages reported at temp_vcf4_2339.vcf|PB2|1332

An averaging has taken place.

WARNING: In temp_vcf4_2339.vcf, the variant at site 715,

the variant data imply a negative proportion of G nucleotides: -6.99671786724416.

This may result from rounding error, in which case the number will be very small in magnitude,

variants which are assigned to the wrong site in the SNP Report, or multiple copies of the same genes/exons in the GTF file. Results at this site

may be unreliable; G prop set to 0; proceed with caution.

WARNING: In temp_vcf4_2339.vcf, at site 636,

the coverage (8445.667) does not equal the nucleotide sum (169761.000).

WARNING: In temp_vcf4_2339.vcf, M, site 306,

the reference nucleotide in the SNP Report does not match the FASTA file.

Results at this site are unreliable: proceed with caution, and consider

re-calling SNPs (or fixing the SNP Reports) to address the issue.

WARNING: In temp_vcf4_2339.vcf|PB2|1834,

the nucleotide total (which should be 100.00%) is instead: 200.00%.

This should occur only when conflicting coverages have been reported.

COMPLETED.

singing-scientist commented 2 years ago

Greetings, @Su06690! Sorry for the delay. Please feel free to reopen if there is still an issue; also, if so, please attach a set of input files so I can test them myself.

Best, Chase

Su06690 commented 2 years ago

Dear Chase,

Thank you so much for the reply. I am still getting error messages and problems with the output files. I am attaching my merged VCF file, the reference and GTF files, and the output files. I have 90 samples (one sample per individual) and would like to see the per-sample nucleotide diversity. To do that for all 90 samples, I merged all of their VCF files and ran:

% perl snpgenie.pl --minfreq=0.03 --vcfformat=4 --snpreport=merged.vcf --fastafile=A_H3N2_A_Perth_16_2009.fasta

The variants were called using freebayes and annotated with SnpEff. The GTF was converted with AGAT from the GFF3 that is available for download on the INSAFLU website.

NOTE: I also tried:

% perl snpgenie.pl --minfreq=0.03 --vcfformat=4 --fastafile=A_H3N2_A_Perth_16_2009.fasta

That run processed only one sample and not the others. When I tried --vcfformat=3 or --vcfformat=2 instead, it ran for all the samples but reported conflicting coverages.

Thank you so much.

rechasewnelsonsnpgeniewithinhostdiversityinfluenza.zip

singing-scientist commented 2 years ago

I'm sorry, your message appears incomplete and I see no attached files.

Su06690 commented 2 years ago

Hi, I updated and resent the files.

Can you let me know whether you can access them?

Thank you, Su

singing-scientist commented 2 years ago

Greetings, Su! Thanks very much, I received your files via email.

In viewing your VCF file, it seems that these variants are called against eight contigs (which are numbered 1-8), probably influenza genome segments. Unfortunately, as described in the documentation, VCF files must each correspond to a single sample, with variants called relative to exactly one reference sequence (e.g., segment; one sequence in one FASTA file).

Here, it seems the variants in the VCF file are called with respect to sequence(s) other than what is present in the FASTA file. What is needed is to run the program separately for each VCF/FASTA combination. For example, if you called the variants for a sample for segment 1, you need to run that individually — and then again for each segment. Please let me know if this makes sense!

Chase
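For readers in the same situation, the per-segment requirement described above could be met by splitting a multi-contig VCF before running the program. The following is only a minimal sketch, not part of SNPGenie: it assumes a single-sample, uncompressed VCF whose first column (CHROM) names the segment, and the function name `split_vcf_by_contig` and the output naming scheme are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def split_vcf_by_contig(vcf_path, out_dir):
    """Write one VCF per contig (CHROM value), each with the full original header.

    Assumes a plain-text, single-sample VCF. Returns {contig: output path}.
    """
    header = []
    records = defaultdict(list)
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                # Meta-information lines and the #CHROM column line
                header.append(line)
            elif line.strip():
                # CHROM is the first tab-separated field of a data line
                chrom = line.split("\t", 1)[0]
                records[chrom].append(line)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(vcf_path).stem

    written = {}
    for chrom, lines in records.items():
        # Hypothetical naming scheme: <input name>.segment_<contig>.vcf
        out_path = out_dir / f"{stem}.segment_{chrom}.vcf"
        with open(out_path, "w") as out:
            out.writelines(header)
            out.writelines(lines)
        written[chrom] = out_path
    return written
```

Each resulting file could then be passed to snpgenie.pl via --snpreport, together with a FASTA and GTF that describe only the matching segment, as in the commands used elsewhere in this thread. Contig names containing characters that are invalid in file names would need extra handling.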

Su06690 commented 2 years ago

Thank you very much for the quick reply. I will run each segment separately, as you suggest.

One question: what if I want to know π for the entire genome? Do I just calculate it for each segment and then take the mean across segments manually?

Best wishes, Su

singing-scientist commented 2 years ago

Greetings, @Su06690 !

There are actually two approaches, which (perhaps) answer slightly different questions:

  1. If there are eight segments, you could calculate π (or πN, or πS) separately for each segment and then take their average. Statistically, this has the effect of "weighting" each segment equally, regardless of its length.
  2. Alternatively, you could sum all the diffs and sites for the whole genome to obtain diffs_SUM and sites_SUM. The overall π would then be π = diffs_SUM / sites_SUM. This could also be done separately for N and S. This gives the overall π value, where all sites contribute equally (i.e., without averaging segments first).

Let me know if that makes sense! Chase
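The difference between the two approaches can be illustrated with a short sketch. It assumes that, for each segment, the number of pairwise differences (diffs) and the number of sites have already been read from the per-segment results; the function name `genome_wide_pi` and the example numbers are hypothetical and are not SNPGenie output.

```python
def genome_wide_pi(segments):
    """Return (mean_of_segments, pooled) for a list of (diffs, sites) pairs.

    Assumes every segment has sites > 0.
    """
    # Approach 1: pi per segment, then an unweighted mean across segments
    per_segment = [diffs / sites for diffs, sites in segments]
    mean_of_segments = sum(per_segment) / len(per_segment)

    # Approach 2: pool diffs and sites over the genome, then divide once
    diffs_sum = sum(diffs for diffs, _ in segments)
    sites_sum = sum(sites for _, sites in segments)
    pooled = diffs_sum / sites_sum

    return mean_of_segments, pooled


# Made-up values for two segments of different length
example = [(2.0, 2000), (1.0, 500)]
mean_pi, pooled_pi = genome_wide_pi(example)
print(f"mean of segments: {mean_pi:.4f}, pooled: {pooled_pi:.4f}")
```

With these made-up values the per-segment π values are 0.001 and 0.002, so the mean of segments is about 0.0015, while the pooled estimate is 3 / 2500 = 0.0012. The short segment pulls the unweighted mean up, which is the equal-weighting effect described in approach 1. The same function can be applied to N or S diffs and sites.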

Su06690 commented 2 years ago

Hi Chase,

Thank you so much. Finally, it ran smoothly.

What you suggest makes sense. I will try both ways.

Thank you again for the suggestion.

Best, Su

singing-scientist commented 2 years ago

I'm so glad to hear that! Sounds great, please let me know — closing this issue for now.

Yours, Chase

Su06690 commented 1 year ago

Hi Chase,

I have one more question.

I ran SNPGenie for each segment of each sample, which works great.

I also ran it with this command:

snpgenie.pl --minfreq=0.03 --vcfformat=4 --fastafile=seg1.fasta --gtffile=seg1.gtf

SNPGenie identifies all the VCF files in the same directory (VCF format 4), but it gives the results as only one file. I had 90 samples, but the results were not shown per sample; instead they appeared as one sample. How are the results calculated in this case?

Thanks so much for helping me with my questions, and apologies if this one is basic.

Best, Su

singing-scientist commented 1 year ago

Greetings, Su! Within each results file, the first column should tell you the name of the input VCF file to which the results refer. But perhaps I'm missing the point! Let me know...
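To recover per-sample rows from a combined results file, one option is to group the rows by that first column. This is a minimal sketch that assumes the results file is tab-delimited with a single header row; the function name `rows_by_input_file` and the example path are placeholders, and the column names depend on which results file is read.

```python
import csv
from collections import defaultdict


def rows_by_input_file(results_path):
    """Group the rows of a tab-delimited results file by their first column.

    Returns {first-column value: [row as a dict keyed by header name, ...]}.
    """
    grouped = defaultdict(list)
    with open(results_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)  # first line is assumed to hold column names
        for row in reader:
            if row:  # skip empty lines
                grouped[row[0]].append(dict(zip(header, row)))
    return dict(grouped)


if __name__ == "__main__":
    import sys

    # Placeholder usage: python group_results.py path/to/results_file.txt
    if len(sys.argv) > 1:
        for vcf_name, rows in rows_by_input_file(sys.argv[1]).items():
            print(vcf_name, len(rows))
```

If the 90 VCF files were all processed in one run, each distinct value in the first column should correspond to one input VCF, so the number of keys returned gives a quick check of how many samples were actually included.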