chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
109 stars 37 forks source link

Empty output #58

Closed White-Shinobi closed 2 years ago

White-Shinobi commented 2 years ago

Hi Chase,

I'm trying to run SNPgenie on a reference sequence (100002) that I've mapped from 3 different samples. However, the results file are always empty and no any alarm. I have checked the input for many times but could you also check them? Thank you so much for your help.

I've called the SNPs using GT-Pro (https://www.nature.com/articles/s41587-021-01102-3), and produced the GTF file using prodigal. Here I showed all the input I used 👉 (https://github.com/White-Shinobi/SNPGenie-pipeline.git). Parameters: MINIMUM ALLELE FREQUENCY: Default used; all SNPs included SLIDING WINDOW LENGTH: 20 MULTI-FASTA MODE: Default used; No SNP REPORTS: Default auto-detected file(s): test.vcf test_revcom.vcf REFERENCE FASTA FILE: Default auto-detected file(s): 100002.fasta GTF FILE: Default auto-detected file: UHGG_909species.gtf COMPLEMENT MODE: Yes

Best, Yue

singing-scientist commented 2 years ago

Greetings, Yue! Thanks so much for using SNPGenie, and sorry for the trouble.

I have two suggestions to start:

  1. As of now, the VCF file line endings are Windows (CRLF) rather than Unix (LF). If you change them to LF as suggested in the Troubleshooting, does it work?
  2. If (1) does not work, does deleting all non-alphanumeric characters (: and .) in column 2 of the GTF file solve the issue?

If those do not work, please send the exact command line you submitted (i.e., snpgenie.pl ...) so that I may reproduce the issue with these files.

Hope this helps! Chase

White-Shinobi commented 2 years ago

Hi Chase,

Thank you so much for your quick response and suggestions. I have tried all of them but it still can not work.

Here is the code that I am using: snpgenie.pl --vcfformat=4 --slidingwindow=20

Best, Yue

Chase W. Nelson 倪誠志 @.***> 于2022年4月11日周一 09:23写道:

Greetings, Yue! Thanks so much for using SNPGenie, and sorry for the trouble.

I have two suggestions to start:

  1. As of now, the VCF file line endings are Windows (CRLF) rather than Unix (LF). If you change them to LF as suggested in the Troubleshooting https://github.com/chasewnelson/SNPGenie#troubleshooting, does it work?
  2. If (1) does not work, does deleting all non-alphanumeric characters (: and .) in column 2 of the GTF file solve the issue?

If those do not work, please send the exact command line you submitted (i.e., snpgenie.pl ...) so that I may reproduce the issue with these files.

Hope this helps! Chase

— Reply to this email directly, view it on GitHub https://github.com/chasewnelson/SNPGenie/issues/58#issuecomment-1094645039, or unsubscribe https://github.com/notifications/unsubscribe-auth/AYL642WK65YL4SXOEJJGE2DVEPHQTANCNFSM5TBJSEOA . You are receiving this because you authored the thread.Message ID: @.***>

singing-scientist commented 2 years ago

Greetings Yue!

I tried it on your input files and it died right away with a warning:

### WARNING: for ID 100002_911377, the value of AD is "3".
### WARNING: the sample values are ""0/0:3:3:0:3,0:1,0:3:0:1:0"".
### WARNING: the line is:
100002  911377  100002_911377   T   C   100 PASS    .   GT:AD2:DP:DS:AD:AF2:AD_REF:AD_ALT:AF_REF:AF_ALT "0/0:3:3:0:3,0:1,0:3:0:1:0"
### For vcfformat==4, AD must list numbers of reads for one SNP as follows: <REF>,<ALT>. SNPGenie TERMINATED.

If you do not receive this warning, it is likely that your files are formatted as Windows line endings (e.g., were saved from Excel in Windows) and will need to converted to Unix line endings as described in the Troubleshooting.

Looking at the SNPGenie VCF documentation, I think it's correct that --vcfformat=4 is the closest to yours. Unfortunately I believe the bug may be due to the use of multiple 'AD' flags, i.e., 'AD2', 'AD_REF', and 'AD_ALT'. Indeed, if I change their names, it runs. Specifically, I needed to do the following:

  1. CHANGE the 'AD2', 'AD_REF', and 'AD_ALT' to something without 'AD' in them, or simply remove them and their data;
  2. REMOVE the quotations ("") surrounding the column, which was likely introduced by Excel at some point

Once those two steps are done, it seems to work properly. I recommending doing this if you want a quick fix. Note that all polymorphic sites appear to be in non-protein-coding regions.

Finally, I advise submitting one VCF file at a time explicitly at the command line using the --snpreport argument, particularly because the GTF file has protein-coding genes on each strand.

Let me know if that helps! Chase

White-Shinobi commented 2 years ago

Hi Chase,

I corrected the vcf file in the way you suggested and it worked!

I just can't tell how much I appreciated that you responded so quickly and tested the codes yourself. You really helped us a lot. Hope we can find something interesting in our study and cite your paper in the future😄🌞

Best, Yue

Chase W. Nelson 倪誠志 @.***> 于2022年4月14日周四 19:24写道:

Greetings Yue!

I tried it on your input files and it died right away with a warning:

WARNING: for ID 100002_911377, the value of AD is "3".

WARNING: the sample values are ""0/0:3:3:0:3,0:1,0:3:0:1:0"".

WARNING: the line is:

100002 911377 100002_911377 T C 100 PASS . GT:AD2:DP:DS:AD:AF2:AD_REF:AD_ALT:AF_REF:AF_ALT "0/0:3:3:0:3,0:1,0:3:0:1:0"

For vcfformat==4, AD must list numbers of reads for one SNP as follows: ,. SNPGenie TERMINATED.

If you do not receive this warning, it is likely that your files are formatted as Windows line endings (e.g., were saved from Excel in Windows) and will need to converted to Unix line endings as described in the Troubleshooting https://github.com/chasewnelson/SNPGenie#troubleshooting .

Looking at the SNPGenie VCF documentation https://github.com/chasewnelson/SNPGenie#vcf, I think it's correct that --vcfformat=4 is the closest to yours. Unfortunately I believe the bug may be due to the use of multiple 'AD' flags, i.e., 'AD2', 'AD_REF', and 'AD_ALT'. Indeed, if I change their names, it runs. Specifically, I needed to do the following:

  1. CHANGE the 'AD2', 'AD_REF', and 'AD_ALT' to something without 'AD' in them, or simply remove them and their data;
  2. REMOVE the quotations ("") surrounding the column, which was likely introduced by Excel at some point

Once those two steps are done, it seems to work properly. I recommending doing this if you want a quick fix. Note that all polymorphic sites appear to be in non-protein-coding regions.

Finally, I advise submitting one VCF file at a time explicitly at the command line using the --snpreport argument, particularly because the GTF file has protein-coding genes on each strand.

Let me know if that helps! Chase

— Reply to this email directly, view it on GitHub https://github.com/chasewnelson/SNPGenie/issues/58#issuecomment-1099440731, or unsubscribe https://github.com/notifications/unsubscribe-auth/AYL642VGFAVNU5PF2TP6B33VFBIEVANCNFSM5TBJSEOA . You are receiving this because you authored the thread.Message ID: @.***>

White-Shinobi commented 2 years ago

Hi Chase,

Sorry to bother you again. All the pipeline worked well but we met another nasty problem: we are using SNPGenie to estimate πN/πS of a bacteria. We have called bacterial SNPs and generated the vcf file. The vcf file contained only one contig and 34710 different positions. And the gtf file contained 3534 CDS annotated for this bacteria. And total samples size is 190. However, the calculation time is too long (like >48hrs). Do you have any ideas whether SNPGenie can run in parallel? Or any other ways to make the time needed shorter?

Best, Yue

Yue Zhang @.***> 于2022年4月30日周六 10:31写道:

Hi Chase,

I corrected the vcf file in the way you suggested and it worked!

I just can't tell how much I appreciated that you responded so quickly and tested the codes yourself. You really helped us a lot. Hope we can find something interesting in our study and cite your paper in the future😄🌞

Best, Yue

Chase W. Nelson 倪誠志 @.***> 于2022年4月14日周四 19:24写道:

Greetings Yue!

I tried it on your input files and it died right away with a warning:

WARNING: for ID 100002_911377, the value of AD is "3".

WARNING: the sample values are ""0/0:3:3:0:3,0:1,0:3:0:1:0"".

WARNING: the line is:

100002 911377 100002_911377 T C 100 PASS . GT:AD2:DP:DS:AD:AF2:AD_REF:AD_ALT:AF_REF:AF_ALT "0/0:3:3:0:3,0:1,0:3:0:1:0"

For vcfformat==4, AD must list numbers of reads for one SNP as follows: ,. SNPGenie TERMINATED.

If you do not receive this warning, it is likely that your files are formatted as Windows line endings (e.g., were saved from Excel in Windows) and will need to converted to Unix line endings as described in the Troubleshooting https://github.com/chasewnelson/SNPGenie#troubleshooting.

Looking at the SNPGenie VCF documentation https://github.com/chasewnelson/SNPGenie#vcf, I think it's correct that --vcfformat=4 is the closest to yours. Unfortunately I believe the bug may be due to the use of multiple 'AD' flags, i.e., 'AD2', 'AD_REF', and 'AD_ALT'. Indeed, if I change their names, it runs. Specifically, I needed to do the following:

  1. CHANGE the 'AD2', 'AD_REF', and 'AD_ALT' to something without 'AD' in them, or simply remove them and their data;
  2. REMOVE the quotations ("") surrounding the column, which was likely introduced by Excel at some point

Once those two steps are done, it seems to work properly. I recommending doing this if you want a quick fix. Note that all polymorphic sites appear to be in non-protein-coding regions.

Finally, I advise submitting one VCF file at a time explicitly at the command line using the --snpreport argument, particularly because the GTF file has protein-coding genes on each strand.

Let me know if that helps! Chase

— Reply to this email directly, view it on GitHub https://github.com/chasewnelson/SNPGenie/issues/58#issuecomment-1099440731, or unsubscribe https://github.com/notifications/unsubscribe-auth/AYL642VGFAVNU5PF2TP6B33VFBIEVANCNFSM5TBJSEOA . You are receiving this because you authored the thread.Message ID: @.***>

singing-scientist commented 2 years ago

Greetings, Yue @White-Shinobi! I'm so glad it got working — but sorry for the runtime issue. Indeed, SNPGenie was written (without much foresight) for application to viruses. It'll work for longer stuff, but it's indeed slow. (Hopefully a future Python version will be more scalable.) I once ran it for human chromosome 1 (~249 Mbp) with 1KG SNPs and it took about 48 hours; however, I do not remember number of SNPs and it's possible you have more.

Normally the finest degree of parallelization is by sample, i.e., you could submit an array of 190 jobs (one for each sample). However, if you do not have access to such computational resources, some things that can help:

  1. Quality filter SNPs BEFORE running SNPGenie. This is necessary for quality π results anyway, but still a good thing to double and triple check. For example, you'd want to first eliminate any low-frequency SNPs that are likely to be errors given your sequencing technology's error rate (e.g., 1% or 5% are common cutoffs for Illumina). Criteria for depth, number of REF and ALT reads, strand bias, etc., should also be considered to ensure errors are being excluded.
  2. Limit one sample per VCF file and parallelize. Often, VCF files each contain ONE sample, whereas I think yours each contained multiple samples (i.e., multiple columns after INFO). This could be a source of much redundancy, because typically only one of the samples has a SNP at a given position — but SNPGenie still has to process every other sample for every variant when the input is done this way. For example, right now you might be using one VCF file with 190 columns after INFO, but you could instead use 190 VCF files, each with just 1 sample column.
  3. Parallelize by region. This would involve ~manually creating separate sets of files for each region, e.g., set1 might be for the first 1 Mbp, where you'd use the original reference FASTA but make custom GTF and VCF files containing just those genes and variants, respectively, falling within that region (1-1Mbp, or whatever).

Let me know if any of that helps, or if you have any other questions! Chase

White-Shinobi commented 2 years ago

Hi Chase,

Thank you so much for your response. As you suggested, I conducted filtering and split the 190 samples into 190 vcf files. That helped a looooot. Thank you again for your great work.

Best, Yue

Chase W. Nelson 倪誠志 @.***> 于2022年5月3日周二 19:43写道:

Greetings, Yue @White-Shinobi https://github.com/White-Shinobi! I'm so glad it got working — but sorry for the runtime issue. Indeed, SNPGenie was written (without much foresight) for application to viruses. It'll work for longer stuff, but it's indeed slow. (Hopefully a future Python version will be more scalable.) I once ran it for human chromosome 1 (~249 Mbp) with 1KG SNPs and it took about 48 hours; however, I do not remember number of SNPs and it's possible you have more.

Normally the finest degree of parallelization is by sample, i.e., you could submit an array of 190 jobs (one for each sample). However, if you do not have access to such computational resources, some things that can help:

  1. Quality filter SNPs BEFORE running SNPGenie. This is necessary for quality π results anyway, but still a good thing to double and triple check. For example, you'd want to first eliminate any low-frequency SNPs that are likely to be errors given your sequencing technology's error rate (e.g., 1% or 5% are common cutoffs for Illumina). Criteria for depth, number of REF and ALT reads, strand bias, etc., should also be considered to ensure errors are being excluded.
  2. Limit one sample per VCF file and parallelize. Often, VCF files each contain ONE sample, whereas I think yours each contained multiple samples (i.e., multiple columns after INFO). This could be a source of much redundancy, because typically only one of the samples has a SNP at a given position — but SNPGenie still has to process every other sample for every variant when the input is done this way. For example, right now you might be using one VCF file with 190 columns after INFO, but you could instead use 190 VCF files, each with just 1 sample column.
  3. Parallelize by region. This would involve ~manually creating separate sets of files for each region, e.g., set1 might be for the first 1 Mbp, where you'd use the original reference FASTA but make custom GTF and VCF files containing just those genes and variants, respectively, falling within that region (1-1Mbp, or whatever).

Let me know if any of that helps, or if you have any other questions! Chase

— Reply to this email directly, view it on GitHub https://github.com/chasewnelson/SNPGenie/issues/58#issuecomment-1116373295, or unsubscribe https://github.com/notifications/unsubscribe-auth/AYL642RJWGYE6SJMVOLWG43VIFQSRANCNFSM5TBJSEOA . You are receiving this because you were mentioned.Message ID: @.***>

singing-scientist commented 2 years ago

My pleasure — I'm so happy to hear that this worked! I'll close the issue now, but please feel free to re-open if any other questions arise.

Yours, Chase