GTF file does not contain any sense (+) strand products

chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data

GNU General Public License v3.0

109 stars 37 forks source link

GTF file does not contain any sense (+) strand products #65

Open dferran1 opened 2 years ago

dferran1 commented 2 years ago

Hello,

I am attempting to use SNPGenie to analyze synonymous and non-synonymous SNPs in a VCF file. I used gffread to generate a gtf file from my gff3 file and then used grep to select only the exons in the gtf file. I get a message from SNPGenie that the gtf file "does not contain any sense (+) strand products" but on visual inspection it has both + and - in the strand column (see attachment as a .txt as Github did not allow a .gtf). What is the best way to resolve this issue?

The bash command is as follows: perl $snpgenie --fastafile=$reference.chrI.linearized.fa --gtffile=$gff_file.exons.gtf --snpreport=$vcf --vcfformat=4

Thank you,

David Ferranti stickleback_v5_ensembl_genes.exons.gtf.txt

singing-scientist commented 2 years ago

Greetings! I notice that the GTF file has records for multiple chromosomes/contigs. Unfortunately, as described in the documentation, SNPGenie requires that each run must correspond to exactly one input sequence (i.e. exactly one reference contig in one FASTA file). Thus, it will need to be run separately for each chromosome — one run each, with the appropriate VCF/FASTA/GTF combination containing records for just that chrom. Let me know if that solves the issue!

dferran1 commented 2 years ago

Hello,

Thank you for the quick response--restricting the gtf file to just chromosome 1 gives the same error message.

singing-scientist commented 2 years ago

Understood. Please provide a small, fully reproducible example (i.e., one FASTA, one GTF, one VCF, and the exact command line used) and I will get to this ASAP! (Apologies for any delays as I am traveling.)

dferran1 commented 2 years ago

Sure thing--I greatly appreciate you taking the time to help me get this set up.

Here are the three files and the script I use to reformat them and run SNPGenie:

https://drive.google.com/file/d/1NQ6ofZBBGGukO2oifmCaP23IolY9HJT2/view?usp=sharing

singing-scientist commented 2 years ago

Thanks very much! Could you provide a small example, e.g., much less than 500 Mb? For example, input files for the smallest chromosome (or just a fraction of it)?

dferran1 commented 2 years ago

Here's the same set of files/script but tweaked to only extract the mitochondrial genome:

https://drive.google.com/file/d/1zXQGYdOejHdQVDALtUiJqNeD04mEIRZI/view?usp=sharing

singing-scientist commented 2 years ago

Hello David. I downloaded the files, which includes a FASTA file of 473 Mb and 25 chromosomes. However, I need a minimal example that is appropriate for SNPGenie, as described in the documentation on this GitHub: one FASTA with one sequence inside; one VCF with variants called relative to that one sequence; and a GTF for the protein-coding regions of that one sequence. Please take a close look at the documentation. If there is still an issue once input specifications are met, feel free to send me a set of example files so I may reproduce the error!

dferran1 commented 2 years ago

Hmmm the bash script should have produced the mitochondrial genome-only intermediate files. Anyways here they are: SNP_Genie_mtDNA_Files.zip

And the SNPGenie command: perl $snpgenie --fastafile=stickleback_v5_assembly.pitx1.chrM.linearized.fa --gtffile=stickleback_v5_ensembl_genes.exons.chrM.gtf --snpreport=Marine.Samples.subsampled.blackspotted.remove.ancient.filtered.bedtools.chrM.vcf --vcfformat=4

Just replace @snpgenie with the appropriate path.

singing-scientist commented 2 years ago

Greetings, David! Each coding product must be labeled as type CDS (see the GTF paragraph of the documentation. If you replace 'exon' with 'CDS' in your file, it should work. Let me know!

dferran1 commented 2 years ago

Hello,

Ah--I see. From what I can tell using a GTF with only CDS seems to fix that issue. Unfortunately I am now getting an error that some genes do not have coordinates with a complete set of codons (number of nucleotides must be a multiple of 3). I think I need to go back and troubleshoot the assembly/annotation before I can proceed.

Thanks again for all your assistance with SNPGenie.

Best,

David

Rohit-Satyam commented 1 year ago

Did this issue ever got resolved? I get a similar error. I must also admit that this game of manually altering fasta and gtf file is bizarre and confusing and should be automated. I spent the entire day altering the gtf in multiple ways and yet it didn't run.

singing-scientist commented 1 year ago

Greetings @Rohit-Satyam! Thanks very much. Indeed in an ideal world of infinite funding and time, all would be made to work for all conceivable input! Regarding the FASTA, there should be no need to alter anything — the input requirement is simply for a single reference sequence. Regarding the GTF file, the requirement is for non-redundant CDS records. If you have checked that your input meet these criteria and are still getting the error, please feel free to share a small reproducible example (FASTA, VCF, GTF) and I'd be happy to investigate what's happening.