chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0

GTF problems and empty outputs #15

Closed StephanieRodrigues closed 5 years ago

StephanieRodrigues commented 5 years ago

Hello, I'm trying to use SNPGenie to calculate dN/dS in a bacterial genome, but I'm having trouble, probably because of my GTF file. I only recently started working with this type of file and I'm a beginner in bioinformatics, so I will describe exactly what I did.

First I used gffread to convert the GFF file to a GTF file. Here is the command that I used:

./gffcompare -E Aus0004.gff -T -o- | more

The program created the file gffcmp.combined.gtf, which I renamed to Aus0004.gtf (this is the name I use in the following commands).

Then I went to SNPGenie: I downloaded it and used chmod +x to make the script executable. Then I ran it with this command:

./snpgenie.pl --vcfformat=3 --snpreport=CL9800.vcf --fastafile=Ef_Aus0004.fasta --gtffile=Aus0004.gtf

And this message appeared:

you have not selected a MIN. MINOR ALLELE FREQ. All variants in the SNP report(s) will be included.

WARNING: Aus0004.gtf does not contain any sense (+) strand products. SNPGenie terminated.

I don't know what is wrong or what I need to change in the GTF file. I'm attaching the GTF file in case you need to look at it: Aus004gtf.zip

Thank you!

singing-scientist commented 5 years ago

Hello @StephanieRodrigues — thanks very much for using SNPGenie! The GTF file needs to contain 'CDS' records for the protein-coding genes you wish to analyze. Right now, I can only see 'exon' and 'transcript' records. If the exons correspond to exactly the protein-coding genes you want to analyze, you could replace 'exon' with 'CDS' throughout and see if that works. The GitHub repository contains a gtf example, and exact specifications are also described here: https://github.com/chasewnelson/SNPGenie#gtf

Let me know... C
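
The 'exon' to 'CDS' substitution suggested here can be sketched in Python. This is only an illustration, assuming a tab-delimited, nine-column GTF in which the feature type is in column 3; the function name `exon_to_cds` and the example records (sequence name, coordinates, IDs) are made up and are not taken from the attached file or from SNPGenie itself.

```python
def exon_to_cds(gtf_lines):
    """Return GTF lines with the feature column (column 3) changed from 'exon' to 'CDS'."""
    out = []
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        # Only touch well-formed, non-comment records whose feature is exactly 'exon'
        if not line.startswith("#") and len(fields) == 9 and fields[2] == "exon":
            fields[2] = "CDS"
        out.append("\t".join(fields))
    return out


# Hypothetical records in the style of a gffcompare-generated GTF
example = [
    'chr1\tgffcmp\ttranscript\t100\t400\t.\t+\t.\tgene_id "XLOC_000001"; transcript_id "TCONS_00000001";',
    'chr1\tgffcmp\texon\t100\t400\t.\t+\t.\tgene_id "XLOC_000001"; transcript_id "TCONS_00000001";',
]
converted = exon_to_cds(example)
for record in converted:
    print(record)
```

Editing only column 3 leaves the attribute column alone, so an ID that happens to contain the text "exon" is not changed. This is only appropriate when every exon corresponds exactly to a protein-coding region, as noted above.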

StephanieRodrigues commented 5 years ago

Hi Chase, thank you so much. Yes, looking at my GFF file, I see that the RefSeq gene record was converted to a RefSeq transcript record, and the Protein Homology line was converted to an exon record. So I need to change the transcript line to CDS instead of the exon line, because that is the line with my gene_id, right? Also, should I delete the exon line?

Regards.

singing-scientist commented 5 years ago

Right, if the format matches what's described in the documentation, there should be no problem. It should not be necessary to delete the 'transcript' or 'exon' lines, as long as the 'CDS' lines are present. Let me know...

StephanieRodrigues commented 5 years ago

It partly worked: I replaced it with CDS and SNPGenie now recognizes the records, but now I get the following problem:

WARNING: CDS annotation(s) in Aus0004_new_mod2.gtf does not have a gene_id. SNPGenie terminated.

I'm sending the new GTF; gene_id is there. I tried replacing every *_id in the last column with gene_id and it still didn't work. A thousand apologies for the inconvenience. Aus0004_new_mod2.gtf.zip
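
A quick pre-flight check can help narrow down warnings like the two seen in this thread. The sketch below counts CDS records per strand and lists CDS lines with no well-formed gene_id "..." attribute. It is not SNPGenie's own parsing logic: the function name `summarize_cds`, the regular expression, and the example records are assumptions for illustration only.

```python
import re

# Assumed shape of a well-formed attribute: gene_id "some_name"
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def summarize_cds(gtf_lines):
    """Count CDS records per strand and list CDS lines lacking a gene_id "..." attribute."""
    plus, minus, missing = 0, 0, []
    for number, line in enumerate(gtf_lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # Skip comments, malformed rows, and non-CDS features
        if line.startswith("#") or len(fields) != 9 or fields[2] != "CDS":
            continue
        if fields[6] == "+":
            plus += 1
        elif fields[6] == "-":
            minus += 1
        if not GENE_ID.search(fields[8]):
            missing.append(number)
    return {"plus": plus, "minus": minus, "missing_gene_id": missing}


# Hypothetical records: line 2 has doubled quotes, line 3 is not a CDS
lines = [
    'chr1\tsrc\tCDS\t100\t400\t.\t+\t0\tgene_id "geneA";',
    'chr1\tsrc\tCDS\t500\t800\t.\t+\t0\tgene_id ""geneB"";',
    'chr1\tsrc\texon\t500\t800\t.\t+\t.\tgene_id "geneB";',
]
report = summarize_cds(lines)
print(report)
```

A "plus" count of zero would be consistent with the first warning (no sense-strand products), and any entries under "missing_gene_id" point to the lines worth inspecting by eye.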

singing-scientist commented 5 years ago

The gtf file must be formatted as described in the documentation and example: https://github.com/chasewnelson/SNPGenie#gtf

It looks like all your gene_id values are wrapped in doubled double quotes, e.g., gene_id ""XLOC_000001"" when it should be gene_id "XLOC_000001", which is likely causing the problem.
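
One way to repair the doubled quotes is a small regular-expression substitution, sketched here in Python. The helper name `fix_doubled_quotes` is hypothetical, and the pattern assumes each affected value looks like ""value"" with no quote characters inside the value.

```python
import re

# Matches ""value"" and captures the value between the doubled quotes
DOUBLED = re.compile(r'""([^"]*)""')


def fix_doubled_quotes(line):
    """Rewrite every ""value"" in a GTF line as "value"."""
    return DOUBLED.sub(r'"\1"', line)


broken = 'gene_id ""XLOC_000001""; transcript_id ""TCONS_00000001"";'
fixed = fix_doubled_quotes(broken)
print(fixed)
```

A line that is already correctly quoted contains no `""` sequence, so it passes through unchanged, which makes it safe to run the helper over the whole file.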

StephanieRodrigues commented 5 years ago

OMG, it's finally working! Chase, thank you so much for your patience with my really small mistakes!

singing-scientist commented 5 years ago

No problem! I'm glad it is working at last. Don't hesitate to let me know if you have any further questions.