deweylab / RSEM

RSEM: accurate quantification of gene and isoform expression from RNA-Seq data
http://deweylab.biostat.wisc.edu/rsem/
GNU General Public License v3.0
404 stars 117 forks source link

rsem-gff3-to-gtf Fails on GFF3 file from NCBI #92

Open winni2k opened 6 years ago

winni2k commented 6 years ago

I would like some help understanding why rsem-gff3-to-gtff is failing.

When I run

rsem-gff3-to-gtf data/yeast/GCF_000146045.2_R64_genomic.gff output

then I get the error message

Traceback (most recent call last):
  File "rsem-gff3-to-gtf", line 254, in <module>
    flush_out(fout)
  File "rsem-gff3-to-gtf", line 208, in flush_out
    for start, end in transcript:
TypeError: iter() returned non-iterator of type 'Transcript'

I downloaded the GFF3 file from NCBI at ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Saccharomyces_cerevisiae/latest_assembly_versions/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gff.gz

Here is the header of that file:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build R64
#!genome-build-accession NCBI_Assembly:GCF_000146045.2
#!annotation-source SGD R64-2-1
##sequence-region NC_001133.9 1 230218
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292
NC_001133.9     RefSeq  region  1       230218  .       +       .       ID=id0;Dbxref=taxon:559292;Name=I;chromosome=I;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=S288c
NC_001133.9     RefSeq  telomere        1       801     .       -       .       ID=id1;Dbxref=SGD:S000028862;Note=TEL01L%3B Telomeric region on the left arm of Chromosome I%3B composed of an X element core sequence%2C X element combinatorial repeats%2C and a short terminal stretch of telomeric repeats;gbkey=telomere

When I run

rsem-gff3-to-gtf --version

Then I get this output:

rsem-gff3-to-gtf: error: the following arguments are required: input_GFF3_file, output_GTF_file

usage: rsem-gff3-to-gtf [-h] [--RNA-patterns <patterns>]
                        [--extract-sequences <output.fa>]
                        input_GFF3_file output_GTF_file

Convert GFF3 files to GTF files.

positional arguments:
  input_GFF3_file       Input GFF3 file.
  output_GTF_file       Output GTF file.

optional arguments:
  -h, --help            show this help message and exit
  --RNA-patterns <patterns>
                        Types of RNAs to be extracted, e.g. mRNA,rRNA
                        (default: None)
  --extract-sequences <output.fa>
                        If GFF3 file contains reference sequences, extract
                        them to the specified file (default: None)
winni2k commented 6 years ago

According to conda I have version 1.3.0 installed:

$ conda list | grep rsem
rsem                      1.3.0               boost1.64_3    bioconda
winni2k commented 6 years ago

Version 1.2.28 runs without throwing an exception, and the output file is not empty.

EidrianGM commented 5 years ago

The same happens to me when running:

rsem-gff3-to-gtf GCF_000001405.38_GRCh38.p12_genomic.gff human_refseq.gtf [...] Loaded 3500000 lines Loaded 3600000 lines Cannot recognize transcript rna3's parent rna2, a gene feature might be missing. [...] failed! Plase check if you provide correct parameters/options for the pipeline!

I tried using the alternative gffread but then rsem finds corrupted the output gtf when trying to create the reference genome:

The GTF file might be corrupted! Stop at line : NC_000001.11 BestRefSeq exon 17369 17391 . - . transcript_id "rna3"; Error Message: Cannot find gene_id!

winni2k commented 5 years ago

I'm not sure this helps, but I have abandoned trying to convert GFFs to GTFs. Instead I now just use the files provided by ENSEMBL at ftp://ftp.ensembl.org/pub/release-95 (They come with versions!).

EidrianGM commented 5 years ago

Thanks for your quick reply @winni2k ... I think you are right, it will be for the best to use directly ensembl genome and gtf. I am a little obfuscated with this issue since I wanted to use the latest ncbi Hs reference GRCh38.p12 (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.38_GRCh38.p12) though I guess there shouldn't be much different.

If it helps to someboy RSEM could successfully use the gtf outputed by gffread specifying the flag -C (coding only discard mRNAs that have no CDS features) when transforming the ncbi gff:

gffread GCF_000001405.38_GRCh38.p12_genomic.gff -E -T -C -o GCF_000001405.38_GRCh38.p12_genomic.gtf

winni2k commented 5 years ago

Alas, the mismatching reference genome is an issue. I did end up switching to the Ensembl reference genome as well for my analyses :/

EidrianGM commented 5 years ago

Now It seems that I have made possible to use ncbi gff and reference withoutproblems, It just needed to have the flag --gff3-RNA-patterns mRNA,rRNA

This was my command:

rsem-prepare-reference --gff3 _GCF_000001405.38_GRCh38.p12genomic.gff --gff3-RNA-patterns mRNA,rRNA --trusted-sources BestRefSeq,Curated\ Genomic --star --star-path _/SeqTools/STAR-2.6.0a/bin/Linux_x8664 _GCF_000001405.38_GRCh38.p12primassembly.fna ./human_refseq

I hope it suits you well