deweylab / RSEM

RSEM: accurate quantification of gene and isoform expression from RNA-Seq data
http://deweylab.biostat.wisc.edu/rsem/
GNU General Public License v3.0
408 stars 118 forks source link

rsem-prepare-reference Refseq #29

Closed jmwhitha closed 7 years ago

jmwhitha commented 7 years ago

Hello,

When I run rsem-prepare-reference with refseq it only captures "exon" in my gff3 file. I'm not sure why the gff3 even has exons because I work with a bacterium. I only realized this issue because the output from rsem-calculate-expression was a very short list of genes. Any suggestion on how I can get rsem-prepare-reference to capture "gene" rather than "exon"? Would it be okay just to change the third column values in GFF3 from "gene" to "exon"? The concern that I have with doing that is that all the ones that currently have "exon" in the third column also have and ID=rna#;Parent=gene# in the 9th column, and these are put in the second and first column of the rsem-calculate-expression output, respectively. I'm not sure, but there might also or alternatively be an issue associated with transcript sources. Only lines from the gff3 file with tRNAscan-SE in the second colomn, not RefSeq, are retained after rsem-prepare-reference. These lines are the same lines with "exon" in the third column. Before I start manually editing the gff3 file, I was wondering if I'm missing something critical in the pipeline directions. In case you'd like to look at the gff3 file, it's at the url below. And so are some lines of the gff3 file and all of the lines of the short RSEM.genes.results file. An example of "exon" in the third column is close to the end of the gff3 file excerpt.

Thank you for your help!

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Ruminiclostridium_thermocellum/all_assembly_versions/GCF_000184925.1_ASM18492v1/

gff-version 3

!gff-spec-version 1.21

!processor NCBI annotwriter

!genome-build ASM18492v1

!genome-build-accession NCBI_Assembly:GCF_000184925.1

sequence-region NC_017304.1 1 3561619

species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=637887

NC_017304.1 RefSeq region 1 3561619 . + . ID=id0;Dbxref=taxon:637887;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=DSM 1313 NC_017304.1 RefSeq gene 212 1543 . + . ID=gene0;Name=CLO1313_RS00010;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00010;old_locus_tag=Clo1313_0001 NC_017304.1 Protein Homology CDS 212 1543 . + 0 ID=cds0;Parent=gene0;Dbxref=Genbank:WP_003516465.1;Name=WP_003516465.1;gbkey=CDS;product=chromosomal replication initiator protein DnaA;protein_id=WP_003516465.1;transl_table=11 NC_017304.1 RefSeq gene 1793 2893 . + . ID=gene1;Name=CLO1313_RS00015;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00015;old_locus_tag=Clo1313_0002 NC_017304.1 Protein Homology CDS 1793 2893 . + 0 ID=cds1;Parent=gene1;Dbxref=Genbank:WP_003513339.1;Name=WP_003513339.1;gbkey=CDS;product=DNA polymerase III subunit beta;protein_id=WP_003513339.1;transl_table=11 NC_017304.1 RefSeq gene 2909 3115 . + . ID=gene2;Name=CLO1313_RS00020;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00020;old_locus_tag=Clo1313_0003 NC_017304.1 Protein Homology CDS 2909 3115 . + 0 ID=cds2;Parent=gene2;Dbxref=Genbank:WP_003513337.1;Name=WP_003513337.1;gbkey=CDS;product=RNA-binding protein S4;protein_id=WP_003513337.1;transl_table=11 NC_017304.1 RefSeq gene 3133 4242 . + . ID=gene3;Name=CLO1313_RS00025;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00025;old_locus_tag=Clo1313_0004 NC_017304.1 Protein Homology CDS 3133 4242 . + 0 ID=cds3;Parent=gene3;Dbxref=Genbank:WP_003513335.1;Name=WP_003513335.1;gbkey=CDS;product=DNA recombination protein RecF;protein_id=WP_003513335.1;transl_table=11 NC_017304.1 RefSeq gene 4410 4724 . + . ID=gene4;Name=CLO1313_RS00030;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00030;old_locus_tag=Clo1313_0005 NC_017304.1 Protein Homology CDS 4410 4724 . + 0 ID=cds4;Parent=gene4;Dbxref=Genbank:WP_003513333.1;Name=WP_003513333.1;gbkey=CDS;product=DUF370 domain-containing protein;protein_id=WP_003513333.1;transl_table=11 NC_017304.1 RefSeq gene 4773 6698 . + . ID=gene5;Name=gyrB;gbkey=Gene;gene=gyrB;gene_biotype=protein_coding;locus_tag=CLO1313_RS00035;old_locus_tag=Clo1313_0006 NC_017304.1 Protein Homology CDS 4773 6698 . + 0 ID=cds5;Parent=gene5;Dbxref=Genbank:WP_003513331.1;Name=WP_003513331.1;Note=negatively supercoils closed circular double-stranded DNA;gbkey=CDS;gene=gyrB;product=DNA gyrase subunit B;protein_id=WP_003513331.1;transl_table=11 NC_017304.1 RefSeq gene 7685 8461 . + . ID=gene6;Name=CLO1313_RS00040;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00040;old_locus_tag=Clo1313_0007 NC_017304.1 Protein Homology CDS 7685 8461 . + 0 ID=cds6;Parent=gene6;Dbxref=Genbank:WP_003513308.1;Name=WP_003513308.1;gbkey=CDS;product=sporulation initiation inhibitor Soj;protein_id=WP_003513308.1;transl_table=11 NC_017304.1 RefSeq gene 8458 9327 . + . ID=gene7;Name=CLO1313_RS00045;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00045;old_locus_tag=Clo1313_0008 NC_017304.1 Protein Homology CDS 8458 9327 . + 0 ID=cds7;Parent=gene7;Dbxref=Genbank:WP_003513306.1;Name=WP_003513306.1;gbkey=CDS;product=chromosome partitioning protein ParB;protein_id=WP_003513306.1;transl_table=11 NC_017304.1 RefSeq gene 9889 10398 . + . ID=gene8;Name=CLO1313_RS00050;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00050;old_locus_tag=Clo1313_0009 NC_017304.1 Protein Homology CDS 9889 10398 . + 0 ID=cds8;Parent=gene8;Dbxref=Genbank:WP_003513304.1;Name=WP_003513304.1;gbkey=CDS;product=hypothetical protein;protein_id=WP_003513304.1;transl_table=11 NC_017304.1 RefSeq gene 10419 11159 . + . ID=gene9;Name=CLO1313_RS00055;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00055;old_locus_tag=Clo1313_0010 NC_017304.1 Protein Homology CDS 10419 11159 . + 0 ID=cds9;Parent=gene9;Dbxref=Genbank:WP_003513303.1;Name=WP_003513303.1;gbkey=CDS;product=hypothetical protein;protein_id=WP_003513303.1;transl_table=11 NC_017304.1 RefSeq gene 11323 12594 . + . ID=gene10;Name=CLO1313_RS00060;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00060;old_locus_tag=Clo1313_0011 NC_017304.1 Protein Homology CDS 11323 12594 . + 0 ID=cds10;Parent=gene10;Dbxref=Genbank:WP_003513301.1;Name=WP_003513301.1;gbkey=CDS;product=serine--tRNA ligase;protein_id=WP_003513301.1;transl_table=11 NC_017304.1 RefSeq gene 12849 12939 . + . ID=gene11;Name=CLO1313_RS00065;gbkey=Gene;gene_biotype=tRNA;locus_tag=CLO1313_RS00065;old_locus_tag=Clo1313_R0001 NC_017304.1 tRNAscan-SE tRNA 12849 12939 . + . ID=rna0;Parent=gene11;anticodon=(pos:12884..12886);gbkey=tRNA;product=tRNA-Ser NC_017304.1 tRNAscan-SE exon 12849 12939 . + . ID=id1;Parent=rna0;anticodon=(pos:12884..12886);gbkey=tRNA;product=tRNA-Ser NC_017304.1 RefSeq gene 12969 13062 . + . ID=gene12;Name=CLO1313_RS00070;gbkey=Gene;gene_biotype=tRNA;locus_tag=CLO1313_RS00070;old_locus_tag=Clo1313_R0002 NC_017304.1 tRNAscan-SE tRNA 12969 13062 . + . ID=rna1;Parent=gene12;anticodon=(pos:13004..13006);gbkey=tRNA;product=tRNA-Ser NC_017304.1 tRNAscan-SE exon 12969 13062 . + . ID=id2;Parent=rna1;anticodon=(pos:13004..13006);gbkey=tRNA;product=tRNA-Ser NC_017304.1 RefSeq gene 13113 14513 . - . ID=gene13;Name=CLO1313_RS00075;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00075;old_locus_tag=Clo1313_0012

RSEM.genes.results gene_id transcript_id(s) length effective_length expected_count TPM FPKM gene1076 rna30 76.00 27.80 1.00 913.68 773.47 gene11 rna0 91.00 42.80 131.00 77743.97 65814.04 gene1187 rna31 87.00 38.80 15.00 9819.71 8312.86 gene1190 rna32 86.00 37.80 23.00 15455.22 13083.59 gene1191 rna33 76.00 27.80 26.00 23755.69 20110.35 gene1192 rna34 77.00 28.80 15.50 13670.31 11572.58 gene12 rna1 94.00 45.80 14.00 7764.29 6572.85 gene1267 rna35 83.00 34.80 2.00 1459.79 1235.78 gene1342 rna36 76.00 27.80 0.00 0.00 0.00 gene1343 rna37 77.00 28.80 15.50 13670.31 11572.58 gene1346 rna38 375.00 326.80 1462.00 113633.01 96195.85 gene1565 rna39 76.00 27.80 0.00 0.00 0.00 gene1574 rna40 78.00 29.80 1.00 852.36 721.56 gene1591 rna41 75.00 26.80 4.00 3791.09 3209.34 gene1668 rna42 76.00 27.80 2.00 1827.36 1546.95 gene1728 rna43 87.00 38.80 0.00 0.00 0.00 gene1729 rna44 76.00 27.80 0.00 0.00 0.00 gene1919 rna45 76.00 27.80 5.00 4568.40 3867.37 gene1920 rna46 76.00 27.80 4.00 3654.72 3093.90 gene1921 rna47 77.00 28.80 8.00 7055.64 5972.94 gene2037 rna48 84.00 35.80 5.00 3547.53 3003.16 gene2066 rna49 77.00 28.80 2.00 1763.91 1493.24 gene2067 rna50 74.00 25.80 4.00 3938.03 3333.74 gene2150 rna51 74.00 25.80 4.00 3938.03 3333.74 gene2151 rna52 75.00 26.80 9.75 9242.32 7824.07 gene2475 rna53 77.00 28.80 0.00 0.00 0.00 gene254 rna5 90.00 41.80 39.00 23698.86 20062.24 gene255 rna6 76.00 27.80 2.00 1827.36 1546.95 gene2578 rna54 75.00 26.80 0.00 0.00 0.00 gene2579 rna55 75.00 26.80 3.25 3078.73 2606.30 gene2580 rna56 76.00 27.80 26.00 23755.69 20110.35 gene2581 rna57 77.00 28.80 34.50 30427.46 25758.32 gene2582 rna58 76.00 27.80 2.33 2131.92 1804.77 gene2757 rna59 76.00 27.80 1.00 913.68 773.47 gene2758 rna60 74.00 25.80 1.00 984.51 833.43 gene276 rna7 77.00 28.80 5.00 4409.78 3733.09 gene277 rna8 75.00 26.80 18.00 17059.92 14442.05 gene278 rna9 76.00 27.80 2.33 2131.92 1804.77 gene282 rna10 75.00 26.80 10.00 9477.73 8023.36 gene2845 rna61 106.00 57.80 32.00 14062.46 11904.55 gene2883 rna62 76.00 27.80 0.00 0.00 0.00 gene2886 rna63 76.00 27.80 2.33 2131.92 1804.77 gene2888 rna64 76.00 27.80 11.00 10050.49 8508.22 gene3032 rna65 94.00 45.80 0.00 0.00 0.00 gene3071 rna66 117.00 68.80 2.25 830.68 703.21 gene3072 rna67 2914.00 2865.80 10323.83 91434.25 77403.53 gene3073 rna68 76.00 27.80 8.50 7766.28 6574.54 gene3074 rna69 1535.00 1486.80 1740.57 29731.66 25169.29 gene445 rna11 1535.00 1486.80 1740.57 29731.66 25169.29 gene446 rna12 77.00 28.80 10.00 8819.56 7466.18 gene447 rna13 2912.00 2863.80 10252.38 90855.47 76913.56 gene448 rna14 117.00 68.80 2.25 830.68 703.21 gene52 rna2 77.00 28.80 1.00 881.96 746.62 gene625 rna15 1535.00 1486.80 639.69 10929.38 9252.25 gene626 rna16 2912.00 2863.80 8838.44 78469.04 66427.85 gene627 rna17 117.00 68.80 2.25 830.68 703.21 gene657 rna18 76.00 27.80 5.00 4568.40 3867.37 gene658 rna19 75.00 26.80 18.00 17059.92 14442.05 gene659 rna20 77.00 28.80 34.50 30427.46 25758.32 gene660 rna21 76.00 27.80 7.00 6395.76 5414.32 gene661 rna22 85.00 36.80 0.00 0.00 0.00 gene679 rna23 1535.00 1486.80 1766.17 30180.14 25548.95 gene680 rna24 76.00 27.80 8.50 7766.28 6574.54 gene681 rna25 2912.00 2863.80 9156.36 81280.79 68808.13 gene682 rna26 117.00 68.80 2.25 830.68 703.21 gene683 rna27 76.00 27.80 0.00 0.00 0.00 gene747 rna28 77.00 28.80 0.00 0.00 0.00 gene87 rna3 77.00 28.80 0.00 0.00 0.00 gene89 rna4 91.00 42.80 2.00 1186.93 1004.79 gene970 rna29 74.00 25.80 1.00 984.51 833.43

jmwhitha commented 7 years ago

There error in my assessment. It's not the lines with "exon" but the ones with "tRNA" that are added to the new gff3 file after rsem-prepare-reference. Unlike the lines with "exon" in the third column, these have the gene# and rna#.

Also, in case you wanted my command: rsem-prepare-reference --gff3 GCF_000184925.1_ASM18492v1_genomic.gff \ --bowtie2 \ GCF_000184925.1_ASM18492v1_genomic.fna \ ./ref/Clo1313

Thank you again!

jmwhitha commented 7 years ago

So, I just removed everything but "CDS" and "gene" lines. rsem-prepare-reference gave a no attribute error. Then, I removed "gene" too and I got the error "reference contains no transcripts". How can I de novo make a GFF3 or GTF2 file that has the gene region assignments in GCF_000184925.1_ASM18492v1_genomic.gff and also transcript assignments? I have lots of RNAseq data. What would you recommend?

bli25wisc commented 7 years ago

Hi @jmwhitha, I'm a bit confused about your descriptions. Can you first let me know which version of RSEM are you using? Then can you share with me the GFF3 file and what kind of features you want to extract? I recommend you send an email to RSEM Google Users Group.

Hope it helps, Bo

bli25wisc commented 7 years ago

@jmwhitha , if you only want to select a subset of transcripts, you can use the --RNA-patterns option.

jmwhitha commented 7 years ago

Clo1313.gtf.txt GCF_000184925.1_ASM18492v1_genomic_RSEMinput.gff.txt GCF_000184925.1_ASM18492v1_genomic.gff.txt

Cool pic!

I'm using RSEM version 1.2.31.

I've attached the original GFF3 file (had to change to .txt to upload) and one that I modified to get the scripts to work. You'll see that I added lines for "mRNA" and "exon" (third column) and dummy ids in the 9th column. This ends up producing lines like the following in the /ref/Clo1313.gtf file (also attached) after running rsem-prepare-reference, and the rsem-calculate-expression provides TPM values for all of my genes! So I think I'm in business. However, I'd appreciate conformation from you and have another important question below.

Line from /ref/Clo1313.gtf: NC_017304.1 RefSeq exon 1793 2893 . + . gene_id "gene1"; transcript_id "rna1"; gene_name "CLO1313_RS00015";

NEW QUESTION: Speaking of Google Users Group, I found the following messages at https://groups.google.com/forum/#!to...rs/GoWPrFzMHMI, which brought up a new question- Multiple genes reside on one transcript. Is the suggestion below to designate transcript=rna1 for gene1, transcript=rna2 for gene2... and transcript=rnaN for geneN? If so, what are the implications in terms of TPM?

"Dear Colin,

I was wondering if the gff and the gbk file format were integrated in the rsem-1.2.21. I am receiving the following error while running the rsem-prepare-reference scritpt in a bacterial genome.

rsem-extract-reference-transcripts nitro 0 nitroatt1.gff 0 my_bacteria.fna The reference contains no transcripts! "rsem-extract-reference-transcripts nitro 0 nitroatt1.gff 0 my_bacteria.fna" failed! Plase check if you provide correct parameters/options for the pipeline!

The original file was downloaded as gbk from IMG website and converted to gff with Geneious. The file looks like:

my_bacteria_1126 Geneious source 1 506 . + . Name=my_bacteria;organism="my_bacteria";mol_type="genomic DNA";db_xref=taxon:203693;NCBI Feature Key=source my_bacteria_1126 Geneious gene 244 504 . + . locus_tag=my_bacteria_11261 my_bacteria_1126 Geneious CDS 244 504 . + . Name=hypothetical protein CDS;locus_tag=my_bacteria_11261;product="hypothetical protein";translation=XXXXXXXXX;NCBI Feature Key=CDS

Thanks for any insight.

All the best, Lucas


Dear Lucas,

Currently, RSEM only accepts GTF formatted annotation files, which are admittedly more targeted towards eukaryotic gene annotations in that they require “transcript” and “exon” lines. You could either add those lines in (for each gene, simply add a transcript and exon line with identical start and end coordinates), making sure you adhere to the GTF standard: http://mblab.wustl.edu/GTF22.html

or you could simply extract the sequences of the genes in a multi-fasta file and pass that directly to RSEM, in lieu of a genome+annotation.

Best, Colin"

bli25wisc commented 7 years ago

Hi @jmwhitha, yes, I think your modified GFF3 file should work. For your other question, I'm not sure if I understand. How come multiple genes can reside in one transcript?

Bo

bli25wisc commented 7 years ago

Hi @jmwhitha, can you let me know where you obtained the original GFF3 file? I found that Ensembl bacteria provide GTF files in which each gene contain a gene line, a transcript line and an exon line. See http://bacteria.ensembl.org/info/website/ftp/index.html

Thanks, Bo