Open jayoung opened 2 years ago
Hi Janet,
The error message is a little bit misleading: it should say that makeTxDbFromGFF()
does not support coding-transcripts that encode more than one protein.
In the Canonical Gene example shown in the GFF3 specs, transcript EDEN.3
encodes 2 proteins (edenprotein.3
and edenprotein.4
), via 2 CDSs (cds00003
and cds00004
). makeTxDbFromGFF()
, and the underlying db schema used by TxDb objects, cannot handle this. Note that AFAIK the 2 major annotation providers UCSC and Ensembl don't handle this either. For example transcript tables at UCSC like knownGene
or ncbiRefSeq
use the cdsStart
and cdsEnd
fields to store a single CDS start and end value per transcript. IIRC the MySQL db used at Ensembl does something similar.
When we started working on TxDb objects back in 2010, our primary goal was to support UCSC and Ensembl annotations, so we designed a db schema that assumes a one-to-one relationship between coding transcripts and CDSs/proteins. So when you import the Canonical Gene example, only the first CDS on transcript EDEN.3 is imported:
library(txdbmaker)
EDEN <- system.file("extdata", "GFF3_files", "TheCanonicalGene_v1.gff3", package="txdbmaker")
txdb <- makeTxDbFromGFF(EDEN)
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning message:
# In .find_exon_cds(exons, cds) :
# The following transcripts have exons that contain more than one CDS
# (only the first CDS was kept for each exon): mRNA00003
cdsBy(txdb, by="tx", use.names=TRUE)
# GRangesList object of length 3:
# $EDEN.1
# GRanges object with 4 ranges and 3 metadata columns:
# seqnames ranges strand | cds_id cds_name exon_rank
# <Rle> <IRanges> <Rle> | <integer> <character> <integer>
# [1] ctg123 1201-1500 + | 1 edenprotein.1 1
# [2] ctg123 3000-3902 + | 3 edenprotein.1 2
# [3] ctg123 5000-5500 + | 5 edenprotein.1 3
# [4] ctg123 7000-7600 + | 8 edenprotein.1 4
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
#
# $EDEN.2
# GRanges object with 3 ranges and 3 metadata columns:
# seqnames ranges strand | cds_id cds_name exon_rank
# <Rle> <IRanges> <Rle> | <integer> <character> <integer>
# [1] ctg123 1201-1500 + | 2 edenprotein.2 1
# [2] ctg123 5000-5500 + | 6 edenprotein.2 2
# [3] ctg123 7000-7600 + | 9 edenprotein.2 3
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
#
# $EDEN.3
# GRanges object with 3 ranges and 3 metadata columns:
# seqnames ranges strand | cds_id cds_name exon_rank
# <Rle> <IRanges> <Rle> | <integer> <character> <integer>
# [1] ctg123 3301-3902 + | 4 edenprotein.3 2
# [2] ctg123 5000-5500 + | 7 edenprotein.3 3
# [3] ctg123 7000-7600 + | 10 edenprotein.3 4
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
The one-to-one relationship between coding transcripts and CDSs/proteins is a very early assumption that is at the root of the design/behavior of many functions in GenomicFeatures/txdbmaker, so is something that would be very difficult to change, and also very disruptive.
I knew that the GFF3 specs did support a one-to-many relationship between coding transcripts and CDSs/proteins, and that's a feature that I found surprising when I ran into it, because I was not aware of any real-world situation where this would actually happen, and because AFAIK none of the hundreds of thousands of GFF3 files provided by UCSC and Ensembl seemed to use this capability. Also makeTxDbFromGFF()
has been used on hundreds or thousands of GFF3 files from FlyBase, WormBase, and other annotations providers, for many years, and AFAIK nobody has reported seeing that. So congratulations for finding and reporting the first case of such file :wink:! Out of curiosity, may I know where is this file coming from, if that's something you're willing to share?
I don't know how UCSC or Ensembl handle this situation, or how frequently it is observed in biology. Maybe they just use 2 distinct transcript ids to represent such event? In your case it means that the GFF3 file would need to be modified but I don't know how hard that would be. If this is a viral genome, then there's no exon, and each CDS can be represented by a single line (no CDS parts), so the file should be much simpler than in the general situation where the gene/transcript/exon/cds hierarchy can be complex and hard to figure out. Maybe makeTxDbFromGFF()
should be able to do something this like e.g. by importing 2 transcripts instead of one and assigning the corresponding protein IDs to them, or by adding the .1
and .2
suffixes to the original transcript ID? I'm open to suggestions.
H.
hi Herve,
thanks - this is useful insight into the back end. I think a good ad hoc solution in my case will be to make duplicate coding transcripts to achieve that one-to-one relationship. If that can be done in makeTxDbFromGFF
, great, if not I can probably handle it myself. I like the .1/.2
suffix idea.
I think over the next few years having >1 ORF per transcript will become more common. There's good evidence that a lot of additional ORFs exist: I think the jury is still out on how many of them have important function, but it is clear that at least some of them are important. This review article covers the biology nicely, and this preprint has more in-depth analysis on how important it may (or may not be) in the human genome. There is also discussion in that preprint about how to represent these additional ORFs in databases: I haven't looked to see where they're at with that, but it sounds like it's an active area at EBI.
Viruses can have introns! Not often, but they do sometimes. I'm working with HSV-1 sequences - the 'canonical' annotation has just a handful of introns. A more recent and comprehensive [annotation[(https://www.nature.com/articles/s41467-020-15992-5) based on deep genomics+proteomics has a lot more introns (and probably some noise).
The gffs I have been working with are frankly a mess, and what I see there is unlikely to be generalizable. The annotations I'm trying to use seem to be only available in Genbank format - here. NCBI's website has a way to export that to gff3 (menu options 'send to' - 'file' - 'complete record' - 'gff3'). I can already see that exported gff3 has inconsistencies: some of CDSs have mRNA parents, some have a gene as parent instead.
I've also played a bit with the genbankr
package and its readGenBank()
and makeTxDbFromGenBank()
functions, but again I think the annotations are not organized in a helpful way - I get an error about duplicated transcript IDs from makeTxDbFromGenBank()
.
Probably best not to attempt to solve this particular case - I'm trying to keep my eyes on the big picture for this project (which is actually to use VEP with the exported gff3 files after I try to fix some of these inconsistencies).
thanks again,
Janet
Regarding my own messed up gff3 file (I'm sure I will google my own question 5 years from now and want to know how I solved it) - turns out I am not the first person to have trouble with NCBI gff files. This post is old (and this), but I am seeing similar issues even now.
For the annotations I care about, I think I have managed to create a new gff3 file that makes much more sense, by extracting co-ordinates from an Excel file (horror!) of a published supplementary dataset.
Now that I have full control over how that gff3 gets constructed, I make sure there's one transcript per CDS, and I can now import using makeTxDbFromGFF
just fine.
Back to my original suggestion, about supporting >1 CDS per transcript (or perhaps your thought about building in a way to duplicate transcripts, to allow for this without changing the schema): I do think this still could be useful in future, but is not something I need at the moment. Perhaps it makes sense to wait and see how Ensembl/UCSC end up representing the upstream ORFs described in that preprint.
thanks, Herve!
hi there,
I am working on a viral genome, where the annotations (based on ribosome profiling) fairly often include >1 CDS per exon. Using
makeTxDbFromGFF
means that some of these CDSs get dropped with a warningThe following transcripts have exons that contain more than one CDS (only the first CDS was kept for each exon)
. I'd like to be able to keep all the CDSs.I'm not sure if this website is what determines 'official' gff3 specs, but it suggests that >1 CDS per exon should be allowed: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
Any change that makeTxDbFromGFF could support >1 CDS per exon? Not sure how tricky that would be. Example data below, from the website linked above.
Thanks!
Janet
Here's the example gff file from that page:
And here's the warning - we lose the edenprotein.4 annotation