Bioconductor / txdbmaker

A set of tools for making TxDb objects from genomic annotations from various sources (e.g. UCSC, Ensembl, and GFF files)
2 stars 0 forks source link

support transcripts with >1 CDS? #3

Open jayoung opened 2 years ago

jayoung commented 2 years ago

hi there,

I am working on a viral genome, where the annotations (based on ribosome profiling) fairly often include >1 CDS per exon. Using makeTxDbFromGFF means that some of these CDSs get dropped with a warning The following transcripts have exons that contain more than one CDS (only the first CDS was kept for each exon). I'd like to be able to keep all the CDSs.

I'm not sure if this website is what determines 'official' gff3 specs, but it suggests that >1 CDS per exon should be allowed: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

Any change that makeTxDbFromGFF could support >1 CDS per exon? Not sure how tricky that would be. Example data below, from the website linked above.

Thanks!

Janet

Here's the example gff file from that page:

##gff-version 3
##sequence-region ctg123 1 1497228
ctg123  .   gene    1000    9000    .   +   .   ID=gene00001;Name=EDEN
ctg123  .   TF_binding_site 1000    1012    .   +   .   ID=tfbs00001;Parent=gene00001
ctg123  .   mRNA    1050    9000    .   +   .   ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123  .   mRNA    1050    9000    .   +   .   ID=mRNA00002;Parent=gene00001;Name=EDEN.2
ctg123  .   mRNA    1300    9000    .   +   .   ID=mRNA00003;Parent=gene00001;Name=EDEN.3
ctg123  .   exon    1300    1500    .   +   .   ID=exon00001;Parent=mRNA00003
ctg123  .   exon    1050    1500    .   +   .   ID=exon00002;Parent=mRNA00001,mRNA00002
ctg123  .   exon    3000    3902    .   +   .   ID=exon00003;Parent=mRNA00001,mRNA00003
ctg123  .   exon    5000    5500    .   +   .   ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123  .   exon    7000    9000    .   +   .   ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123  .   CDS 1201    1500    .   +   0   ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123  .   CDS 3000    3902    .   +   0   ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123  .   CDS 5000    5500    .   +   0   ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123  .   CDS 7000    7600    .   +   0   ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123  .   CDS 1201    1500    .   +   0   ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123  .   CDS 5000    5500    .   +   0   ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123  .   CDS 7000    7600    .   +   0   ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123  .   CDS 3301    3902    .   +   0   ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123  .   CDS 5000    5500    .   +   1   ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123  .   CDS 7000    7600    .   +   1   ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123  .   CDS 3391    3902    .   +   0   ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123  .   CDS 5000    5500    .   +   1   ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123  .   CDS 7000    7600    .   +   1   ID=cds00004;Parent=mRNA00003;Name=edenprotein.4

And here's the warning - we lose the edenprotein.4 annotation

library(txdbmaker)
example_txdb <- makeTxDbFromGFF(file="~/Desktop/example.gff3")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning message:
#     In .find_exon_cds(exons, cds) :
#     The following transcripts have exons that contain more than one CDS (only the
#     first CDS was kept for each exon): mRNA00003

cds(example_txdb, use.names=TRUE)
# GRanges object with 10 ranges and 1 metadata column:
#               seqnames    ranges strand |    cds_id
#                   <Rle> <IRanges>  <Rle> | <integer>
#  edenprotein.1   ctg123 1201-1500      + |         1
#  edenprotein.2   ctg123 1201-1500      + |         2
#  edenprotein.1   ctg123 3000-3902      + |         3
#  edenprotein.3   ctg123 3301-3902      + |         4
#  edenprotein.1   ctg123 5000-5500      + |         5
#  edenprotein.2   ctg123 5000-5500      + |         6
#  edenprotein.3   ctg123 5000-5500      + |         7
#  edenprotein.1   ctg123 7000-7600      + |         8
#  edenprotein.2   ctg123 7000-7600      + |         9
#  edenprotein.3   ctg123 7000-7600      + |        10
hpages commented 2 years ago

Hi Janet,

The error message is a little bit misleading: it should say that makeTxDbFromGFF() does not support coding-transcripts that encode more than one protein.

In the Canonical Gene example shown in the GFF3 specs, transcript EDEN.3 encodes 2 proteins (edenprotein.3 and edenprotein.4), via 2 CDSs (cds00003 and cds00004). makeTxDbFromGFF(), and the underlying db schema used by TxDb objects, cannot handle this. Note that AFAIK the 2 major annotation providers UCSC and Ensembl don't handle this either. For example transcript tables at UCSC like knownGene or ncbiRefSeq use the cdsStart and cdsEnd fields to store a single CDS start and end value per transcript. IIRC the MySQL db used at Ensembl does something similar.

When we started working on TxDb objects back in 2010, our primary goal was to support UCSC and Ensembl annotations, so we designed a db schema that assumes a one-to-one relationship between coding transcripts and CDSs/proteins. So when you import the Canonical Gene example, only the first CDS on transcript EDEN.3 is imported:

library(txdbmaker)

EDEN <- system.file("extdata", "GFF3_files", "TheCanonicalGene_v1.gff3", package="txdbmaker")

txdb <- makeTxDbFromGFF(EDEN)
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning message:
# In .find_exon_cds(exons, cds) :
#   The following transcripts have exons that contain more than one CDS
#  (only the first CDS was kept for each exon): mRNA00003

cdsBy(txdb, by="tx", use.names=TRUE)
# GRangesList object of length 3:
# $EDEN.1
# GRanges object with 4 ranges and 3 metadata columns:
#       seqnames    ranges strand |    cds_id      cds_name exon_rank
#          <Rle> <IRanges>  <Rle> | <integer>   <character> <integer>
#   [1]   ctg123 1201-1500      + |         1 edenprotein.1         1
#   [2]   ctg123 3000-3902      + |         3 edenprotein.1         2
#   [3]   ctg123 5000-5500      + |         5 edenprotein.1         3
#   [4]   ctg123 7000-7600      + |         8 edenprotein.1         4
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths
# 
# $EDEN.2
# GRanges object with 3 ranges and 3 metadata columns:
#       seqnames    ranges strand |    cds_id      cds_name exon_rank
#          <Rle> <IRanges>  <Rle> | <integer>   <character> <integer>
#   [1]   ctg123 1201-1500      + |         2 edenprotein.2         1
#   [2]   ctg123 5000-5500      + |         6 edenprotein.2         2
#   [3]   ctg123 7000-7600      + |         9 edenprotein.2         3
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths
# 
# $EDEN.3
# GRanges object with 3 ranges and 3 metadata columns:
#       seqnames    ranges strand |    cds_id      cds_name exon_rank
#          <Rle> <IRanges>  <Rle> | <integer>   <character> <integer>
#   [1]   ctg123 3301-3902      + |         4 edenprotein.3         2
#   [2]   ctg123 5000-5500      + |         7 edenprotein.3         3
#   [3]   ctg123 7000-7600      + |        10 edenprotein.3         4
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The one-to-one relationship between coding transcripts and CDSs/proteins is a very early assumption that is at the root of the design/behavior of many functions in GenomicFeatures/txdbmaker, so is something that would be very difficult to change, and also very disruptive.

I knew that the GFF3 specs did support a one-to-many relationship between coding transcripts and CDSs/proteins, and that's a feature that I found surprising when I ran into it, because I was not aware of any real-world situation where this would actually happen, and because AFAIK none of the hundreds of thousands of GFF3 files provided by UCSC and Ensembl seemed to use this capability. Also makeTxDbFromGFF() has been used on hundreds or thousands of GFF3 files from FlyBase, WormBase, and other annotations providers, for many years, and AFAIK nobody has reported seeing that. So congratulations for finding and reporting the first case of such file :wink:! Out of curiosity, may I know where is this file coming from, if that's something you're willing to share?

I don't know how UCSC or Ensembl handle this situation, or how frequently it is observed in biology. Maybe they just use 2 distinct transcript ids to represent such event? In your case it means that the GFF3 file would need to be modified but I don't know how hard that would be. If this is a viral genome, then there's no exon, and each CDS can be represented by a single line (no CDS parts), so the file should be much simpler than in the general situation where the gene/transcript/exon/cds hierarchy can be complex and hard to figure out. Maybe makeTxDbFromGFF() should be able to do something this like e.g. by importing 2 transcripts instead of one and assigning the corresponding protein IDs to them, or by adding the .1 and .2 suffixes to the original transcript ID? I'm open to suggestions.

H.

jayoung commented 2 years ago

hi Herve,

thanks - this is useful insight into the back end. I think a good ad hoc solution in my case will be to make duplicate coding transcripts to achieve that one-to-one relationship. If that can be done in makeTxDbFromGFF, great, if not I can probably handle it myself. I like the .1/.2 suffix idea.

I think over the next few years having >1 ORF per transcript will become more common. There's good evidence that a lot of additional ORFs exist: I think the jury is still out on how many of them have important function, but it is clear that at least some of them are important. This review article covers the biology nicely, and this preprint has more in-depth analysis on how important it may (or may not be) in the human genome. There is also discussion in that preprint about how to represent these additional ORFs in databases: I haven't looked to see where they're at with that, but it sounds like it's an active area at EBI.

Viruses can have introns! Not often, but they do sometimes. I'm working with HSV-1 sequences - the 'canonical' annotation has just a handful of introns. A more recent and comprehensive [annotation[(https://www.nature.com/articles/s41467-020-15992-5) based on deep genomics+proteomics has a lot more introns (and probably some noise).

The gffs I have been working with are frankly a mess, and what I see there is unlikely to be generalizable. The annotations I'm trying to use seem to be only available in Genbank format - here. NCBI's website has a way to export that to gff3 (menu options 'send to' - 'file' - 'complete record' - 'gff3'). I can already see that exported gff3 has inconsistencies: some of CDSs have mRNA parents, some have a gene as parent instead.

I've also played a bit with the genbankr package and its readGenBank() and makeTxDbFromGenBank() functions, but again I think the annotations are not organized in a helpful way - I get an error about duplicated transcript IDs from makeTxDbFromGenBank().

Probably best not to attempt to solve this particular case - I'm trying to keep my eyes on the big picture for this project (which is actually to use VEP with the exported gff3 files after I try to fix some of these inconsistencies).

thanks again,

Janet

jayoung commented 2 years ago

Regarding my own messed up gff3 file (I'm sure I will google my own question 5 years from now and want to know how I solved it) - turns out I am not the first person to have trouble with NCBI gff files. This post is old (and this), but I am seeing similar issues even now.

For the annotations I care about, I think I have managed to create a new gff3 file that makes much more sense, by extracting co-ordinates from an Excel file (horror!) of a published supplementary dataset.

Now that I have full control over how that gff3 gets constructed, I make sure there's one transcript per CDS, and I can now import using makeTxDbFromGFF just fine.

Back to my original suggestion, about supporting >1 CDS per transcript (or perhaps your thought about building in a way to duplicate transcripts, to allow for this without changing the schema): I do think this still could be useful in future, but is not something I need at the moment. Perhaps it makes sense to wait and see how Ensembl/UCSC end up representing the upstream ORFs described in that preprint.

thanks, Herve!