Bioconductor / GenomicFeatures

Query the gene models of a given organism/assembly
https://bioconductor.org/packages/GenomicFeatures
26 stars 12 forks source link

fiveUTRsByTranscript() and threeUTRsByTranscript() give incorrect UTRs #59

Closed a92932000 closed 1 year ago

a92932000 commented 1 year ago

If exon feature contains ID attribute in gff, it will make incorrect UTR annotations.

# gff data is downloaded from https://solgenomics.net/ftp/tomato_genome/annotation/ITAG4.1_release/
>gff="ITAG4.1_gene_models.gff" 
>gff.gr<-import(gff)

>gtf="ITAG4.1_gene_models.gtf" # covert gff into gtf by gffread -T 
>gtf.gr<-import(gtf)

>table(gff.gr$type)
           gene            mRNA            exon             CDS three_prime_UTR  five_prime_UTR 
          34688           34688          164368          156076           21858           23098

>table(gtf.gr$type)
     transcript       exon        CDS 
          34688     164368     156076

>txdb.gff<-makeTxDbFromGRanges(gr = gff.gr)
>UTR5.gff.gr <- fiveUTRsByTranscript(txdb.gff)
>UTR3.gff.gr <- threeUTRsByTranscript(txdb.gff)

>txdb.gtf<-makeTxDbFromGRanges(gr = gtf.gr)
>UTR5.gtf.gr <- fiveUTRsByTranscript(txdb.gtf)
>UTR3.gtf.gr <- threeUTRsByTranscript(txdb.gtf)

>sapply(list("three_prime_UTR"=UTR3.gff.gr,"five_prime_UTR"=UTR5.gff.gr),function(x) sum(elementNROWS(x)))
three_prime_UTR  five_prime_UTR 
          15127           14800
>sapply(list("three_prime_UTR"=UTR3.gtf.gr,"five_prime_UTR"=UTR5.gtf.gr),function(x) sum(elementNROWS(x)))
three_prime_UTR  five_prime_UTR
          21859           23099

The number of UTRs is largely different between the two txdb objects, so I try to remove ID value of exon in GRanges. And it gives the same number as txdb from gtf file.

>gff.noID.gr <- copy(gff.gr)
>gff.noID.gr[gff.noID.gr$type %in% c("exon")]$ID<-NA
>txdb.noID.gff    <- makeTxDbFromGRanges(gr = gff.noID.gr)
>UTR5.gff.noID.gr <- fiveUTRsByTranscript(txdb.noID.gff)
>UTR3.gff.noID.gr <- threeUTRsByTranscript(txdb.noID.gff)
>sapply(list("three_prime_UTR"=UTR3.gff.noID.gr,"five_prime_UTR"=UTR5.gff.noID.gr),function(x) sum(elementNROWS(x)))
three_prime_UTR  five_prime_UTR 
          21859           23099

Example: The "Solyc00g007330.1.1" information in gff file:

SL4.0ch00   maker_ITAG  gene    2379604 2380807 .   -   .   ID=gene:Solyc00g007330.1;Alias=Solyc00g007330;Name=Solyc00g007330.1;length=1203
SL4.0ch00   maker_ITAG  mRNA    2379604 2380807 .   -   .   ID=mRNA:Solyc00g007330.1.1;Parent=gene:Solyc00g007330.1;Name=Solyc00g007330.1.1;Note=Zinc finger transcription factor 1;_AED=1.00;_QI=373|0|0|0|0|0|2|0|171;_eAED=1.00
SL4.0ch00   maker_ITAG  CDS 2379604 2380119 .   -   0   ID=CDS:Solyc00g007330.1.1.1;Parent=mRNA:Solyc00g007330.1.1
SL4.0ch00   maker_ITAG  exon    2379604 2380324 .   -   .   ID=exon:Solyc00g007330.1.1.1;Parent=mRNA:Solyc00g007330.1.1
SL4.0ch00   maker_ITAG  five_prime_UTR  2380120 2380324 .   -   .   ID=five_prime_UTR:Solyc00g007330.1.1.0;Parent=mRNA:Solyc00g007330.1.1
SL4.0ch00   maker_ITAG  exon    2380640 2380807 .   -   .   ID=exon:Solyc00g007330.1.1.2;Parent=mRNA:Solyc00g007330.1.1
SL4.0ch00   maker_ITAG  five_prime_UTR  2380640 2380807 .   -   .   ID=five_prime_UTR:Solyc00g007330.1.1.1;Parent=mRNA:Solyc00g007330.1.1

The transcript has five_prime_UTR only, but txdb of gff file gives incorrect results

> UTR5.gff.gr["Solyc00g007330.1.1"]
GRangesList object of length 1:
$Solyc00g007330.1.1
GRanges object with 1 range and 3 metadata columns:
       seqnames          ranges strand |   exon_id            exon_name exon_rank
          <Rle>       <IRanges>  <Rle> | <integer>          <character> <integer>
  [1] SL4.0ch00 2380120-2380324      - |       815 Solyc00g007330.1.1.1         1
  -------
  seqinfo: 13 sequences from an unspecified genome; no seqlengths
> UTR3.gff.gr["Solyc00g007330.1.1"]
GRanges object with 1 range and 3 metadata columns:
       seqnames          ranges strand |   exon_id            exon_name exon_rank
          <Rle>       <IRanges>  <Rle> | <integer>          <character> <integer>
  [1] SL4.0ch00 2380640-2380807      - |       816 Solyc00g007330.1.1.2         2
  -------
  seqinfo: 13 sequences from an unspecified genome; no seqlengths
hpages commented 1 year ago

Yep, fiveUTRsByTranscript() and threeUTRsByTranscript() seem broken :disappointed: Thanks for the report. Working on a fix...

hpages commented 1 year ago

Fixed in GenomicFeatures 1.52.1 (release) and 1.53.1 (devel). See commit b0ac1936c2323c42131856122555175e132bf6fa

At the root of the problem is that makeTxDbFromGRanges() did not infer the exon ranks correctly for the exons in this GFF3 file that are located on the minus strand. For example, for transcript Solyc00g007330.1.1 (2 exons), exon with ID exon:Solyc00g007330.1.1.1 was considered to be 1st in the transcript (i.e. rank 1), and exon with ID exon:Solyc00g007330.1.1.2 was considered to be 2nd in the transcript (i.e. rank 2). This is because makeTxDbFromGRanges() was inferring their rank based on the ID's suffix (.1 and .2). AFAIK this suffix usually reflects the rank, but not in the case of this GFF3 file where the ranks are 2 and 1, respectively.

Once the exons are imported in the TxDb object with incorrect ranks, a lot of operations on the TxDb object produce garbbage, not just fiveUTRsByTranscript() or threeUTRsByTranscript(). For example things like exonsBy(., by="tx") or extractTranscriptSeqs() will also produce incorrect results, so it's super important to get the exon ranks right.

Finally note that fiveUTRsByTranscript() and threeUTRsByTranscript() ignore the five_prime_UTR and three_prime_UTR features reported in the GFF3 file. Instead they infer the 5'UTRs and 3'UTRs from the exons and associated CDS. This is because makeTxDbFromGRanges() does not import the five_prime_UTR or three_prime_UTR features in the TxDb object (these are not always available). Only the transcript, exon, and CDS features get imported.

GenomicFeatures 1.52.1 and 1.53.1 should become available via BiocManager::install() in the next couple of days.

Best, H.

hpages commented 1 year ago

I'll close this now. Feel free to reopen if you still have issues with this.