Closed a92932000 closed 1 year ago
Yep, fiveUTRsByTranscript() and threeUTRsByTranscript() seem broken :disappointed: Thanks for the report. Working on a fix...
Fixed in GenomicFeatures 1.52.1 (release) and 1.53.1 (devel). See commit b0ac1936c2323c42131856122555175e132bf6fa
At the root of the problem is that makeTxDbFromGRanges() did not infer the exon ranks correctly for the exons in this GFF3 file that are located on the minus strand. For example, for transcript Solyc00g007330.1.1 (2 exons), the exon with ID exon:Solyc00g007330.1.1.1 was considered to be 1st in the transcript (i.e. rank 1), and the exon with ID exon:Solyc00g007330.1.1.2 was considered to be 2nd in the transcript (i.e. rank 2). This is because makeTxDbFromGRanges() was inferring their rank based on the ID's suffix (.1 and .2). AFAIK this suffix usually reflects the rank, but not in the case of this GFF3 file, where the ranks are 2 and 1, respectively.
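To make the mismatch concrete, here is a minimal base R sketch. It uses the two exon IDs above, but the coordinates are invented for illustration, and rank_from_suffix() and rank_from_position() are hypothetical helpers, not GenomicFeatures functions and not the code from the fix commit. It only contrasts a rank read off the ID suffix with a rank derived from strand and position.

```r
# Two exons of a minus-strand transcript. IDs are from the report;
# start/end coordinates are made up for this sketch.
exons <- data.frame(
  id     = c("exon:Solyc00g007330.1.1.1", "exon:Solyc00g007330.1.1.2"),
  start  = c(1000L, 3000L),
  end    = c(1500L, 3400L),
  strand = c("-", "-"),
  stringsAsFactors = FALSE
)

# Rank taken from the numeric suffix of the ID (the assumption that failed here).
rank_from_suffix <- function(id) {
  as.integer(sub(".*\\.", "", id))
}

# Rank taken from genomic position: ascending start on "+", descending on "-".
rank_from_position <- function(start, strand) {
  stopifnot(length(unique(strand)) == 1L)
  if (strand[1] == "-") {
    as.integer(rank(-start))
  } else {
    as.integer(rank(start))
  }
}

exons$suffix_rank   <- rank_from_suffix(exons$id)
exons$position_rank <- rank_from_position(exons$start, exons$strand)
print(exons[, c("id", "suffix_rank", "position_rank")])
```

With these made-up coordinates the suffix gives ranks 1 and 2, while the position on the minus strand gives 2 and 1, which is the situation described above.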
Once the exons are imported in the TxDb object with incorrect ranks, a lot of operations on the TxDb object produce garbage, not just fiveUTRsByTranscript() or threeUTRsByTranscript(). For example, things like exonsBy(., by="tx") or extractTranscriptSeqs() will also produce incorrect results, so it's super important to get the exon ranks right.
Finally, note that fiveUTRsByTranscript() and threeUTRsByTranscript() ignore the five_prime_UTR and three_prime_UTR features reported in the GFF3 file. Instead they infer the 5'UTRs and 3'UTRs from the exons and associated CDS. This is because makeTxDbFromGRanges() does not import the five_prime_UTR or three_prime_UTR features in the TxDb object (these are not always available). Only the transcript, exon, and CDS features get imported.
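As a rough illustration of what "inferring the UTRs from the exons and associated CDS" means, here is a simplified base R sketch for a single transcript. infer_utrs() is a hypothetical helper written for this comment, not the GenomicFeatures implementation, and the coordinates are invented. It treats the exon pieces on one side of the coding region as 5'UTR and those on the other side as 3'UTR, with the sides swapped on the minus strand.

```r
# Simplified UTR inference for ONE transcript (hypothetical helper).
# Inputs are integer vectors of 1-based, inclusive coordinates.
infer_utrs <- function(exon_start, exon_end, cds_start, cds_end, strand) {
  lo <- min(cds_start)  # leftmost coding base
  hi <- max(cds_end)    # rightmost coding base

  # Exon pieces located to the left of the coding region.
  left_keep <- exon_start < lo
  left <- data.frame(
    start = exon_start[left_keep],
    end   = pmin(exon_end[left_keep], lo - 1L)
  )

  # Exon pieces located to the right of the coding region.
  right_keep <- exon_end > hi
  right <- data.frame(
    start = pmax(exon_start[right_keep], hi + 1L),
    end   = exon_end[right_keep]
  )

  # On the minus strand the 5' end is the high-coordinate side.
  if (strand == "-") {
    list(five = right, three = left)
  } else {
    list(five = left, three = right)
  }
}

# Made-up minus-strand transcript: 2 exons, CDS ends before the end of the
# high-coordinate exon, so only a 5'UTR is expected.
utr <- infer_utrs(
  exon_start = c(1000L, 3000L), exon_end = c(1500L, 3400L),
  cds_start  = c(1000L, 3000L), cds_end  = c(1500L, 3200L),
  strand     = "-"
)
print(utr$five)         # one range: 3201-3400
print(nrow(utr$three))  # 0
```

Note that a per-exon approach like this does not depend on the exon order, but anything that walks the exons in transcript order (e.g. to build the transcript sequence) does, which is why wrong ranks are so damaging.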
GenomicFeatures 1.52.1 and 1.53.1 should become available via BiocManager::install() in the next couple of days.
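If you want to check whether an installed version already has the fix, something like the following base R sketch should do. has_exon_rank_fix() is a hypothetical helper, and it assumes that every version from 1.53.1 onwards (including later release series) carries the fix forward. Note that 1.53.0 is numerically greater than 1.52.1 but predates the fix, so a plain ">= 1.52.1" test is not enough.

```r
# Hypothetical helper: TRUE if a GenomicFeatures version includes the fix.
has_exon_rank_fix <- function(version) {
  v <- package_version(as.character(version))
  # Release branch: 1.52.1 or later within the 1.52 series.
  in_release <- v >= "1.52.1" && v < "1.53.0"
  # Devel branch, and (by assumption) everything after it.
  in_devel <- v >= "1.53.1"
  in_release || in_devel
}

checks <- c(
  "1.52.0" = has_exon_rank_fix("1.52.0"),
  "1.52.1" = has_exon_rank_fix("1.52.1"),
  "1.53.0" = has_exon_rank_fix("1.53.0"),
  "1.53.1" = has_exon_rank_fix("1.53.1")
)
print(checks)

# With the package installed, the check would be:
# has_exon_rank_fix(packageVersion("GenomicFeatures"))
# and if it returns FALSE, update with:
# BiocManager::install("GenomicFeatures")
```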
Best, H.
I'll close this now. Feel free to reopen if you still have issues with this.
If the exon features in the GFF file carry an ID attribute, the resulting TxDb gives incorrect UTR annotations.
The number of UTRs differs greatly between the two TxDb objects (one built from the GFF file, one from the GTF file), so I tried removing the ID values of the exons in the GRanges. After that, it gives the same number as the TxDb built from the GTF file.
Example: the "Solyc00g007330.1.1" information in the GFF file:
The transcript has a five_prime_UTR only, but the TxDb built from the GFF file gives incorrect results.