Closed mjsteinbaugh closed 3 years ago
Here's a simple reprex using rtracklayer showing that Ensembl GFF3 files contain the correct number of genes and transcripts:
suppressPackageStartupMessages({
library(rtracklayer)
})
file <- "ftp://ftp.ensembl.org/pub/release-102/gff3/homo_sapiens/Homo_sapiens.GRCh38.102.gff3.gz"
object <- import(file)
## Genes.
keep <- !is.na(mcols(object)[["gene_id"]])
sum(keep)
## [1] 60675
## CORRECT.
genes <- object[keep]
## Transcripts.
keep <- !is.na(mcols(object)[["transcript_id"]])
sum(keep)
## [1] 232024
## CORRECT.
transcripts <- object[keep]
Hi @mjsteinbaugh ,
Thanks for the report and sorry for the late response.
First let's clarify the nb of genes and transcripts contained in those files once for all. From the Unix shell:
In the GTF file:
cut -f 3 Homo_sapiens.GRCh38.102.gtf | grep -w gene | wc --lines
# 60675
cut -f 3 Homo_sapiens.GRCh38.102.gtf | grep -w transcript | wc --lines
# 232024
In the GFF3 file:
grep 'ID=gene:' Homo_sapiens.GRCh38.102.gff3 | wc --lines
# 60675
grep 'ID=transcript:' Homo_sapiens.GRCh38.102.gff3 | wc --lines
# 232024
So the good news is that the counts are the same in the GTF and GFF3 files.
The reason why makeTxDbFromGFF()
was missing a lot of genes and transcripts is because Ensembl has apparently started to use new Sequence Ontology terms like ncRNA_gene
, unconfirmed_transcript
, V_gene_segment
, D_gene_segment
, etc... in the type
column of their GFF3 files, and makeTxDbFromGFF()
was not able to recognize these terms as "gene features" or "transcript features".
Note that the problem of recognizing these features is specific to GFF3 files (GTF files always use the terms gene
or transcript
which makes life much easier) and I don't have a good solution to deal with this at the moment.
Anyways, this is addressed in GenomicFeatures 1.43.5 (devel, see commit d7f5980fea7b74489afd8ef000e8adc486553950) and GenomicFeatures 1.42.2 (release).
However, one issue remains:
library(GenomicFeatures)
txdb_gff3 <- makeTxDbFromGFF("Homo_sapiens.GRCh38.102.gff3")
length(transcripts(txdb_gff3))
[1] 232024
## You must use single.strand.genes.only=FALSE to extract all genes.
## See ?genes
all_genes <- genes(txdb_gff3, single.strand.genes.only=FALSE)
length(all_genes)
[1] 59471
Some genes are still missing!
The reason for this is that makeTxDbFromGFF()
imports the Name
tag from the GFF3, rather than the ID
or gene_id
tag:
grep 'ID=gene:' Homo_sapiens.GRCh38.102.gff3 | cut -f 9 | cut -d ';' -f 1 | head
# ID=gene:ENSG00000223972
# ID=gene:ENSG00000227232
# ID=gene:ENSG00000278267
# ID=gene:ENSG00000243485
# ID=gene:ENSG00000284332
# ID=gene:ENSG00000237613
# ID=gene:ENSG00000268020
# ID=gene:ENSG00000240361
# ID=gene:ENSG00000186092
# ID=gene:ENSG00000238009
grep 'ID=gene:' Homo_sapiens.GRCh38.102.gff3 | cut -f 9 | cut -d ';' -f 5 | head
# gene_id=ENSG00000223972
# gene_id=ENSG00000227232
# gene_id=ENSG00000278267
# gene_id=ENSG00000243485
# gene_id=ENSG00000284332
# gene_id=ENSG00000237613
# gene_id=ENSG00000268020
# gene_id=ENSG00000240361
# gene_id=ENSG00000186092
# gene_id=ENSG00000238009
grep 'ID=gene:' Homo_sapiens.GRCh38.102.gff3 | cut -f 9 | cut -d ';' -f 2 | head
# Name=DDX11L1
# Name=WASH7P
# Name=MIR6859-1
# Name=MIR1302-2HG
# Name=MIR1302-2
# Name=FAM138A
# Name=OR4G4P
# Name=OR4G11P
# Name=OR4F5
# Name=AL627309.1
And the reason makeTxDbFromGFF()
does this is that the ID
tag is not guaranteed to be meaningful outside the file (in the case of this file it's a mangled version of a public ID, and that mangling is arbitrary and up to the file provider). OTOH the problem with the Name
tag is that it's not guaranteed to be unique (in the case of this file it's not):
grep 'ID=gene:' Homo_sapiens.GRCh38.102.gff3 | cut -f 9 | cut -d ';' -f 2 | sort | uniq | wc --lines
# 59471
Still need to think about the best way to handle this.
Finally, note that you have 2 alternatives for importing gene models from Ensembl:
One is to use use makeTxDbFromEnsembl()
to retrieve the data directly from the Ensembl MySQL server:
txdb <- makeTxDbFromEnsembl("Homo sapiens", release=102)
length(transcripts(txdb))
[1] 254853
Note that this returns slightly more genes and transcripts than when importing the corresponding GTF or GFF3 file. It seems like Ensembl is omitting some genes/transcripts when they generate their GTF/GFF3 files from their database but I don't know based on which criteria or why they're doing that.
You can use the ensembldb package.
Cheers, H.
Also, for completeness, I should mention 2 more alternatives for importing Ensembl gene models (in addition to the 2 alternatives I mentioned above):
Thru the BioMart service:
txdb <- makeTxDbFromBiomart(biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl")
Can be slow and unreliable, depending on the mood of the service.
Thru UCSC (only for Human genome hg19 or older, Mouse genome mm9 or older, and a few other genomes):
txdb <- makeTxDbFromUCSC(genome="hg19", tablename="ensGene")
See ?makeTxDbFromBiomart
and ?makeTxDbFromUCSC
for more information.
H.
Thanks for the really helpful information @hpages , much appreciated. This seems safe to close.
we still don't get the right nb of genes with the GFF3 file (working on it)
Done in GenomicFeatures 1.43.6 (commit c1e3fb92203e25a67bdb66a009f51671c6b4ab9e).
TxDb
generation currently differs quite significantly between Ensembl GTF and GFF3 files. Parsing the files directly with rtracklayer can return the same number of genes and transcripts in the Ensembl GFF3 file as seen in the GTF, which is correct.