Large discrepancy seen when parsing Ensembl GTF vs. GFF3 files

mjsteinbaugh commented 3 years ago

TxDb generation currently differs quite significantly between Ensembl GTF and GFF3 files. Parsing the files directly with rtracklayer can return the same number of genes and transcripts in the Ensembl GFF3 file as seen in the GTF, which is correct.

suppressPackageStartupMessages({
    library(GenomicFeatures)
})
gtf <- paste0(
    "ftp://ftp.ensembl.org/pub/release-102/gtf/",
    "homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz"
)
gff3 <- paste0(
    "ftp://ftp.ensembl.org/pub/release-102/gff3/",
    "homo_sapiens/Homo_sapiens.GRCh38.102.gff3.gz"
)
suppressWarnings({
    suppressMessages({
        txdb_gtf <- makeTxDbFromGFF(file = gtf)
        txdb_gff3 <- makeTxDbFromGFF(file = gff3)
    })
})

print(txdb_gtf)
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # miRBase build ID: NA
## # Genome: NA
## # Nb of transcripts: 232024
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2021-01-25 08:31:27 -0500 (Mon, 25 Jan 2021)
## # GenomicFeatures version at creation time: 1.42.1
## # RSQLite version at creation time: 2.2.3
## # DBSCHEMAVERSION: 1.2

print(txdb_gff3)
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: ftp://ftp.ensembl.org/pub/release-102/gff3/homo_sapiens/Homo_sapiens.GRCh38.102.gff3.gz
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # miRBase build ID: NA
## # Genome: NA
## # Nb of transcripts: 230447
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2021-01-25 08:32:59 -0500 (Mon, 25 Jan 2021)
## # GenomicFeatures version at creation time: 1.42.1
## # RSQLite version at creation time: 2.2.3
## # DBSCHEMAVERSION: 1.2

length(genes(txdb_gtf))
## [1] 60675
## CORRECT.

suppressMessages({
    length(genes(txdb_gff3))
})
## [1] 35216
## INCORRECT.

length(transcripts(txdb_gtf))
## [1] 232024
## CORRECT.

length(transcripts(txdb_gff3))
## [1] 230447
## INCORRECT.

mjsteinbaugh commented 3 years ago

Here's a simple reprex using rtracklayer showing that Ensembl GFF3 files contain the correct number of genes and transcripts:

suppressPackageStartupMessages({
    library(rtracklayer)
})
file <- "ftp://ftp.ensembl.org/pub/release-102/gff3/homo_sapiens/Homo_sapiens.GRCh38.102.gff3.gz"
object <- import(file)

## Genes.
keep <- !is.na(mcols(object)[["gene_id"]])
sum(keep)
## [1] 60675
## CORRECT.
genes <- object[keep]

## Transcripts.
keep <- !is.na(mcols(object)[["transcript_id"]])
sum(keep)
## [1] 232024
## CORRECT.
transcripts <- object[keep]

hpages commented 3 years ago

Hi @mjsteinbaugh ,

Thanks for the report and sorry for the late response.

First let's clarify the nb of genes and transcripts contained in those files once for all. From the Unix shell:

In the GTF file:

cut -f 3 Homo_sapiens.GRCh38.102.gtf | grep -w gene | wc --lines
# 60675
cut -f 3 Homo_sapiens.GRCh38.102.gtf | grep -w transcript | wc --lines
# 232024

In the GFF3 file:

grep 'ID=gene:' Homo_sapiens.GRCh38.102.gff3 | wc --lines
# 60675
grep 'ID=transcript:' Homo_sapiens.GRCh38.102.gff3 | wc --lines
# 232024

So the good news is that the counts are the same in the GTF and GFF3 files.

The reason why makeTxDbFromGFF() was missing a lot of genes and transcripts is because Ensembl has apparently started to use new Sequence Ontology terms like ncRNA_gene, unconfirmed_transcript, V_gene_segment, D_gene_segment, etc... in the type column of their GFF3 files, and makeTxDbFromGFF() was not able to recognize these terms as "gene features" or "transcript features".

Note that the problem of recognizing these features is specific to GFF3 files (GTF files always use the terms gene or transcript which makes life much easier) and I don't have a good solution to deal with this at the moment.

Anyways, this is addressed in GenomicFeatures 1.43.5 (devel, see commit d7f5980fea7b74489afd8ef000e8adc486553950) and GenomicFeatures 1.42.2 (release).

However, one issue remains:

library(GenomicFeatures)
txdb_gff3 <- makeTxDbFromGFF("Homo_sapiens.GRCh38.102.gff3")

length(transcripts(txdb_gff3))
[1] 232024

## You must use single.strand.genes.only=FALSE to extract all genes.
## See ?genes
all_genes <- genes(txdb_gff3, single.strand.genes.only=FALSE)
length(all_genes)
[1] 59471

Some genes are still missing!

The reason for this is that makeTxDbFromGFF() imports the Name tag from the GFF3, rather than the ID or gene_id tag:

grep 'ID=gene:' Homo_sapiens.GRCh38.102.gff3 | cut -f 9 | cut -d ';' -f 1 | head
# ID=gene:ENSG00000223972
# ID=gene:ENSG00000227232
# ID=gene:ENSG00000278267
# ID=gene:ENSG00000243485
# ID=gene:ENSG00000284332
# ID=gene:ENSG00000237613
# ID=gene:ENSG00000268020
# ID=gene:ENSG00000240361
# ID=gene:ENSG00000186092
# ID=gene:ENSG00000238009

grep 'ID=gene:' Homo_sapiens.GRCh38.102.gff3 | cut -f 9 | cut -d ';' -f 5 | head
# gene_id=ENSG00000223972
# gene_id=ENSG00000227232
# gene_id=ENSG00000278267
# gene_id=ENSG00000243485
# gene_id=ENSG00000284332
# gene_id=ENSG00000237613
# gene_id=ENSG00000268020
# gene_id=ENSG00000240361
# gene_id=ENSG00000186092
# gene_id=ENSG00000238009

grep 'ID=gene:' Homo_sapiens.GRCh38.102.gff3 | cut -f 9 | cut -d ';' -f 2 | head
# Name=DDX11L1
# Name=WASH7P
# Name=MIR6859-1
# Name=MIR1302-2HG
# Name=MIR1302-2
# Name=FAM138A
# Name=OR4G4P
# Name=OR4G11P
# Name=OR4F5
# Name=AL627309.1

And the reason makeTxDbFromGFF() does this is that the ID tag is not guaranteed to be meaningful outside the file (in the case of this file it's a mangled version of a public ID, and that mangling is arbitrary and up to the file provider). OTOH the problem with the Name tag is that it's not guaranteed to be unique (in the case of this file it's not):

grep 'ID=gene:' Homo_sapiens.GRCh38.102.gff3 | cut -f 9 | cut -d ';' -f 2 | sort | uniq | wc --lines
# 59471

Still need to think about the best way to handle this.

Finally, note that you have 2 alternatives for importing gene models from Ensembl:

One is to use use makeTxDbFromEnsembl() to retrieve the data directly from the Ensembl MySQL server:
```
txdb <- makeTxDbFromEnsembl("Homo sapiens", release=102)
length(transcripts(txdb))
[1] 254853
```
Note that this returns slightly more genes and transcripts than when importing the corresponding GTF or GFF3 file. It seems like Ensembl is omitting some genes/transcripts when they generate their GTF/GFF3 files from their database but I don't know based on which criteria or why they're doing that.
You can use the ensembldb package.

Cheers, H.

hpages commented 3 years ago

Also, for completeness, I should mention 2 more alternatives for importing Ensembl gene models (in addition to the 2 alternatives I mentioned above):

Thru the BioMart service:

txdb <- makeTxDbFromBiomart(biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl")

Can be slow and unreliable, depending on the mood of the service.

Thru UCSC (only for Human genome hg19 or older, Mouse genome mm9 or older, and a few other genomes):
```
txdb <- makeTxDbFromUCSC(genome="hg19", tablename="ensGene")
```

See ?makeTxDbFromBiomart and ?makeTxDbFromUCSC for more information.

H.

mjsteinbaugh commented 3 years ago

Thanks for the really helpful information @hpages , much appreciated. This seems safe to close.

hpages commented 3 years ago

we still don't get the right nb of genes with the GFF3 file (working on it)

hpages commented 3 years ago

Done in GenomicFeatures 1.43.6 (commit c1e3fb92203e25a67bdb66a009f51671c6b4ab9e).

Bioconductor / GenomicFeatures

Large discrepancy seen when parsing Ensembl GTF vs. GFF3 files #28