Closed jmacdon closed 4 years ago
See also conversation on community slacks biochub channel https://community-bioc.slack.com/archives/CDSG30G66/p1590678484033900
I'm looking into this
Sorry for the delay.
The GTF files for Ensembl Human are tagged with the incorrect genome in AnnotationHub:
library(AnnotationHub)
ah <- AnnotationHub()
# snapshotDate(): 2020-05-23
query(ah, c("Ensembl", "100", "Homo sapiens", "GTF"))
# AnnotationHub with 4 records
## snapshotDate(): 2020-05-23
## $dataprovider: Ensembl
## $species: Homo sapiens
## $rdataclass: GRanges
## additional mcols(): taxonomyid, genome, description,
## coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## rdatapath, sourceurl, sourcetype
## retrieve records with, e.g., 'object[["AH80074"]]'
#
# title
# AH80074 | Homo_sapiens.GRCh38.100.abinitio.gtf
# AH80075 | Homo_sapiens.GRCh38.100.chr.gtf
# AH80076 | Homo_sapiens.GRCh38.100.chr_patch_hapl_scaff.gtf
# AH80077 | Homo_sapiens.GRCh38.100.gtf
ah["AH80077"]$genome
# [1] "GRCh38"
The Ensembl release 100 is based on GRCh38.p13, not on GRCh38.
GRCh38.p13 introduces a bunch of new sequences w.r.t. GRCh38:
length(Seqinfo(genome="GRCh38"))
# [1] 455
length(Seqinfo(genome="GRCh38.p13"))
# [1] 640
You can get more details about what kind of sequences are new in GRCh38.p13 with:
table(getChromInfoFromNCBI("GRCh38")$SequenceRole)
# assembled-molecule alt-scaffold unlocalized-scaffold
# 25 261 42
# unplaced-scaffold pseudo-scaffold fix-patch
# 127 0 0
# novel-patch
# 0
table(getChromInfoFromNCBI("GRCh38.p13")$SequenceRole)
# assembled-molecule alt-scaffold unlocalized-scaffold
# 25 261 42
# unplaced-scaffold pseudo-scaffold fix-patch
# 127 0 113
# novel-patch
# 72
113 "fix-patch" and 72 "novel-patch" sequences.
So I guess tagging with the right genome would address the issue?
Also I'm not quite sure why .tidyGRanges
is called on GTF resources but not on GFF3 resources (see .get1
methods for GTFFileResource and GFF3FileResource, respectively). Not calling .tidyGRanges
on GTF resources would also solve the problem ;-)
FWIW a few months ago I added an internal utility in GenomeInfoDb for fetching the species index file for a given Ensembl release:
> species_index <- GenomeInfoDb:::fetch_species_index_from_Ensembl_FTP()
> species_index[1:3, ]
name species division
1 Spiny chromis acanthochromis_polyacanthus EnsemblVertebrates
2 Eurasian sparrowhawk accipiter_nisus EnsemblVertebrates
3 Panda ailuropoda_melanoleuca EnsemblVertebrates
taxonomy_id assembly assembly_accession genebuild
1 80966 ASM210954v1 GCA_002109545.1 2018-05-Ensembl/2020-03
2 211598 Accipiter_nisus_ver1.0 GCA_004320145.1 2019-07-Ensembl/2019-09
3 9646 ailMel1 GCA_000004335.1 2010-01-Ensembl/2011-12
variation pan_compara peptide_compara genome_alignments other_alignments
1 N N Y Y Y
2 N N Y Y Y
3 N N Y Y Y
core_db species_id
1 acanthochromis_polyacanthus_core_100_1 1
2 accipiter_nisus_core_100_1 1
3 ailuropoda_melanoleuca_core_100_1 1
You can use this to lookup the assembly of a particular species, either by species or by taxonomy (by name is not reliable):
> species_index[species_index$species == "homo_sapiens", "assembly"]
[1] "GRCh38.p13"
> species_index[species_index$taxonomy_id == 9606, "assembly"]
[1] "GRCh38.p13"
GenomeInfoDb:::fetch_species_index_from_Ensembl_FTP()
takes 2 arguments: the Ensembl release number (the latest release is picked up by default) and the Ensembl division (other divisions are bacteria, fungi, metazoa, plants, and protists).
Not sure the best way to solve this -- Since this is done on the fly we didn't want to have to download or read the files so the metadata collected is based on the parse of the file name "ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz" which odesn't indicate the patched version
Maybe altering the code to use GenomeInfoDb:::fetch_species_index_from_Ensembl_FTP()
is the best option. Update the genome based on the value from the species index table if this is more accurate for all species?
Parsing the file name is definitely not reliable e.g. same problem with mouse (gives you GRCm38 instead of GRCm38.p6) and who knows how many other genomes.
The species index file gets updated at each release so seems more reliable. Also it's agnostic about the particular resource (GTF, GFF3, or other) which is another BIG advantage.
But ... I guess I'm curious as to why this affects older builds -- The resource @jmacdon was looking at was added in 2016 and for release-83. so why is it now failing?
> ah["AH50377"]
AnnotationHub with 1 record
# snapshotDate(): 2020-05-23
# names(): AH50377
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: GRanges
# $rdatadateadded: 2016-01-25
# $title: Homo_sapiens.GRCh38.83.gtf
# $description: Gene Annotation for Homo sapiens
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: GTF
# $sourceurl: ftp://ftp.ensembl.org/pub/release-83/gtf/homo_sapiens/Homo_sap...
# $sourcesize: 45686084
# $tags: c("GTF", "ensembl", "Gene", "Transcript", "Annotation")
# retrieve record with 'object[["AH50377"]]'
I'll work on implementing the change to get the genome from GenomeInfoDb:::fetch_species_index_from_Ensembl_FTP()
but I guess I'm curious as to why all the old resources are broken from releases prior to that we didn't see issues with before?
Because Seqinfo(genome="GRCh38")
now is supported (starting with BioC 3.11), as Jim pointed out.
Also since you are on it, you'd really want to take a look at the inconsistent treatment between GTF and GFF3 resources (the former uses .tidyGRanges
but not the latter). I can't think of any good reason why that would be.
Yeah. And actually tracing this code through for the resource listed --- it looks like the GTF on the fly actually use GRangesResource rather than GTFFileResource which I wasn't expecting.
Just as an update. I'm looking into using herve's suggested function instead, updating the code for future and using it to update existing references. Taking a bit of time as I want to make sure it is correct and I'm looking at how to do bulk updates to mysql database.
But work in progress...
So more issues @hpages --- the older releases like the one @jmacdon mentioned don't have those tables to reference so even if I update the entries in the database to the ones I can map with your function older ones will still not necessarily be correct -- @mtmorgan open to ideas but I don't think there is a current way to fix these older ones as there is no documented mapping? I can fix any release that has access to these files which seems to begin with release 96.
Previously if .tidyGRanges
couldn't get the data from GenomeInfoDb, it just fell back to getting the genome from the GRangesResource
. Is that no longer a reasonable thing to do?
But thats the original issue right -- now there is information in GenomeInfoDb for "GRCh38" so it finds something. Is there some additional processing that can be done to remove the NA information that could be added to .tidyGRanges?
So how were you planning on determining the Ensembl release for a given object? If I want to use AH50377, that's release 83. And if I try to get the species index it does
GenomeInfoDb:::fetch_species_index_from_Ensembl_FTP("83")
Error in download.file(url, destfile, quiet = TRUE) :
cannot open URL 'ftp://ftp.ensembl.org/pub/release-83/species_EnsemblVertebrates.txt'
In which case you could just then failover to using the code that infers the genome from the GRangesResource
instead.
The function would be used when initially loading the data into the hub so it would have the correct data going in instead of being corrected later on. So moving forward this shouldn't be an issue. The problem is trying to fix the old ones which seems like it will need to be done in the .tidyGRanges for the older releases of data. I'm in the process of fixing the database entries for 96-100 where the file is present and then will update the code appropriately when initializing entries in the database. I'm still open to ideas on how to fix the old ones without manually (guessing/manipulating) the few that we know of -- again is there some filtering or altering of the GRanges object to remote the NA's / extra scaffolds to have a valid object?
I have two thoughts. First, it seems like you are putting something extra into the .rda that's loaded into the hub that will provide the genome data, for any Ensembl release > 95, right? And that will be specific to Ensembl GTF or GFF files? So when the GTF is read in, if those extra data are there, they get put into the genome column for the seqinfo?
And that means by definition if you have an Ensembl GTF that's 95 or earlier, the extra data won't be there. In which case we could do what was done previously, which is to not try to match up with GenomeInfoDb, but instead just get the genome from the GTF and stick it in every row of the Seqinfo object, regardless of what scaffolds are there. We are getting all these NA values simply because we aren't accurately matching the patch levels, and it doesn't look like we will be able to do that for < 96 anyway.
Which brings up my second thought, which is why we are doing this checking in the first place. I can't come up with any reason why there might be a GTF from a reputable source that has genomic regions from different genome builds. Maybe that happens, but I can't see why, or how it would make any sense to do something like that. I would be really shocked if you could download a GTF from Ensembl or NCBI or UCSC that is based on more than one genome, so maybe it's sufficient to ensure that we know what genome it's based on and then blindly stuff that information into all the rows of the Seqinfo?
I've updated the current database for release 96-100 for genome and species if necessary.
The problem is the .tidyGRanges is used for other types of resources too. And there are two different levels here: there is making sure the right genome/species get put into the database in the first place and then there is the information that is returned after converting to a GRanges object and performing the .tidyGRanges function. This function was created and existed before I took over maintenance of the package so I'd have to dig deeper into its relevance.
Next steps:
GenomeInfoDb:::fetch_species_index_from_Ensembl_FTP
The NAs are introduced by this line (line 137, file AnnotationHub/R/utilities.R
):
newSeqinfo <- newSeqinfo[GenomeInfoDb::seqlevels(gr)]
For example:
newSeqinfo <- Seqinfo(c("chr1", "chr2"), c(150, 20), c(FALSE, FALSE), "GRCh38")
newSeqinfo
# Seqinfo object with 2 sequences from GRCh38 genome:
# seqnames seqlengths isCircular genome
# chr1 150 FALSE GRCh38
# chr2 20 FALSE GRCh38
newSeqinfo[c("chr1", "chr2", "patch")]
# Seqinfo object with 3 sequences from 2 genomes (GRCh38, NA):
# seqnames seqlengths isCircular genome
# chr1 150 FALSE GRCh38
# chr2 20 FALSE GRCh38
# patch NA NA <NA>
The fact that GenomeInfoDb::seqlevels(gr)
contains sequence names not in newSeqinfo
is an indication that the resource was tagged with the wrong genome. So before subsetting, check that all the seqlevels in gr
are in newSeqinfo
, and, if they are not, take the other route (the is.null(newSeqinfo)
route).
In other words, what I'm proposing is to treat the "wrong newSeqinfo
" situation like the "no newSeqinfo
at all" situation.
Hope this helps.
@hpages This check fails even with the newer annotations labelled with the "correct" genome:
"GRCh38.p13". My thought was to change 120 to if (is.null(newSeqinfo) || !all(GenomeInfoDb::seqlevels(gr) %in% seqlevels(newSeqinfo))) {
as that seems to be what you are recommending, but then even the correct genome as identified by using your table
> ah["AH80077"]$genome
[1] "GRCh38.p13"
returns FALSE for all(GenomeInfoDb::seqlevels(gr) %in% seqlevels(newSeqinfo))
Well, it looks like the Ensembl folks didn't like how the scaffolds are named in the official GRCh38.p13 assembly and decided to rename them by their GenBank accession:
library(rtracklayer)
gr <- import("Homo_sapiens.GRCh38.100.gtf")
seqlevels(gr)[1:40]
# [1] "1" "2" "3" "4" "5"
# [6] "6" "7" "8" "9" "10"
# [11] "11" "12" "13" "14" "15"
# [16] "16" "17" "18" "19" "20"
# [21] "21" "22" "X" "Y" "MT"
# [26] "GL000009.2" "GL000194.1" "GL000195.1" "GL000205.2" "GL000213.1"
# [31] "GL000216.2" "GL000218.1" "GL000219.1" "GL000220.1" "GL000225.1"
# [36] "KI270442.1" "KI270711.1" "KI270713.1" "KI270721.1" "KI270726.1"
newSeqinfo <- Seqinfo(genome="GRCh38.p13")
seqlevels(newSeqinfo)[1:40]
# [1] "1" "2" "3" "4"
# [5] "5" "6" "7" "8"
# [9] "9" "10" "11" "12"
# [13] "13" "14" "15" "16"
# [17] "17" "18" "19" "20"
# [21] "21" "22" "X" "Y"
# [25] "MT" "HSCHR1_1_CTG31" "HSCHR1_2_CTG31" "HSCHR1_3_CTG31"
# [29] "HSCHR1_1_CTG32_1" "HSCHR1_1_CTG11" "HSCHR1_2_CTG32_1" "HSCHR1_1_CTG3"
# [33] "HSCHR1_3_CTG32_1" "HSCHR1_4_CTG32_1" "HSCHR1_4_CTG31" "HSCHR1_2_CTG3"
# [37] "HSCHR2_1_CTG5" "HSCHR2_1_CTG7_2" "HSCHR2_2_CTG7_2" "HSCHR2_1_CTG15"
which is a bummer but what can we do about it?
The "treat the wrong newSeqinfo situation like the no newSeqinfo at all situation" approach still stands.
Not that it makes much difference in the end but at least now this GTF resource is tagged with the correct genome.
Okay. Seems like the one line change should be sufficient. I'll make the change and do a little more testing to be sure.
Thanks Lori.
Correction: when I said "at least now this GTF resource is tagged with the correct genome", it's still not entirely true. It would be more accurate to say that now we tag it the same way as the Ensembl folks do, which is an improvement.
But whether that tag is the correct genome is debatable and another story.
In
AnnotationHub:::.tidyGRanges
the genome of a GTF is set by either using data fromGenomeInfoDb
, or by inferring from theGRanges
itself. Previously for Ensembl GTF files, the latter is what happened, becauseGenomeInfoDb
didn't support GRCh38.But in the current version, GRCh38 is now supported but has different unplaced scaffolds than what we get from an Ensembl GTF, so when assigning the
Seqinfo
to theGRanges
we get a bunch ofNA
values for the mis-matching unplaced scaffolds.