Bioconductor / AnnotationHub

Client for the Bioconductor AnnotationHub web resource
15 stars 12 forks source link

Issue converting Ensembl GTFs #15

Closed jmacdon closed 4 years ago

jmacdon commented 4 years ago

In AnnotationHub:::.tidyGRanges the genome of a GTF is set by either using data from GenomeInfoDb, or by inferring from the GRanges itself. Previously for Ensembl GTF files, the latter is what happened, because GenomeInfoDb didn't support GRCh38.

> Seqinfo(genome = "GRCh38")
 Error in fetchSequenceInfo(genome) : genome "GRCh38" is not supported
> sessionInfo()
 R version 3.6.1 (2019-07-05)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Debian GNU/Linux 8 (jessie)

But in the current version, GRCh38 is now supported but has different unplaced scaffolds than what we get from an Ensembl GTF, so when assigning the Seqinfo to the GRanges we get a bunch of NA values for the mis-matching unplaced scaffolds.

lshep commented 4 years ago

See also conversation on community slacks biochub channel https://community-bioc.slack.com/archives/CDSG30G66/p1590678484033900

hpages commented 4 years ago

I'm looking into this

hpages commented 4 years ago

Sorry for the delay.

The GTF files for Ensembl Human are tagged with the incorrect genome in AnnotationHub:

library(AnnotationHub)
ah <- AnnotationHub()
# snapshotDate(): 2020-05-23

query(ah, c("Ensembl", "100", "Homo sapiens", "GTF"))
# AnnotationHub with 4 records
## snapshotDate(): 2020-05-23
## $dataprovider: Ensembl
## $species: Homo sapiens
## $rdataclass: GRanges
## additional mcols(): taxonomyid, genome, description,
##   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
##   rdatapath, sourceurl, sourcetype 
## retrieve records with, e.g., 'object[["AH80074"]]' 
#
#             title                                           
#   AH80074 | Homo_sapiens.GRCh38.100.abinitio.gtf            
#   AH80075 | Homo_sapiens.GRCh38.100.chr.gtf                 
#   AH80076 | Homo_sapiens.GRCh38.100.chr_patch_hapl_scaff.gtf
#   AH80077 | Homo_sapiens.GRCh38.100.gtf                     

ah["AH80077"]$genome
# [1] "GRCh38"

The Ensembl release 100 is based on GRCh38.p13, not on GRCh38.

GRCh38.p13 introduces a bunch of new sequences w.r.t. GRCh38:

length(Seqinfo(genome="GRCh38"))
# [1] 455
length(Seqinfo(genome="GRCh38.p13"))
# [1] 640

You can get more details about what kind of sequences are new in GRCh38.p13 with:

table(getChromInfoFromNCBI("GRCh38")$SequenceRole)

#   assembled-molecule         alt-scaffold unlocalized-scaffold 
#                   25                  261                   42 
#    unplaced-scaffold      pseudo-scaffold            fix-patch 
#                  127                    0                    0 
#          novel-patch 
#                    0 

table(getChromInfoFromNCBI("GRCh38.p13")$SequenceRole)

#   assembled-molecule         alt-scaffold unlocalized-scaffold 
#                   25                  261                   42 
#    unplaced-scaffold      pseudo-scaffold            fix-patch 
#                  127                    0                  113 
#          novel-patch 
#                   72 

113 "fix-patch" and 72 "novel-patch" sequences.

So I guess tagging with the right genome would address the issue?

Also I'm not quite sure why .tidyGRanges is called on GTF resources but not on GFF3 resources (see .get1 methods for GTFFileResource and GFF3FileResource, respectively). Not calling .tidyGRanges on GTF resources would also solve the problem ;-)

hpages commented 4 years ago

FWIW a few months ago I added an internal utility in GenomeInfoDb for fetching the species index file for a given Ensembl release:

> species_index <- GenomeInfoDb:::fetch_species_index_from_Ensembl_FTP()
> species_index[1:3, ]
                  name                     species           division
1        Spiny chromis acanthochromis_polyacanthus EnsemblVertebrates
2 Eurasian sparrowhawk             accipiter_nisus EnsemblVertebrates
3                Panda      ailuropoda_melanoleuca EnsemblVertebrates
  taxonomy_id               assembly assembly_accession               genebuild
1       80966            ASM210954v1    GCA_002109545.1 2018-05-Ensembl/2020-03
2      211598 Accipiter_nisus_ver1.0    GCA_004320145.1 2019-07-Ensembl/2019-09
3        9646                ailMel1    GCA_000004335.1 2010-01-Ensembl/2011-12
  variation pan_compara peptide_compara genome_alignments other_alignments
1         N           N               Y                 Y                Y
2         N           N               Y                 Y                Y
3         N           N               Y                 Y                Y
                                 core_db species_id
1 acanthochromis_polyacanthus_core_100_1          1
2             accipiter_nisus_core_100_1          1
3      ailuropoda_melanoleuca_core_100_1          1

You can use this to lookup the assembly of a particular species, either by species or by taxonomy (by name is not reliable):

> species_index[species_index$species == "homo_sapiens", "assembly"]
[1] "GRCh38.p13"
> species_index[species_index$taxonomy_id == 9606, "assembly"]
[1] "GRCh38.p13"

GenomeInfoDb:::fetch_species_index_from_Ensembl_FTP() takes 2 arguments: the Ensembl release number (the latest release is picked up by default) and the Ensembl division (other divisions are bacteria, fungi, metazoa, plants, and protists).

lshep commented 4 years ago

Not sure the best way to solve this -- Since this is done on the fly we didn't want to have to download or read the files so the metadata collected is based on the parse of the file name "ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz" which odesn't indicate the patched version

lshep commented 4 years ago

Maybe altering the code to use GenomeInfoDb:::fetch_species_index_from_Ensembl_FTP() is the best option. Update the genome based on the value from the species index table if this is more accurate for all species?

hpages commented 4 years ago

Parsing the file name is definitely not reliable e.g. same problem with mouse (gives you GRCm38 instead of GRCm38.p6) and who knows how many other genomes.

The species index file gets updated at each release so seems more reliable. Also it's agnostic about the particular resource (GTF, GFF3, or other) which is another BIG advantage.

lshep commented 4 years ago

But ... I guess I'm curious as to why this affects older builds -- The resource @jmacdon was looking at was added in 2016 and for release-83. so why is it now failing?

> ah["AH50377"]
AnnotationHub with 1 record
# snapshotDate(): 2020-05-23
# names(): AH50377
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: GRanges
# $rdatadateadded: 2016-01-25
# $title: Homo_sapiens.GRCh38.83.gtf
# $description: Gene Annotation for Homo sapiens
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: GTF
# $sourceurl: ftp://ftp.ensembl.org/pub/release-83/gtf/homo_sapiens/Homo_sap...
# $sourcesize: 45686084
# $tags: c("GTF", "ensembl", "Gene", "Transcript", "Annotation") 
# retrieve record with 'object[["AH50377"]]' 
lshep commented 4 years ago

I'll work on implementing the change to get the genome from GenomeInfoDb:::fetch_species_index_from_Ensembl_FTP() but I guess I'm curious as to why all the old resources are broken from releases prior to that we didn't see issues with before?

hpages commented 4 years ago

Because Seqinfo(genome="GRCh38") now is supported (starting with BioC 3.11), as Jim pointed out.

hpages commented 4 years ago

Also since you are on it, you'd really want to take a look at the inconsistent treatment between GTF and GFF3 resources (the former uses .tidyGRanges but not the latter). I can't think of any good reason why that would be.

lshep commented 4 years ago

Yeah. And actually tracing this code through for the resource listed --- it looks like the GTF on the fly actually use GRangesResource rather than GTFFileResource which I wasn't expecting.

lshep commented 4 years ago

Just as an update. I'm looking into using herve's suggested function instead, updating the code for future and using it to update existing references. Taking a bit of time as I want to make sure it is correct and I'm looking at how to do bulk updates to mysql database.
But work in progress...

lshep commented 4 years ago

So more issues @hpages --- the older releases like the one @jmacdon mentioned don't have those tables to reference so even if I update the entries in the database to the ones I can map with your function older ones will still not necessarily be correct -- @mtmorgan open to ideas but I don't think there is a current way to fix these older ones as there is no documented mapping? I can fix any release that has access to these files which seems to begin with release 96.

jmacdon commented 4 years ago

Previously if .tidyGRanges couldn't get the data from GenomeInfoDb, it just fell back to getting the genome from the GRangesResource. Is that no longer a reasonable thing to do?

lshep commented 4 years ago

But thats the original issue right -- now there is information in GenomeInfoDb for "GRCh38" so it finds something. Is there some additional processing that can be done to remove the NA information that could be added to .tidyGRanges?

jmacdon commented 4 years ago

So how were you planning on determining the Ensembl release for a given object? If I want to use AH50377, that's release 83. And if I try to get the species index it does

 GenomeInfoDb:::fetch_species_index_from_Ensembl_FTP("83") 
Error in download.file(url, destfile, quiet = TRUE) : 
  cannot open URL 'ftp://ftp.ensembl.org/pub/release-83/species_EnsemblVertebrates.txt'

In which case you could just then failover to using the code that infers the genome from the GRangesResource instead.

lshep commented 4 years ago

The function would be used when initially loading the data into the hub so it would have the correct data going in instead of being corrected later on. So moving forward this shouldn't be an issue. The problem is trying to fix the old ones which seems like it will need to be done in the .tidyGRanges for the older releases of data. I'm in the process of fixing the database entries for 96-100 where the file is present and then will update the code appropriately when initializing entries in the database. I'm still open to ideas on how to fix the old ones without manually (guessing/manipulating) the few that we know of -- again is there some filtering or altering of the GRanges object to remote the NA's / extra scaffolds to have a valid object?

jmacdon commented 4 years ago

I have two thoughts. First, it seems like you are putting something extra into the .rda that's loaded into the hub that will provide the genome data, for any Ensembl release > 95, right? And that will be specific to Ensembl GTF or GFF files? So when the GTF is read in, if those extra data are there, they get put into the genome column for the seqinfo?

And that means by definition if you have an Ensembl GTF that's 95 or earlier, the extra data won't be there. In which case we could do what was done previously, which is to not try to match up with GenomeInfoDb, but instead just get the genome from the GTF and stick it in every row of the Seqinfo object, regardless of what scaffolds are there. We are getting all these NA values simply because we aren't accurately matching the patch levels, and it doesn't look like we will be able to do that for < 96 anyway.

Which brings up my second thought, which is why we are doing this checking in the first place. I can't come up with any reason why there might be a GTF from a reputable source that has genomic regions from different genome builds. Maybe that happens, but I can't see why, or how it would make any sense to do something like that. I would be really shocked if you could download a GTF from Ensembl or NCBI or UCSC that is based on more than one genome, so maybe it's sufficient to ensure that we know what genome it's based on and then blindly stuff that information into all the rows of the Seqinfo?

lshep commented 4 years ago

I've updated the current database for release 96-100 for genome and species if necessary.

The problem is the .tidyGRanges is used for other types of resources too. And there are two different levels here: there is making sure the right genome/species get put into the database in the first place and then there is the information that is returned after converting to a GRanges object and performing the .tidyGRanges function. This function was created and existed before I took over maintenance of the package so I'd have to dig deeper into its relevance.

Next steps:

  1. Fix current code so all future versions have correct species and genome utilizing the GenomeInfoDb:::fetch_species_index_from_Ensembl_FTP
  2. Evaluate the tidyGRanges code to see what sort of fix could be implemented for the pre 96 releases --
hpages commented 4 years ago

The NAs are introduced by this line (line 137, file AnnotationHub/R/utilities.R):

newSeqinfo <- newSeqinfo[GenomeInfoDb::seqlevels(gr)]

For example:

newSeqinfo <- Seqinfo(c("chr1", "chr2"), c(150, 20), c(FALSE, FALSE), "GRCh38")
newSeqinfo
# Seqinfo object with 2 sequences from GRCh38 genome:
#   seqnames seqlengths isCircular genome
#   chr1            150      FALSE GRCh38
#   chr2             20      FALSE GRCh38

newSeqinfo[c("chr1", "chr2", "patch")]
# Seqinfo object with 3 sequences from 2 genomes (GRCh38, NA):
#   seqnames seqlengths isCircular genome
#   chr1            150      FALSE GRCh38
#   chr2             20      FALSE GRCh38
#   patch            NA         NA   <NA>

The fact that GenomeInfoDb::seqlevels(gr) contains sequence names not in newSeqinfo is an indication that the resource was tagged with the wrong genome. So before subsetting, check that all the seqlevels in gr are in newSeqinfo, and, if they are not, take the other route (the is.null(newSeqinfo) route).

In other words, what I'm proposing is to treat the "wrong newSeqinfo" situation like the "no newSeqinfo at all" situation.

Hope this helps.

lshep commented 4 years ago

@hpages This check fails even with the newer annotations labelled with the "correct" genome: "GRCh38.p13". My thought was to change 120 to if (is.null(newSeqinfo) || !all(GenomeInfoDb::seqlevels(gr) %in% seqlevels(newSeqinfo))) { as that seems to be what you are recommending, but then even the correct genome as identified by using your table

> ah["AH80077"]$genome
[1] "GRCh38.p13"

returns FALSE for all(GenomeInfoDb::seqlevels(gr) %in% seqlevels(newSeqinfo))

hpages commented 4 years ago

Well, it looks like the Ensembl folks didn't like how the scaffolds are named in the official GRCh38.p13 assembly and decided to rename them by their GenBank accession:

library(rtracklayer)
gr <- import("Homo_sapiens.GRCh38.100.gtf")
seqlevels(gr)[1:40]
#  [1] "1"          "2"          "3"          "4"          "5"         
#  [6] "6"          "7"          "8"          "9"          "10"        
# [11] "11"         "12"         "13"         "14"         "15"        
# [16] "16"         "17"         "18"         "19"         "20"        
# [21] "21"         "22"         "X"          "Y"          "MT"        
# [26] "GL000009.2" "GL000194.1" "GL000195.1" "GL000205.2" "GL000213.1"
# [31] "GL000216.2" "GL000218.1" "GL000219.1" "GL000220.1" "GL000225.1"
# [36] "KI270442.1" "KI270711.1" "KI270713.1" "KI270721.1" "KI270726.1"

newSeqinfo <- Seqinfo(genome="GRCh38.p13")
seqlevels(newSeqinfo)[1:40]
#  [1] "1"                "2"                "3"                "4"               
#  [5] "5"                "6"                "7"                "8"               
#  [9] "9"                "10"               "11"               "12"              
# [13] "13"               "14"               "15"               "16"              
# [17] "17"               "18"               "19"               "20"              
# [21] "21"               "22"               "X"                "Y"               
# [25] "MT"               "HSCHR1_1_CTG31"   "HSCHR1_2_CTG31"   "HSCHR1_3_CTG31"  
# [29] "HSCHR1_1_CTG32_1" "HSCHR1_1_CTG11"   "HSCHR1_2_CTG32_1" "HSCHR1_1_CTG3"   
# [33] "HSCHR1_3_CTG32_1" "HSCHR1_4_CTG32_1" "HSCHR1_4_CTG31"   "HSCHR1_2_CTG3"   
# [37] "HSCHR2_1_CTG5"    "HSCHR2_1_CTG7_2"  "HSCHR2_2_CTG7_2"  "HSCHR2_1_CTG15"  

which is a bummer but what can we do about it?

The "treat the wrong newSeqinfo situation like the no newSeqinfo at all situation" approach still stands.

Not that it makes much difference in the end but at least now this GTF resource is tagged with the correct genome.

lshep commented 4 years ago

Okay. Seems like the one line change should be sufficient. I'll make the change and do a little more testing to be sure.

hpages commented 4 years ago

Thanks Lori.

Correction: when I said "at least now this GTF resource is tagged with the correct genome", it's still not entirely true. It would be more accurate to say that now we tag it the same way as the Ensembl folks do, which is an improvement.

But whether that tag is the correct genome is debatable and another story.