Bioconductor / AnnotationHub

Client for the Bioconductor AnnotationHub web resource

A wrong match for "AH116340" #56

Closed bioinf-kud closed 2 months ago

bioinf-kud commented 2 months ago

The AnnotationHub ID "AH116340" resolves to the wrong resource. According to its description, "AH116340" is an EnsDb for Mus musculus based on Ensembl version 111, but when I loaded it and inspected it, it turned out to be an EnsDb for Homo sapiens based on Ensembl version 105. Log:

> select
AnnotationHub with 13 records
# snapshotDate(): 2023-10-23
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH116325"]]'

           title
AH116325 | Ensembl 111 EnsDb for Mus musculus
AH116326 | Ensembl 111 EnsDb for Mus musculus
AH116327 | Ensembl 111 EnsDb for Mus musculus
AH116328 | Ensembl 111 EnsDb for Mus musculus
AH116329 | Ensembl 111 EnsDb for Mus musculus
...        ...
AH116334 | Ensembl 111 EnsDb for Mus musculus
AH116335 | Ensembl 111 EnsDb for Mus musculus
AH116336 | Ensembl 111 EnsDb for Mus musculus
AH116337 | Ensembl 111 EnsDb for Mus musculus
AH116340 | Ensembl 111 EnsDb for Mus musculus

> select$description
 [1] "Gene and protein annotations for Mus musculus based on Ensembl version 111."
 [2] "Gene and protein annotations for Mus musculus based on Ensembl version 111."
 [3] "Gene and protein annotations for Mus musculus based on Ensembl version 111."
 [4] "Gene and protein annotations for Mus musculus based on Ensembl version 111."
 [5] "Gene and protein annotations for Mus musculus based on Ensembl version 111."
 [6] "Gene and protein annotations for Mus musculus based on Ensembl version 111."
 [7] "Gene and protein annotations for Mus musculus based on Ensembl version 111."
 [8] "Gene and protein annotations for Mus musculus based on Ensembl version 111."
 [9] "Gene and protein annotations for Mus musculus based on Ensembl version 111."
[10] "Gene and protein annotations for Mus musculus based on Ensembl version 111."
[11] "Gene and protein annotations for Mus musculus based on Ensembl version 111."
[12] "Gene and protein annotations for Mus musculus based on Ensembl version 111."
[13] "Gene and protein annotations for Mus musculus based on Ensembl version 111."

> select$species
 [1] "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus"
 [7] "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus"
[13] "Mus musculus"

> select$genome
 [1] "129S1_SvImJ_v1" "A_J_v1" "AKR_J_v1" "BALB_cJ_v1" "C3H_HeJ_v1"
 [6] "C57BL_6NJ_v1" "CBA_J_v1" "DBA_2J_v1" "FVB_NJ_v1" "LP_J_v1"
[11] "NOD_ShiLtJ_v1" "NZO_HlLtJ_v1" "GRCm39"

> edb <- ah[["AH116340"]]
loading from cache

> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.7
|Creation time: Sat Dec 18 14:48:15 2021
|ensembl_version: 105
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
| No. of genes: 69329.
| No. of transcripts: 268255.
|Protein data available.

bioinf-kud commented 2 months ago

It looks like AH116340 is linked to AH98047:

> select <- ah[ah$description == "Gene and protein annotations for Homo sapiens based on Ensembl version 105.", ]
> select
AnnotationHub with 1 record
# snapshotDate(): 2023-10-23
# names(): AH98047
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2021-10-20
# $title: Ensembl 105 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensembl version 105.
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("105", "Annotation", "AnnotationHubSoftware", "Coverage",
#   "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#   "Transcript")
# retrieve record with 'object[["AH98047"]]'

> edb <- ah[["AH98047"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache

> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.7
|Creation time: Sat Dec 18 14:48:15 2021
|ensembl_version: 105
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
| No. of genes: 69329.
| No. of transcripts: 268255.
|Protein data available.

lshep commented 2 months ago

@jorainer

jorainer commented 2 months ago

Hm, I cannot reproduce this (at least with the current devel versions):

> library(AnnotationHub)
> ah <- AnnotationHub()
  |======================================================================| 100%

snapshotDate(): 2024-08-01
> query(ah, "EnsDb.Hsapiens.v111")
AnnotationHub with 1 record
# snapshotDate(): 2024-08-01
# names(): AH116291
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2023-10-23
# $title: Ensembl 111 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("111", "Annotation", "AnnotationHubSoftware", "Coverage",
#   "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#   "Transcript") 
# retrieve record with 'object[["AH116291"]]' 
> edb <- ah[["AH116291"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
require("ensembldb")
> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Tue Jan 16 10:37:47 2024
|ensembl_version: 111
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 72035.
| No. of transcripts: 278721.
|Protein data available.

So that one clearly has Homo sapiens both in AnnotationHub's metadata and within the database.

Now, retrieving the second:

> query(ah, "EnsDb.Mmusculus.v111")
AnnotationHub with 1 record
# snapshotDate(): 2024-08-01
# names(): AH116340
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: EnsDb
# $rdatadateadded: 2023-10-23
# $title: Ensembl 111 EnsDb for Mus musculus
# $description: Gene and protein annotations for Mus musculus based on Ensem...
# $taxonomyid: 10090
# $genome: GRCm39
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("111", "Annotation", "AnnotationHubSoftware", "Coverage",
#   "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#   "Transcript") 
# retrieve record with 'object[["AH116340"]]' 
> edb2 <- ah[["AH116340"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
> edb2
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Tue Jan 16 12:35:18 2024
|ensembl_version: 111
|ensembl_host: localhost
|Organism: Mus musculus
|taxonomy_id: 10090
|genome_build: GRCm39
|DBSCHEMAVERSION: 2.2
|common_name: mouse
|species: mus_musculus
| No. of genes: 57180.
| No. of transcripts: 149132.
|Protein data available.

For this one the information matches as well. Could it be an issue with the local cache?
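
A mismatch like the one reported can be checked by comparing the species in the hub metadata with the organism stored inside the downloaded database. The following is only a minimal sketch, not something AnnotationHub provides: `species_matches()` and `check_record()` are hypothetical helper names, and it assumes `BiocGenerics::organism()` returns the organism of an EnsDb once ensembldb has been loaded.

```r
# Hypothetical helper: TRUE when the hub metadata and the database agree
# on the species (case and surrounding whitespace are ignored).
species_matches <- function(hub_species, db_organism) {
  identical(tolower(trimws(as.character(hub_species))),
            tolower(trimws(as.character(db_organism))))
}

# Hypothetical helper: load a record and compare its metadata with its content.
# Assumes AnnotationHub and ensembldb are installed and the hub is reachable.
check_record <- function(id, hub = AnnotationHub::AnnotationHub()) {
  hub_species <- hub[id]$species              # species according to the hub metadata
  edb <- hub[[id]]                            # loads from the local cache if present
  db_organism <- BiocGenerics::organism(edb)  # organism stored in the database (assumed accessor)
  ok <- species_matches(hub_species, db_organism)
  if (!ok) {
    warning(sprintf("%s: metadata says '%s' but the database says '%s'",
                    id, hub_species, db_organism))
  }
  invisible(ok)
}

# The two combinations seen in this thread:
species_matches("Mus musculus", "Mus musculus")  # consistent record
species_matches("Mus musculus", "Homo sapiens")  # what the original post observed
```

Calling `check_record("AH116340")` would need network access or a populated cache, so it is left out of the block above.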

lshep commented 2 months ago

Agreed. I cannot reproduce this either:

> temp= ah[["AH116340"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
require("ensembldb")
> temp
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Tue Jan 16 12:35:18 2024
|ensembl_version: 111
|ensembl_host: localhost
|Organism: Mus musculus
|taxonomy_id: 10090
|genome_build: GRCm39
|DBSCHEMAVERSION: 2.2
|common_name: mouse
|species: mus_musculus
| No. of genes: 57180.
| No. of transcripts: 149132.
|Protein data available.

This shows Ensembl version 111.

We don't like to encourage forced re-downloads, and it should only be done once, but you can try manually re-downloading the local copy using the force=TRUE option: ah[["AH116340", force=TRUE]]
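
As a rough sketch of that suggestion, the call can be wrapped so that the replaced copy is inspected right away. `redownload_record()` is a hypothetical name used only for illustration; the `force = TRUE` argument is the one mentioned above, and the sketch assumes AnnotationHub is installed and the hub is reachable.

```r
# Hypothetical wrapper around the forced re-download.
# Assumes AnnotationHub is installed and the hub is reachable.
redownload_record <- function(id, hub = AnnotationHub::AnnotationHub()) {
  # force = TRUE skips the cached copy and downloads the resource again
  hub[[id, force = TRUE]]
}

# Usage (needs network access, so it is not run here):
# edb <- redownload_record("AH116340")
# edb   # Organism and ensembl_version should now agree with the hub metadata
```

Since this replaces the file in the local cache, it should be enough to run it once for the affected ID.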

mtmorgan commented 2 months ago

What about setting snapshotDate(ah) <- "2023-10-23", as in the original post? Also, does the fact that the (current) AH116340 has a creation date after the snapshot date indicate anything?
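
To make the date comparison concrete, here is a small sketch using the creation times printed earlier in this thread. `parse_creation_date()` and `hub_at_snapshot()` are hypothetical helpers, not AnnotationHub functions; the second one assumes `snapshotDate<-` is exported by AnnotationHub and needs network access.

```r
# Hypothetical helper: turn a creation time as printed by an EnsDb
# (e.g. "Tue Jan 16 12:35:18 2024") into a Date. It uses R's built-in
# English month abbreviations, so the result does not depend on the locale.
parse_creation_date <- function(x) {
  parts <- strsplit(trimws(x), " +")[[1]]  # weekday, month, day, time, year
  as.Date(sprintf("%s-%02d-%02d", parts[5],
                  match(parts[2], month.abb), as.integer(parts[3])))
}

snapshot <- as.Date("2023-10-23")
created_new <- parse_creation_date("Tue Jan 16 12:35:18 2024")  # current AH116340
created_old <- parse_creation_date("Sat Dec 18 14:48:15 2021")  # database from the original post
created_new > snapshot  # TRUE: built after the snapshot date
created_old > snapshot  # FALSE: built well before it

# Hypothetical helper: pin the hub to the snapshot used in the original post.
# Assumes AnnotationHub is installed and the hub is reachable.
hub_at_snapshot <- function(date = "2023-10-23") {
  hub <- AnnotationHub::AnnotationHub()
  AnnotationHub::snapshotDate(hub) <- date
  hub
}

# Usage (not run here):
# ah <- hub_at_snapshot()
# ah[["AH116340"]]
```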

bioinf-kud commented 2 months ago

Thanks. While I was downloading that resource, I ran into a network disconnection and re-downloaded it after reconnecting, so maybe something went wrong during that process. I'll delete the local file and try again.

bioinf-kud commented 2 months ago

Thanks @lshep, re-downloading with "force=TRUE" worked.

lshep commented 2 months ago

Glad that worked.