bridgedb / create-bridgedb-metabolites

Create BridgeDb identity mapping files from HMDB, ChEBI, and Wikidata

[BUG] Mapping ChEMBL IDs to ChEBI IDs using the 16/04/2024 metabolite bridge file produces inconsistent duplicate ChEBI IDs #45

Open pklemmer opened 2 months ago

pklemmer commented 2 months ago

Describe the bug

Using maps() to map ChEMBL IDs from an input df like:

| source | identifier |
| --- | --- |
| Cl | CHEMBL1091 |
| Cl | CHEMBL11 |
| Cl | CHEMBL99 |

to ChEBI IDs, using the metabolites_20240416.bridge file as the loadDatabase() argument, produces inconsistently mapped duplicate ChEBI IDs:

| source | identifier | target | mapping | isPrimary |
| --- | --- | --- | --- | --- |
| Cl | CHEMBL1091 | Ce | CHEBI:17609 | T |
| Cl | CHEMBL1091 | Ce | 17609 | F |

but also both duplicate IDs being indicated as primary:

| source | identifier | target | mapping | isPrimary |
| --- | --- | --- | --- | --- |
| Cl | CHEMBL11 | Ce | CHEBI:47499 | T |
| Cl | CHEMBL11 | Ce | 47499 | T |

or even duplicate IDs being indicated as both true and false primary IDs:

| source | identifier | target | mapping | isPrimary |
| --- | --- | --- | --- | --- |
| Cl | CHEMBL1152 | Ce | CHEBI:8380 | T |
| Cl | CHEMBL1152 | Ce | 8380 | F |
| Cl | CHEMBL1152 | Ce | 8380 | T |
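
To spot these three patterns in a larger result, one option is to strip the optional "CHEBI:" prefix and count rows per ChEMBL/ChEBI pair. The sketch below uses base R only and assumes the maps() output is a data frame with the five columns shown above. The `mapped` data frame is rebuilt by hand from the rows in this report, and `flag_duplicates()` is a hypothetical helper, not part of BridgeDbR.

```R
# Rows copied by hand from the maps() output shown above.
mapped <- data.frame(
  source     = "Cl",
  identifier = c("CHEMBL1091", "CHEMBL1091", "CHEMBL11", "CHEMBL11",
                 "CHEMBL1152", "CHEMBL1152", "CHEMBL1152"),
  target     = "Ce",
  mapping    = c("CHEBI:17609", "17609", "CHEBI:47499", "47499",
                 "CHEBI:8380", "8380", "8380"),
  isPrimary  = c("T", "F", "T", "T", "T", "F", "T"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarise each (identifier, ChEBI number) pair and
# keep only the pairs that occur more than once.
flag_duplicates <- function(df) {
  # Compare IDs without the optional "CHEBI:" prefix.
  df$chebi_number <- sub("^CHEBI:", "", df$mapping)
  key <- paste(df$identifier, df$chebi_number, sep = "|")
  rows <- lapply(split(df, key), function(g) {
    data.frame(
      identifier    = g$identifier[1],
      chebi_number  = g$chebi_number[1],
      n_rows        = nrow(g),
      spellings     = paste(sort(unique(g$mapping)), collapse = ", "),
      primary_flags = paste(sort(unique(g$isPrimary)), collapse = "/"),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out[out$n_rows > 1, ]
}

report <- flag_duplicates(mapped)
print(report)
```

A `primary_flags` value of "F/T" marks a pair whose duplicates disagree on isPrimary, while an `n_rows` above 1 with a single flag marks duplicates that agree.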

Provide a minimally reproducible example (reprex)

The 'identifiers' argument for the maps() function is an input dataframe such as:

| source | identifier |
| --- | --- |
| Cl | CHEMBL1091 |
| Cl | CHEMBL11 |
| Cl | CHEMBL99 |

which was generated like this:

metabolite_input <- data.frame( source = rep("Cl", length(mapped_chembls[, 1])), identifier = mapped_chembls[, 1] )

where mapped_chembls is a data frame with a single column containing one CHEMBL ID in the format 'CHEMBL123' per row.

The 'mapper' argument is an absolute file path like:

"C:/Users/user/Documents/GitHub/repo/BridgeDb/metabolites_20240416.bridge"

and the 'target' argument is 'Ce' to map to ChEBI.
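
Put together, the call described above might look roughly like the sketch below. It is wrapped in a function because it needs BridgeDbR, rJava, and a local copy of the bridge file. The function name `run_chembl_to_chebi()` and its parameters are hypothetical. The sketch assumes that loadDatabase() takes the file path and returns the mapper object passed to maps(), and the argument names `identifiers` and `target` are taken from the wording of this report rather than checked against the package documentation.

```R
# Sketch of the mapping call from this report. Defining the function does not
# need BridgeDbR; calling it does (plus rJava and the .bridge file).
run_chembl_to_chebi <- function(bridge_path, chembl_ids) {
  # Assumption: loadDatabase() takes the path to the .bridge file and
  # returns the mapper object that maps() expects.
  mapper <- BridgeDbR::loadDatabase(bridge_path)

  # Same input layout as in the report: "Cl" is the ChEMBL source code.
  metabolite_input <- data.frame(
    source     = rep("Cl", length(chembl_ids)),
    identifier = chembl_ids,
    stringsAsFactors = FALSE
  )

  # Assumption: argument names follow the report ('identifiers', 'target').
  # "Ce" is the ChEBI target code.
  BridgeDbR::maps(mapper, identifiers = metabolite_input, target = "Ce")
}

# Example call (not run here):
# result <- run_chembl_to_chebi(
#   "C:/Users/user/Documents/GitHub/repo/BridgeDb/metabolites_20240416.bridge",
#   c("CHEMBL1091", "CHEMBL11", "CHEMBL99")
# )
```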

Expected behavior

I believe that ChEBI IDs are typically associated with single unique ChEMBL IDs, so an ideal output should look like:

| source | identifier | target | mapping | isPrimary |
| --- | --- | --- | --- | --- |
| Cl | CHEMBL1152 | Ce | CHEBI:8380 | T |

With the "CHEBI:" prefix in front of the actual ID.
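
Until the bridge file is fixed, a possible client-side workaround is to normalise the prefix and collapse the duplicates after mapping. The sketch below uses base R only. `normalise_chebi()` is a hypothetical helper, and treating a merged row as primary when any of its duplicates was marked "T" is an assumption, not documented BridgeDb behaviour.

```R
# One group of raw rows copied from the bug description above.
raw <- data.frame(
  source     = "Cl",
  identifier = "CHEMBL1152",
  target     = "Ce",
  mapping    = c("CHEBI:8380", "8380", "8380"),
  isPrimary  = c("T", "F", "T"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: force the "CHEBI:" prefix, then keep one row per
# source/identifier/target/mapping combination.
normalise_chebi <- function(df) {
  df$mapping <- paste0("CHEBI:", sub("^CHEBI:", "", df$mapping))
  df$primary <- df$isPrimary == "T"
  # Assumption: the merged row is primary if any of its duplicates was.
  out <- aggregate(primary ~ source + identifier + target + mapping,
                   data = df, FUN = any)
  out$isPrimary <- ifelse(out$primary, "T", "F")
  out$primary <- NULL
  out
}

cleaned <- normalise_chebi(raw)
print(cleaned)
```

For the three CHEMBL1152 rows this yields the single prefixed row given as the expected output.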

R Session Information

Please report the output of either sessionInfo() or sessioninfo::session_info() here.

```R
options(width = 120)
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_Europe.utf8 LC_CTYPE=English_Europe.utf8 LC_MONETARY=English_Europe.utf8 LC_NUMERIC=C LC_TIME=English_Europe.utf8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] reprex_2.1.0 curl_5.2.1 BridgeDbR_2.10.2 rJava_1.0-11 RCy3_2.20.2 rWikiPathways_1.20.0 tidyr_1.3.1 rvest_1.0.4
 [9] gprofiler2_0.2.3 stringr_1.5.1 httr_1.4.7 dplyr_1.1.4

loaded via a namespace (and not attached):
 [1] gtable_0.3.4 rjson_0.2.21 ggplot2_3.5.0 htmlwidgets_1.6.4 caTools_1.18.2 vctrs_0.6.5 tools_4.3.3 bitops_1.0-7
 [9] generics_0.1.3 stats4_4.3.3 base64url_1.4 tibble_3.2.1 fansi_1.0.6 pkgconfig_2.0.3 KernSmooth_2.23-22 data.table_1.15.4
[17] RColorBrewer_1.1-3 uuid_1.2-0 graph_1.78.0 lifecycle_1.0.4 compiler_4.3.3 gplots_3.1.3.1 munsell_0.5.1 repr_1.1.7
[25] uchardet_1.1.1 htmltools_0.5.8.1 RCurl_1.98-1.14 lazyeval_0.2.2 plotly_4.10.4 pillar_1.9.0 crayon_1.5.2 gtools_3.9.5
[33] tidyselect_1.2.1 digest_0.6.35 stringi_1.8.3 purrr_1.0.2 RJSONIO_1.3-1.9 fastmap_1.1.1 grid_4.3.3 colorspace_2.1-0
[41] cli_3.6.2 magrittr_2.0.3 base64enc_0.1-3 XML_3.99-0.16.1 utf8_1.2.4 IRdisplay_1.1 withr_3.0.0 scales_1.3.0
[49] backports_1.4.1 IRkernel_1.3.2 pbdZMQ_0.3-10 evaluate_0.23 viridisLite_0.4.2 rlang_1.1.3 glue_1.7.0 selectr_0.4-2
[57] BiocManager_1.30.22 xml2_1.3.6 BiocGenerics_0.46.0 pkgload_1.3.4 rstudioapi_0.16.0 jsonlite_1.8.8 R6_2.5.1 fs_1.6.3
```

Indicate whether BiocManager::valid() returns TRUE.

BiocManager::valid() returns "4 packages out-of-date; 0 packages too new"

Is the package installed via bioconda?

BridgeDbR is installed via BiocManager.

egonw commented 2 months ago

Thanks for filing the issue! I need to get some details together. The problem is probably in the ID mapping file, and therefore caused by how we create it, hence the transfer.