Open pklemmer opened 2 months ago
Thanks for filing the issue! I need to get some details together. The problem is probably in the ID mapping file, and therefore caused by how we create it, hence the transfer.
Describe the bug
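Using maps() to map ChEMBL IDs from an input data frame like:

| source | identifier |
|--------|------------|
| Cl     | CHEMBL1091 |
| Cl     | CHEMBL11   |
| Cl     | CHEMBL99   |

to ChEBI IDs, using the metabolites_20240416.bridge file as the loadDatabase() argument, produces duplicate ChEBI IDs that are mapped inconsistently. In some cases the prefixed ID is marked as primary and the bare ID is not:

| source | identifier | target | mapping     | isPrimary |
|--------|------------|--------|-------------|-----------|
| Cl     | CHEMBL1091 | Ce     | CHEBI:17609 | T         |
| Cl     | CHEMBL1091 | Ce     | 17609       | F         |

In other cases both duplicates are marked as primary:

| source | identifier | target | mapping     | isPrimary |
|--------|------------|--------|-------------|-----------|
| Cl     | CHEMBL11   | Ce     | CHEBI:47499 | T         |
| Cl     | CHEMBL11   | Ce     | 47499       | T         |

And in some cases the same duplicate ID is marked as both primary and non-primary:

| source | identifier | target | mapping    | isPrimary |
|--------|------------|--------|------------|-----------|
| Cl     | CHEMBL1152 | Ce     | CHEBI:8380 | T         |
| Cl     | CHEMBL1152 | Ce     | 8380       | F         |
| Cl     | CHEMBL1152 | Ce     | 8380       | T         |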
Provide a minimally reproducible example (reprex)
The 'identifiers' argument for the maps() function is an input data frame such as:

| source | identifier |
|--------|------------|
| Cl     | CHEMBL1091 |
| Cl     | CHEMBL11   |
| Cl     | CHEMBL99   |

which was generated like this:

    metabolite_input <- data.frame(
      source = rep("Cl", length(mapped_chembls[, 1])),
      identifier = mapped_chembls[, 1]
    )

where mapped_chembls is a data frame with a single column containing one CHEMBL ID in the format 'CHEMBL123' per row.
The 'mapper' argument is the mapper obtained by passing an absolute file path to loadDatabase(), like:
"C:/Users/user/Documents/GitHub/repo/BridgeDb/metabolites_20240416.bridge"
and the 'target' argument is 'Ce' to map to ChEBI.
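
Put together, the steps above could look roughly like the sketch below. It assumes the three example IDs from the table, a hypothetical column name `chembl_id` for mapped_chembls, and that maps() is called as maps(mapper, identifiers, target) on the object returned by loadDatabase(). The BridgeDbR calls only run if the package is installed and the .bridge file exists at the path given.

```r
# Example IDs taken from the table in this report.
# The column name "chembl_id" is hypothetical; only the first column is used.
mapped_chembls <- data.frame(
  chembl_id = c("CHEMBL1091", "CHEMBL11", "CHEMBL99"),
  stringsAsFactors = FALSE
)

# Input for maps(): one row per identifier, all with system code "Cl" (ChEMBL).
metabolite_input <- data.frame(
  source = rep("Cl", nrow(mapped_chembls)),
  identifier = mapped_chembls[, 1],
  stringsAsFactors = FALSE
)

# Path as given in the report; adjust to the local copy of the mapping file.
bridge_file <- "C:/Users/user/Documents/GitHub/repo/BridgeDb/metabolites_20240416.bridge"

# Only attempt the mapping when BridgeDbR and the file are actually available.
result <- NULL
if (requireNamespace("BridgeDbR", quietly = TRUE) && file.exists(bridge_file)) {
  mapper <- BridgeDbR::loadDatabase(bridge_file)
  result <- BridgeDbR::maps(mapper, metabolite_input, "Ce")  # "Ce" = ChEBI
}
```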
Expected behavior
I believe that ChEBI IDs are typically associated with a single unique ChEMBL ID, so an ideal output should contain one row per mapping, with the "CHEBI:" prefix in front of the actual ID:

| source | identifier | target | mapping    | isPrimary |
|--------|------------|--------|------------|-----------|
| Cl     | CHEMBL1152 | Ce     | CHEBI:8380 | T         |
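
Until the mapping file itself is corrected, the expected shape can be approximated on the user side. The sketch below is only a workaround, not a fix for the underlying data: it uses the rows quoted in this report, assumes isPrimary holds the strings "T"/"F" as displayed, and uses a hypothetical helper name normalize_chebi(). It adds the missing "CHEBI:" prefix to ChEBI mappings, collapses the resulting duplicates, and treats a mapping as primary if any of its copies was primary.

```r
# Rows copied from the report; isPrimary is assumed to be the strings "T"/"F".
mapped <- data.frame(
  source     = "Cl",
  identifier = c("CHEMBL1091", "CHEMBL1091", "CHEMBL11", "CHEMBL11",
                 "CHEMBL1152", "CHEMBL1152", "CHEMBL1152"),
  target     = "Ce",
  mapping    = c("CHEBI:17609", "17609", "CHEBI:47499", "47499",
                 "CHEBI:8380", "8380", "8380"),
  isPrimary  = c("T", "F", "T", "T", "T", "F", "T"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: normalize ChEBI IDs and collapse duplicate rows.
normalize_chebi <- function(df) {
  # Add the "CHEBI:" prefix only to ChEBI targets that lack it.
  bare <- df$target == "Ce" & !startsWith(df$mapping, "CHEBI:")
  df$mapping[bare] <- paste0("CHEBI:", df$mapping[bare])

  # A mapping counts as primary if any of its duplicate rows was primary.
  primary <- as.character(df$isPrimary) %in% c("T", "TRUE")
  key <- paste(df$source, df$identifier, df$target, df$mapping, sep = "|")
  any_primary <- tapply(primary, key, any)

  # Keep the first row of each group and write back the merged flag.
  keep <- !duplicated(key)
  out <- df[keep, , drop = FALSE]
  out$isPrimary <- ifelse(as.vector(any_primary[key[keep]]), "T", "F")
  rownames(out) <- NULL
  out
}

deduplicated <- normalize_chebi(mapped)
```

Whether "primary if any copy is primary" is the right rule depends on how the isPrimary flag is meant to be assigned in the mapping file, which this report does not settle.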
R Session Information
Please report the output of either sessionInfo() or sessioninfo::session_info() here.
Indicate whether BiocManager::valid() returns TRUE.
BiocManager::valid() returns "4 packages out-of-date; 0 packages too new"
Is the package installed via bioconda?
BridgeDbR is installed via BiocManager.