What nomenclature is used for the CMap and LINCS L1000 compounds?

enricoferrero commented 7 years ago

Not really an issue, just a question I couldn't find the answer to in the docs.

What nomenclature or ontology is used to name compounds in the CMap and LINCS L1000 PharmacoSets and perturbation signatures?

E.g.:

> cmap.pertsig <- downloadPertSig("CMAP")
> head(cmap.pertsig[1,,])
                     estimate      tstat     pvalue       fdr
metformin        3.092869e-04  1.4452762 0.15810292 0.7198122
phenformin       3.435874e-03  0.7029627 0.48700895 0.9353019
valproic acid   -2.630498e-05 -1.8301544 0.06983943 0.2197652
estradiol       -6.535347e-03 -1.1583861 0.24912744 0.8805570
alpha-estradiol -8.883201e-01 -0.2010900 0.84117989 0.9877316
dexamethasone   -3.806017e-03 -0.5188979 0.60740221 0.9707613

Where do the names metformin, phenformin, etc. come from? Are unique IDs (e.g. ChEMBL IDs) stored somewhere in the PharmacoSet or perturbation signature objects?

Thanks!

p-smirnov commented 7 years ago

Hi @enricoferrero:

For the CMAP dataset, if you are able to load the PharmacoSet object into R then running drugInfo(CMAP) will give a table of data about the drugs.

As for universal identifiers, we only have the ChemBank ID (CBID column) for these drugs, as provided by the original study authors.
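
As a rough illustration of that lookup, here is a minimal sketch. It assumes the drug table uses compound names as row names and has a CBID column as described above. The mock table, its placeholder ID values, and the helper name lookup_cbid are all made up for the example; with a loaded PharmacoSet you would pass drugInfo(CMAP) in place of the mock table.

```r
# Mock stand-in for drugInfo(CMAP). The CBID values are invented placeholders,
# not real ChemBank IDs.
drug_table <- data.frame(
  CBID = c("1001", "1002", NA),
  row.names = c("metformin", "phenformin", "valproic acid"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map compound names to ChemBank IDs by exact match on
# the row names. Names absent from the table, or without a CBID, give NA.
lookup_cbid <- function(drug_names, info) {
  ids <- info$CBID[match(drug_names, rownames(info))]
  names(ids) <- drug_names
  ids
}

cbids <- lookup_cbid(c("metformin", "valproic acid", "estradiol"), drug_table)
print(cbids)
```

Using match() rather than character row indexing avoids the partial matching that data frames apply to row names, which matters when one compound name is a prefix of another.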

For the L1000 data, the study authors provided "canonical_smiles", "inchi_key", and "inchi_string" columns in their annotations, which are also stored in the drugInfo(L1000_compounds) table. However, in our experience these columns contain both missing and incorrect entries, which is why these IDs were not used as the drug identifiers inside the PharmacoSet objects.
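
Given those caveats, it may be worth screening the inchi_key column before relying on it. The sketch below assumes an inchi_key column as described above; the mock table and the helper name has_valid_inchikey are hypothetical. It only checks that a key is present and follows the standard 14-10-1 uppercase-letter layout, so it cannot detect a well-formed key that was assigned to the wrong compound.

```r
# Mock stand-in for drugInfo(L1000_compounds); the rows are illustrative only.
l1000_info <- data.frame(
  inchi_key = c("XZWYZXLIPXDOLR-UHFFFAOYSA-N", NA, "", "not-a-key"),
  row.names = c("metformin", "compound_b", "compound_c", "compound_d"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where the key is present and matches the standard
# InChIKey layout (14 letters, hyphen, 10 letters, hyphen, 1 letter).
# This is a format check only, not a check that the key is the right one.
has_valid_inchikey <- function(keys) {
  !is.na(keys) & grepl("^[A-Z]{14}-[A-Z]{10}-[A-Z]$", keys)
}

valid <- has_valid_inchikey(l1000_info$inchi_key)

# Keep only the compounds with a usable key.
usable <- l1000_info[valid, , drop = FALSE]
print(rownames(usable))
```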

enricoferrero commented 7 years ago

@p-smirnov: thank you, the InChIKey should be handy. What type of ID is the one in the pert_id column? UniChem does not recognise it as a LINCS identifier.

bhklab / PharmacoGx

What nomenclature is used for the CMap and LINCS L1000 compounds? #21