help with querying - Githubissues

Hello, I am trying to understand the schema and would like to, for example, find all drugs for a given gene. I am looking at TOP1 gene as an example. First, I get all of the molregno for that gene with (I am using an ORM so it's a bit weird on the aliases):

('SELECT DISTINCT "t1"."molregno" FROM "molecule_dictionary" AS "t1" LEFT OUTER JOIN "compound_records" AS "t2" ON ("t1"."molregno" = "t2"."molregno") LEFT OUTER JOIN "activities" AS "t3" ON ("t3"."record_id" = "t2"."record_id") LEFT OUTER JOIN "assays" AS "t4" ON ("t4"."assay_id" = "t3"."assay_id") LEFT OUTER JOIN "target_dictionary" AS "t5" ON ("t5"."tid" = "t4"."tid") LEFT OUTER JOIN "drug_mechanism" AS "t6" ON (("t6"."molregno" = "t3"."molregno") AND ("t6"."tid" = ?)) WHERE (("t5"."chembl_id" = ?) AND ("t1"."max_phase" IS NOT ?))', [62, 'CHEMBL1781', None])

Then, following https://github.com/chembl/chembl_data_issues/issues/3#issuecomment-1788917142, I get all associated molregnos for each of those (one at a time in a loop for now) with:

('SELECT "t1"."molregno" FROM "molecule_hierarchy" AS "t1" WHERE (("t1"."parent_molregno" = ?) OR ("t1"."molregno" = ?))', [1763, 1763])

I think this then gives me all of the molregnos that I would look at. Then I query for molregno in [list] with:

('SELECT DISTINCT [SNIP_LOTS_OF_COLUMNS] FROM "molecule_dictionary" AS "t1" 
    LEFT OUTER JOIN "compound_records" AS "t3" ON ("t1"."molregno" = "t3"."molregno") 
    LEFT OUTER JOIN "activities" AS "t4" ON ("t4"."record_id" = "t3"."record_id") 
    LEFT OUTER JOIN "assays" AS "t5" ON ("t5"."assay_id" = "t4"."assay_id") 
    LEFT OUTER JOIN "target_dictionary" AS "t6" ON ("t6"."tid" = "t5"."tid") 
    LEFT OUTER JOIN "drug_mechanism" AS "t2" ON (("t2"."molregno" = "t4"."molregno") AND ("t2"."tid" = ?)) 
WHERE (("t1"."molregno" IN (?, ?, ?)) AND ("t1"."max_phase" IS NOT ?))', [62, 1150481, 1763, 1955919, None])

This gives me sane results, however, in the web output for TOP1 there is AFELETECAN in the 22 Drugs and Clinical Candidates but that does not appear in the set of results from my queries. I'd like to understand why that's the case.

There are also many drugs that are in my list, most notably TOPOTECAN that do not appear in the list from the web. Why is that the case?

I know this is a complex query and issue, so please let me know if I can add any information. The full list of drugs I get for TOP1 is:

7-ETHYL-10-HYDROXYCAMPTOTHECIN,9-AMINOCAMPTOTHECIN,AMONAFIDE,AMONAFIDE L-MALATE,AMSACRINE,ARTESUNATE,BETULINIC ACID,CAMPTOTHECIN,CURCUMIN,DAIDZEIN,DANIQUIDONE,DAUNORUBICIN,DAUNORUBICIN CITRATE,DAUNORUBICIN HYDROCHLORIDE,DIFLOMOTECAN,DOXORUBICIN,DOXORUBICIN HYDROCHLORIDE,EDOTECARIN,ETOPOSIDE,FISETIN,GENISTEIN,GENZ-644282,GIMATECAN,GINSENOSIDE RD,GINSENOSIDE RE,HELIOMYCIN,HOMIDIUM BROMIDE,HYDROXYCAMPTOTHECIN,IRINOTECAN,IRINOTECAN HYDROCHLORIDE,LMP-744,LURTOTECAN,LURTOTECAN DIHYDROCHLORIDE,LUTEOLIN,OLEIC ACID,PALMITIC ACID,PLUMBAGIN,PODOFILOX,PYRAZOLOACRIDINE,QUERCETIN,RESCINNAMINE,RESERPINE,RIBAVIRIN,SECURININE,STALLIMYCIN,STALLIMYCIN HYDROCHLORIDE,TENIPOSIDE,TOLFENAMIC ACID,TOPOTECAN,TOPOTECAN HYDROCHLORIDE,YOHIMBINE,YOHIMBINE HYDROCHLORIDE

thanks in advance!

Hello,

Thanks for the question.

Assays are the experiments that record the biological activity of drug-like molecules against biological targets. Tested compounds may be approved drugs, clinical candidates, or other drug-like compounds. The biological targets span a range of molecular and non-molecular entities (e.g. single proteins, protein complexes, nucleic acids, tissues, and cells). We include both positive and negative data in ChEMBL and so a compound may be active or inactive against the target. If you’re interested only in active compounds, then you will need to filter the bioactivity data.

Therapeutic targets (those involved in the mechanism for a given indication) are manually curated and captured separately in the drug mechanisms table. Please note it’s possible that a mechanism has been curated for a drug but the drug does not have bioactivity data associated with since it hasn’t appeared in our core journals (https://chembl.gitbook.io/chembl-interface-documentation/frequently-asked-questions/general-questions#what-literature-coverage-is-there-in-chembl).

Overall, the drug_mechanisms are manually curated with the putative therapeutic target whereas the assay and activity tables record literature-derived data where drugs have been tested against a range of targets and may be active or inactive.

The query above extracts both bioactivity and drug mechanism data from ChEMBL. Since the molregnos are linked via the activities table to the drug mechanisms table, only drugs that have a mechanism AND bioactivity data in ChEMBL will be identified. Therefore, AFLETECAN is omitted since we’ve curated the mechanism, but no bioactivity data has been extracted. Please note that the query will also return compounds tested against TOP1 (whether active or inactive) where TOP1 is not the therapeutic target.

Thanks for flagging up the issue with Topetecan; the drug has been mapped to this target -https://www.ebi.ac.uk/chembl/target_report_card/CHEMBL2362989/ and so we’ll review and fix.

Hi Emma, thanks very much for the thorough replay and please excuse my slow response. I have 2 questions:

You write:

If you’re interested only in active compounds, then you will need to filter the bioactivity data.

Can you give more guidance about how to do that? I assume it's using the activities table, but don't see a clear way to do this.

You write:

The query above extracts both bioactivity and drug mechanism data from ChEMBL. Since the molregnos are linked via the activities table to the drug mechanisms table, only drugs that have a mechanism AND bioactivity data in ChEMBL will be identified.

I think I understand what you mean, but I haven't been able to conjure up the query/queries to do this. Could you also show a bit more how I might get output similar to your web page without requiring the link through those tables?

thanks in advance.

Thanks for the feedback.

I’ve included an example query of how you can extract drug_mechanisms from a local version of the ChEMBL database (below). However, you will need to review this query and check that this is suitable for your research question. Please note that some compounds are found in several alternative forms in ChEMBL (https://chembl.blogspot.com/2020/09/molecule-hierarchy.html). On the ChEMBL interface, we typically map mechanisms across the parent compound as well as approved alternative forms. In the underlying mechanisms table, the mechanisms may be mapped to either the parent compounds or approved forms when known. It is possible to retrieve a mechanism mapped to any member of a compound family by linking alternative forms through the molecule_hierarchy table.

select distinct md2.chembl_id, md2.pref_name, md.pref_name as alternative_form_pref_name, md.chembl_ID as alternative_form_chembl_ID, dm.*
from chembl_33.molecule_dictionary md
join chembl_33.compound_records cr on cr.molregno = md.molregno
join chembl_33.molecule_hierarchy mh on md.molregno = mh.molregno
join chembl_33.molecule_dictionary md2 on mh.parent_molregno =md2.molregno
left join chembl_33.drug_mechanism dm on dm.molregno = md.molregno -- join across all alternative forms rather than by parent
where cr.src_ID in (8, 9, 12, 13, 36, 41, 42, 53,63) -- these are our drug data sources
and mec_ID is not null
;

The bioactivity data is stored in the ACTIVITIES table. We extract data directly from the literature and standardise this where possible. For example, the literature may contain % inhibition data (e.g. type = Inhibition, relation = ‘=’, value = 50, units = %) or could provide IC50 data (e.g. type = ‘IC50’, relation = ‘=’, value = 10, units = uM). We run all the data through an activity standardiser to convert the raw data to standard data for ChEMBL. During this process, certain activity types undergo standardisation whereas other values are transferred ‘as they are’ into these 'standard_' fields which contain the activity data post-standardisation. You can see the data types that undergo standardisation in the standard_types_lookup table. To filter the bioactivity data, you could decide on a threshold for active compounds (e.g. 100 nM, 1 uM, 10 uM) and filter compounds according to your threshold (e.g. < 1 uM). Alternatively, you could consider using the pChEMBL value (https://chembl.gitbook.io/chembl-interface-documentation/ frequently-asked-questions/chembl-data-questions#what-is-pchembl). It's always worth considering the activity_comment alongside your extracted data (https://chembl.blogspot.com/2020/08/using-chembl-activity-comments.html) as well as performing other data checks (https://chembl.blogspot.com/2020/10/data-checks.html). We also include non-numerical values and these can be found as an activity_comment or text_value (e.g. active, inactive, toxic). We have some example SQL queries that could provide a starting point (https://chembl.gitbook.io/chembl-interface-documentation/frequently-asked-questions/schema-questions-and-sql-examples) but please get back in touch if you would like more guidance.

Our training materials may also be useful and can be found here -

https://chembl.gitbook.io/chembl-interface-documentation/training-material

Some further information on the bioactivity data can be found here:

https://chembl.gitbook.io/chembl-interface-documentation/ frequently-asked-questions/chembl-data-questions# can-you-provide-more-details-on-bioactivity-data-standardisation

https://chembl.gitbook.io/chembl-interface-documentation/ frequently-asked-questions/chembl-data-questions# why-are-there-so-many-different-types-of-standard-units-in-the-database

Hopefully this helps but please let us know if you have further questions.

Thank you very much!

chembl / chembl_data_issues

help with querying #10