MaastrichtU-IDS / predict-drug-target

Using ESM2 protein embeddings and MolecularTransformer drug embeddings to train a linear classifier to predict potential drug-target interactions
https://predict-drug-target.137.120.31.160.nip.io
MIT License

Get SMILES and AA sequences #1

Open vemonet opened 1 year ago

vemonet commented 1 year ago

Get SMILES for a PubChem Compound (here for aspirin, CID 2244):

Get AA sequence for a protein (check the sequence key):
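A rough sketch of both lookups, assuming the PubChem PUG REST property endpoint and the EBI Proteins API (the same service linked further down this thread). The helper names are hypothetical, and the URL patterns and response layouts are assumptions to check against a live response. In particular, the name of the SMILES property returned by PubChem may differ, so the parser accepts any key ending in "SMILES".

```python
import json
from urllib.request import Request, urlopen

# Assumed URL patterns; verify them against the PubChem and EBI documentation.
PUBCHEM_URL = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}"
    "/property/CanonicalSMILES/JSON"
)
EBI_PROTEIN_URL = "https://www.ebi.ac.uk/proteins/api/proteins/{accession}"


def fetch_json(url: str):
    """Download a URL and parse the response body as JSON."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def smiles_from_pubchem(payload: dict) -> str:
    """Return the first SMILES-like property of the first compound in the payload."""
    properties = payload["PropertyTable"]["Properties"][0]
    for key, value in properties.items():
        if key.upper().endswith("SMILES"):
            return value
    raise KeyError("no SMILES property found in the PubChem response")


def sequence_from_ebi(payload: dict) -> str:
    """Return the amino-acid sequence stored under the 'sequence' key."""
    return payload["sequence"]["sequence"]
```

For aspirin this would be called as `smiles_from_pubchem(fetch_json(PUBCHEM_URL.format(cid=2244)))`. The parsing is kept separate from the download so it can be checked against a saved response without hitting the API.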

sandrine-muller-research commented 1 year ago

You can get SMILES from MolePro as well. Depending on your input types (IDs or chemical names), you can use some of these endpoints: https://molepro.broadinstitute.org/molecular_data_provider/assets/lib/swagger-ui/index.html?url=/molecular_data_provider/assets/openapi.json

With a POST query to /compound/by_id, you'll get the following JSON:

We have put in place a curated way to select the best structures given chemical names, where some entries have already been curated (the by-name endpoint is still a work in progress and its curation is ongoing, but it works pretty well).

vemonet commented 1 year ago

Thanks a lot @sandrine-muller-research! Just CHEMBL IDs is quite limited, so I am interested in anything that covers a wider range of IDs. And MolePro seems to have a really nice API.

But I lack knowledge of the SMILES system, maybe you can enlighten me!

For some compounds the MolePro API is returning multiple elements, e.g. for CHEMBL.COMPOUND:CHEMBL535 we get 2 elements:

When I use the EBI API I get a single "canonical_smiles" for CHEMBL535: CCN(CC)CCNC(=O)c1c(C)[nH]c(/C=C2\C(=O)Nc3ccc(F)cc32)c1C

Are canonical SMILES different from "regular" SMILES? Can I easily generate a compound's canonical SMILES from the SMILES of its elements?

vemonet commented 1 year ago

According to chatty jeepity it should be as simple as this:

```python
from rdkit import Chem

# SMILES representations of the elements.
# Hydrogens are implicit in SMILES, so a bare 'H' is not a valid atom.
smiles_carbon = 'C'  # parsed as methane (CH4)
smiles_oxygen = 'O'  # parsed as water (H2O)

# Combine the SMILES of the elements with '.', the SMILES separator
# for disconnected parts of one compound
compound_smiles = '.'.join([smiles_carbon, smiles_oxygen])

# Generate the canonical SMILES
compound_molecule = Chem.MolFromSmiles(compound_smiles)

if compound_molecule is not None:
    canonical_smiles = Chem.MolToSmiles(compound_molecule, isomericSmiles=False)
    print(f'Canonical SMILES of the compound: {canonical_smiles}')
else:
    print('Invalid SMILES for the compound')
```

vemonet commented 1 year ago

One of the problems we faced: OpenTargets uses ENSEMBL gene IDs instead of protein IDs, even though most of the interactions they describe are between drugs and proteins, not drugs and genes.

But a gene can code for many proteins, so the interactions shared by OpenTargets are highly ambiguous and need to be manually fixed. Why couldn't they directly use protein IDs? That's a big question...

Also, the following APIs do not allow us to send bulk requests to find sequences: PubChem, ChEMBL, Ensembl.

So we need to send around 5000 requests to get sequences for all our drugs/proteins. That is quite intensive for their APIs, which fail for a lot of requests. It would have been so easy for them to implement bulk calls, but that would have reduced the number of queries made to their service, which is probably the number they report to get funding (so they want it to be high, even if it means making their service worse).

Not really optimal
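Since many of these one-by-one requests fail, one possible way to cope is to retry each ID a few times with an increasing delay and keep a list of the IDs that never succeed. This is only a sketch: `fetch_all` and `fetch_one` are hypothetical names, and `fetch_one` stands for whatever function performs a single API call and raises on failure.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def fetch_all(
    ids: Iterable[str],
    fetch_one: Callable[[str], str],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, str], List[str]]:
    """Fetch every ID one by one, retrying failures with exponential backoff."""
    results: Dict[str, str] = {}
    failed: List[str] = []
    for item_id in ids:
        for attempt in range(max_attempts):
            try:
                results[item_id] = fetch_one(item_id)
                break
            except Exception:
                # Wait 1x, 2x, 4x... the base delay before the next attempt
                if attempt + 1 < max_attempts:
                    sleep(base_delay * 2 ** attempt)
        else:
            # Every attempt raised, so record the ID for a later rerun
            failed.append(item_id)
    return results, failed
```

Returning the failed IDs separately makes it possible to rerun only those later instead of restarting all 5000 requests.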

micheldumontier commented 1 year ago

ya, you can find the relationship between genes and proteins from the targets data: http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/23.09/output/etl/json/targets/

there's a field for proteinIds
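A minimal sketch of building a gene-to-protein lookup from that data. It assumes the files in the targets folder are newline-delimited JSON, that each record has an `id` holding the Ensembl gene ID, and that `proteinIds` is a list of objects with an `id` key. Only the `proteinIds` field name comes from this thread, so the rest should be checked against a downloaded file. `gene_to_proteins` is a hypothetical helper name.

```python
import json
from typing import Dict, Iterable, List


def gene_to_proteins(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each target (gene) ID to the protein IDs listed in its record."""
    mapping: Dict[str, List[str]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        target = json.loads(line)
        # 'proteinIds' may be missing or null for some targets
        protein_ids = [p["id"] for p in (target.get("proteinIds") or [])]
        mapping[target["id"]] = protein_ids
    return mapping
```

It can be fed an open file handle directly, for example `gene_to_proteins(open(path))` for each downloaded part file.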

are there alternative APIs you could use? does monarch give sequence data?

vemonet commented 1 year ago

Ok, too bad they did not do their own work themselves

EBI CHEMBL seems quite all over the place, for example the Ensembl ID ENSG00000198838 can be matched to more than 12 different proteins: https://www.ebi.ac.uk/proteins/api/proteins/Ensembl:ENSG00000198838?offset=0&size=100&format=json

All matches have the same "submittedName" for the protein: "Ryanodine receptor 3"

But the sequences are completely different:
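One way to check this on the JSON returned by the URL above is to group the matched entries by their sequence. The sketch below assumes the response is a list of entries, each with an `accession` key and a nested `sequence.sequence` string. That layout is an assumption to verify against the actual response, and `group_by_sequence` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_sequence(entries: List[dict]) -> Dict[str, List[str]]:
    """Group protein accessions by their amino-acid sequence."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        sequence = entry["sequence"]["sequence"]
        groups[sequence].append(entry["accession"])
    return dict(groups)
```

If the returned dictionary has more than one key, the matches for that gene ID do not share a single sequence.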

And no match in the Monarch API: https://api.monarchinitiative.org/api/bioentity/anatomy/ENSEMBL%3AENSG00000198838/genes?rows=100&facet=false&unselect_evidence=false&exclude_automatic_assertions=false&fetch_objects=false&use_compact_associations=false&direct=false&direct_taxon=false

vemonet commented 1 year ago

We have no other choice than to use the mappings published by OpenTargets, because only they know which (protein) target they are talking about when giving a super ambiguous Ensembl ID.

The real question now is: can we trust this dataset now that we have seen how it's been made? I guess that's like Dutch food, "yes but don't expect it to be good quality"