Open vemonet opened 1 year ago
You can get SMILES from MolePro as well. Depending on your input types (IDs or chemical names), you can use some of these endpoints: https://molepro.broadinstitute.org/molecular_data_provider/assets/lib/swagger-ui/index.html?url=/molecular_data_provider/assets/openapi.json With a POST query to /compound/by_id, you'll get the following JSON. We have put in place a curated way to select the best structures given chemical names, where some entries have been curated already (the by-name endpoint is still a work in progress and curation is ongoing, but it works pretty well).
Thanks a lot @sandrine-muller-research! Just the CHEMBL ID is quite limited, so I am interested in anything that will cover a wider range of IDs. And MolePro seems to have a really nice API.
But I lack knowledge of the SMILES system, maybe you can enlighten me!
For some compounds the MolePro API is returning multiple elements, e.g. for CHEMBL.COMPOUND:CHEMBL535
we get 2 elements:
CCN(CC)CCNC(=O)C1=C(C)NC(\\C=C2/C(=O)NC3=C2C=C(F)C=C3)=C1C
CCN(CC)CCNC(=O)C1=C(NC(=C1C)/C=C\\2/C3=C(C=CC(=C3)F)NC2=O)C
When I use the EBI API I get 1 "canonical_smiles" for CHEMBL535: CCN(CC)CCNC(=O)c1c(C)[nH]c(/C=C2\\C(=O)Nc3ccc(F)cc32)c1C
Are canonical SMILES different from "regular" SMILES? Can I easily generate a compound's "canonical SMILES" from the SMILES of its elements?
According to chatty jeepity it should be as simple as this:
from rdkit import Chem

# SMILES representations of the elements
smiles_carbon = 'C'
smiles_hydrogen = 'H'
smiles_oxygen = 'O'

# Combine the SMILES of elements to create a chemical compound
compound_smiles = f'{smiles_carbon}{smiles_hydrogen*4}{smiles_oxygen*2}'

# Generate the canonical SMILES
compound_molecule = Chem.MolFromSmiles(compound_smiles)
if compound_molecule:
    canonical_smiles = Chem.MolToSmiles(compound_molecule, isomericSmiles=False)
    print(f'Canonical SMILES of the compound: {canonical_smiles}')
else:
    print('Invalid SMILES for the compound')
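A note on the snippet above: it concatenates chemical element symbols, which does not give a parseable SMILES (a bare H is not a valid SMILES atom), whereas the "elements" MolePro returns are alternative SMILES strings for the same compound. A canonical SMILES is simply the one string a given toolkit always writes for a structure, so different toolkits can produce different canonical strings. Below is a minimal sketch, assuming RDKit is installed, of how the two MolePro strings could be compared by re-serializing each with RDKit. The helper names `to_canonical` and `same_structure` are hypothetical, and the doubled backslashes in the JSON output are assumed to stand for single backslashes.

```python
from typing import Iterable

# The two strings MolePro returns for CHEMBL.COMPOUND:CHEMBL535.
# Assumption: the "\\" shown in the JSON output is an escaped single "\".
MOLEPRO_SMILES = [
    r"CCN(CC)CCNC(=O)C1=C(C)NC(\C=C2/C(=O)NC3=C2C=C(F)C=C3)=C1C",
    r"CCN(CC)CCNC(=O)C1=C(NC(=C1C)/C=C\2/C3=C(C=CC(=C3)F)NC2=O)C",
]


def to_canonical(smiles: str) -> str:
    """Return RDKit's canonical SMILES for `smiles` (hypothetical helper)."""
    from rdkit import Chem  # imported lazily; requires RDKit to be installed

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


def same_structure(smiles_list: Iterable[str]) -> bool:
    """True if every SMILES in the list canonicalizes to the same string."""
    return len({to_canonical(s) for s in smiles_list}) == 1
```

Calling `same_structure(MOLEPRO_SMILES)` would then tell whether RDKit considers the two MolePro entries the same molecule. The RDKit string is not expected to match the ChEMBL `canonical_smiles` character for character, since each toolkit has its own canonical ordering.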
One of the problems we faced: OpenTargets uses Ensembl gene IDs instead of directly using protein IDs (most of the interactions they describe are between drugs and proteins, not drugs and genes).
But a gene can code for many proteins, so the interactions shared by OpenTargets are very unclear and need to be fixed manually. Why could they not directly use protein IDs? That's a big question...
Also, the following APIs do not allow us to send bulk requests to find sequences: PubChem, ChEMBL, Ensembl.
So we need to send around 5000 requests to get sequences for all our drugs/proteins, which is quite intensive for their APIs, and a lot of those requests fail. It would have been so easy for them to implement bulk calls, but that would have reduced the number of queries made to their service, which is probably the number they report to get funding (so they want it to be high, even if it means making their service worse).
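For the one-request-per-ID situation described above, here is a minimal sketch of a sequential fetch loop that pauses between calls and retries failures with exponential backoff, using only the standard library. The function names and default values are illustrative assumptions, not something taken from any of these services' documentation.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def http_get(url: str, timeout: float = 30.0) -> str:
    """Fetch `url` and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str] = http_get,
    retries: int = 3,
    pause: float = 0.2,
    backoff: float = 1.0,
) -> Dict[str, Optional[str]]:
    """Fetch each URL in turn, retrying failed requests.

    URLs that still fail after `retries` attempts map to None, so the
    caller can re-run only the missing ones later.
    """
    results: Dict[str, Optional[str]] = {}
    for url in urls:
        results[url] = None
        for attempt in range(retries):
            try:
                results[url] = fetch(url)
                break
            except OSError:
                # URLError, HTTPError and socket timeouts are all OSError
                # subclasses. Wait backoff * 1, 2, 4, ... seconds.
                time.sleep(backoff * 2 ** attempt)
        time.sleep(pause)  # short pause between URLs to spare the service
    return results
```

Passing `fetch` as a parameter keeps the retry logic independent of the HTTP client, so it can be tried out with a fake function before pointing it at a real API.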
Not really optimal
ya, you can find the relationship between genes and protein from the targets data. http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/23.09/output/etl/json/targets/
there's a field for proteinIds
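A minimal sketch of how that field could be turned into a gene-to-protein lookup. It assumes the files in the targets directory are JSON Lines (one target object per line), that `id` holds the Ensembl gene ID, and that `proteinIds` is a list of objects with `id` and `source` keys where `uniprot_swissprot` marks reviewed UniProt entries. Those field details are assumptions to check against the actual files, and `gene_to_proteins` is a hypothetical helper name.

```python
import json
from typing import Dict, Iterable, List, Sequence


def gene_to_proteins(
    lines: Iterable[str],
    sources: Sequence[str] = ("uniprot_swissprot",),
) -> Dict[str, List[str]]:
    """Map each Ensembl gene ID to the protein IDs from the chosen sources."""
    mapping: Dict[str, List[str]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        target = json.loads(line)
        # `proteinIds` may be missing or null for some targets.
        protein_ids = target.get("proteinIds") or []
        mapping[target["id"]] = [
            p["id"] for p in protein_ids if p.get("source") in sources
        ]
    return mapping
```

An open file handle for one of the downloaded part files can be passed directly as `lines`. Restricting `sources` to reviewed entries is one way to narrow an ambiguous gene down to fewer proteins, at the cost of dropping unreviewed ones.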
are there alternative APIs you could use? does monarch give sequence data?
Ok, too bad they did not do their own work themselves
EBI ChEMBL seems quite all over the place. For example, the Ensembl ID ENSG00000198838 can be matched to more than 12 different proteins: https://www.ebi.ac.uk/proteins/api/proteins/Ensembl:ENSG00000198838?offset=0&size=100&format=json
All matches have the same "submittedName" for the protein: "Ryanodine receptor 3"
But the sequences are completely different:
XEDEIQFLRTYIPPDLCVCNFVLEQSLSVRALQEMLANTGENGGEG
XLEIAGEEEEDGSLEPASAFAMACASVKRNVTDFLKRATLKNLRKQYRNVKKMTAKELVKVLFSFFWMLFVGLFQLLFTILGGIFQILWSTVFGGGLVEGAKNIRVTKILGDMPDPTQFGIHDDTMEAERAEVMEPGITTELVHFIKGEKGDTDIMSDLFGLHPKKEGSLKHGPEVGLGDLSEIIGKDEPPTLESTVQKKRKAQAAEMKAANEAEGKVESEKADMEDGEKEDKDKEEEQAEYLWTEVTKKKKRRCGQKVEKPEAFTANFFKGLEIYQTKLLPGH
XGRCAPEMHLIQTGKGEAIRIRSILRSLVPTEDLVGIISIPLKLPSLNKDGSVSEPDMAANFCPDHKAPMVLFLDRVYGIKDQTFLLHLLEVGFLPDLRASASLDTVSLSTTEAALALNRYICSAVLPLLTRCAPLFAGTEHCTSLIDSTLQTIYRLSKGRSLTKAQRDTIEECLLAICNHLRPSMLQQLLRRLVFDVPQLNEYCKMPLKLLTNHYEQCWKYYCLPSGWGSYGLAVEEELHLTEKLFWGIFDSLSHKKYDPDLFRMALPCLSAIAGALPPDYLDTRITATLEKQISVDADGNFDPKPINTMNFSLPEKLEYIVTKYAEHSHDKWACDKSQSGWKYGISLDENVKTHPLIRPFKTLTEKEKEIYRWPARESLKTMLAVGWTVERTKEGEALVQQRENEKLRSVSQANQGNSYSPAPLDLSNVVLSRELQGMVEVVAENYHNIWAKKKKLELESKGGGSHPLLVPYDTLTAKEKFKDREKAQDLFKFLQVNGIIVSRGMKDMELDASSMEKRFAYKFLKKILKYVDSAQEFIAHLEAIVSSGKTEKSPRDQEIKFFAKVLLPLVDQYFTSHCLYFLSSPLKPLSSSGYASHKEKEMVAGLFCKLAALVRHRISLFGSDSTTMVSCLHILAQTLDTRTVMKSGSELVKAGLRAFFENAAEDLEKTSENLKLGKFTHSRTQIKGVSQNINYTTVALLPILTSIFEHVTQHQFGMDLLLGDVQISCYHILCSLYSLGTGKNIYVERQRPALGECLASLAAAIPVAFLEPTLNRYNPLSVFNTKTPRERSILGMPDTVEDMCPDIPQLEGLMKEINDLAESGARYTEMPHVIEVILPMLCNYLSYWWERGPENLPPSTGPCCTKVTSEHLSLILGNILKIINNNLGIDEASWMKRIAVYAQPIISKARPDLLRSHFIPTLEKLKKKAVKTVQEEEQLKADGKGDTQEAELLILDEFAVLCRDLYAFYPMLIRYVDNNRSNWLKSPDADSDQLFRMVAEVFILWCKSHNFKREEQNFVIQNEINNLAFLTGDSKSKMSKAMQVKVQVKCMTCLFCPSIRGAGLWPPLHCDHHGGGREWIFPPGGPPGLLQGRQLPVKE
And no match in the Monarch API: https://api.monarchinitiative.org/api/bioentity/anatomy/ENSEMBL%3AENSG00000198838/genes?rows=100&facet=false&unselect_evidence=false&exclude_automatic_assertions=false&fetch_objects=false&use_compact_associations=false&direct=false&direct_taxon=false
We have no other choice than to use the mappings published by OpenTargets, because only they know which (protein) target they are talking about when giving a super ambiguous Ensembl ID.
The real question now is: can we trust this dataset now that we have seen how it's been made? I guess that's like Dutch food, "yes but don't expect it to be good quality"
Get SMILES for a PubChem Compound (here for aspirin, CID 2244):
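A minimal sketch of one way to do this with the PubChem PUG REST property endpoint and the standard library. The property name `CanonicalSMILES` is an assumption (PubChem has renamed its SMILES properties over time), so the parser below takes whatever single non-CID field comes back. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Dict, Iterable

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def smiles_url(cids: Iterable[int], prop: str = "CanonicalSMILES") -> str:
    """Build the property URL; PUG REST accepts comma-separated CIDs."""
    cid_list = ",".join(str(cid) for cid in cids)
    return f"{PUG_REST}/compound/cid/{cid_list}/property/{prop}/JSON"


def parse_smiles(payload: dict) -> Dict[int, str]:
    """Extract {CID: SMILES} from a PropertyTable response."""
    result: Dict[int, str] = {}
    for row in payload["PropertyTable"]["Properties"]:
        for key, value in row.items():
            if key != "CID":
                # Keep the first non-CID field, whatever its name is.
                result[row["CID"]] = value
                break
    return result


def get_smiles(cids: Iterable[int]) -> Dict[int, str]:
    """Query PubChem and return {CID: SMILES} (performs a network call)."""
    with urllib.request.urlopen(smiles_url(cids), timeout=30) as response:
        return parse_smiles(json.load(response))
```

For aspirin this would be called as `get_smiles([2244])`. Since the CID list is comma-separated, several compounds can be requested in one call.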
Get AA sequence for a protein (check the sequence key):
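A minimal sketch of one way to do this, assuming the EBI Proteins API used earlier in the thread, queried by UniProt accession. It assumes the JSON entry carries a `sequence` object whose own `sequence` field holds the amino acid string, which should be verified against a real response. The function names are hypothetical.

```python
import json
import urllib.request

PROTEINS_API = "https://www.ebi.ac.uk/proteins/api/proteins"


def protein_url(accession: str) -> str:
    """Build the entry URL for a UniProt accession."""
    return f"{PROTEINS_API}/{accession}"


def extract_sequence(entry: dict) -> str:
    """Return the amino acid string from a protein entry.

    Assumption: the entry looks like {"sequence": {"sequence": "MK..."}}.
    """
    return entry["sequence"]["sequence"]


def get_sequence(accession: str) -> str:
    """Fetch one protein entry as JSON and return its sequence."""
    request = urllib.request.Request(
        protein_url(accession),
        headers={"Accept": "application/json"},  # ask for JSON output
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_sequence(json.load(response))
```

Keeping `extract_sequence` separate from the network call makes it easy to check the key path against a saved response before running thousands of requests.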