Triage of the Generative Suggestions

mattodd commented 2 years ago

Relevant to other issues (e.g. #15) but a specific question related to the large numbers of molecules arising from the @kipUNC generative methods. How do we triage suggestions?

1) Kostya mentioned that filters can be trivially applied. It seems to me that we want to be able to move beyond fragments, and still have decent solubility for soaking experiments if needed. So shall we apply a filter requiring the molecules to be at least 3 heavy atoms heavier than the original fragments? Roughly what range does that give, if we say that the MW is 40-60 higher than that of the compounds that were found to bind? And let's say logP under 3?

2) Kostya thinks that the suggestions are more likely to be purchasable if we focus on the structures that are common to all output libraries. He suggests looking for 10 random molecules (after we've triaged as above) that are not common to all libraries, but which are unique to e.g. the ChEMBL-trained library.

3) Once picked, let's have them all docked for reasonableness.
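
A minimal sketch of step 2, assuming each output library is available as a collection of canonical SMILES strings. The function names, the seed, and the toy inputs are illustrative assumptions, not part of Kostya's pipeline. It collects the structures found in only one library and draws up to 10 of them at random:

```python
import random


def unique_to(target, others):
    """Return the structures present in `target` but in none of `others`."""
    unique = set(target)
    for library in others:
        unique -= set(library)
    return unique


def pick_random(candidates, k=10, seed=0):
    """Draw up to k candidates reproducibly (sorted first so the seed is meaningful)."""
    pool = sorted(candidates)
    return random.Random(seed).sample(pool, min(k, len(pool)))


# Toy placeholders standing in for canonical SMILES from three libraries.
library_a = {"mol1", "mol2", "mol3", "mol4"}
library_c = {"mol2", "mol5"}
library_e = {"mol3", "mol6"}

only_in_a = unique_to(library_a, [library_c, library_e])
picks = pick_random(only_in_a, k=10)
```

The set comparison only works if every library writes the same molecule as the same string, so the SMILES would need to be canonicalised by one toolkit beforehand.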

If we agree (?), we can ask Kostya to apply the filter and come up with new suggestions.

drc007 commented 2 years ago

@mattodd @kipUNC How many is a "large number of molecules"?

I can search for commercial availability if less than 100K molecules.

TomkUCL commented 2 years ago

@mattodd @kipUNC The average molecular weight of the initial fragments from issue #2 (pictured below) is 209, so I have set the filters for libraries A, C, and E to MW = 249 to 269 (i.e. 40 to 60 above the average) and logP less than 3. DataWarrior allows these filters to be set as sliding scales, which can be changed as required.

I used the DataWarrior sliding scales to select the molecular weight and logP ranges described above via the following path: Chemistry > From Chemical Structure > Calculate Properties... > select 'Total average molweight in g/mol' and 'cLogP', then simply type in the selected range values for each scale.
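
For anyone who wants to reproduce the same cut outside DataWarrior, here is a minimal sketch of the filter logic. It assumes the molecular weight and cLogP have already been calculated and exported. The `Compound` class, the function names, and the example records are hypothetical; only the numbers 209, 40-60 and 3 come from this thread.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    smiles: str
    mol_weight: float  # g/mol, calculated elsewhere (e.g. exported from DataWarrior)
    clogp: float       # calculated elsewhere


def mw_window(fragment_avg_mw, low_offset=40.0, high_offset=60.0):
    """MW range sitting 40-60 above the average fragment MW."""
    return fragment_avg_mw + low_offset, fragment_avg_mw + high_offset


def triage(compounds, fragment_avg_mw=209.0, max_clogp=3.0):
    """Keep compounds inside the MW window with cLogP below the cutoff."""
    low, high = mw_window(fragment_avg_mw)
    return [c for c in compounds
            if low <= c.mol_weight <= high and c.clogp < max_clogp]


# Hypothetical records; the property values are made up for illustration.
candidates = [
    Compound("hypothetical-1", 255.3, 2.1),  # passes both filters
    Compound("hypothetical-2", 240.0, 1.5),  # too light
    Compound("hypothetical-3", 260.0, 3.4),  # cLogP too high
]
kept = triage(candidates)
```

With the average fragment MW of 209, `mw_window` gives the 249 to 269 range used above.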

Library_A_hits triaged.zip Library_C_hits triaged.zip Library_E_hits triaged.zip

Library A (pre-trained on ChEMBL; hits from filtered, unbiased model using pharmacophore scoring function): 99 compounds.
Library C (pre-trained on ChEMBL; Library B used to fine-tune the generative model): 182 compounds.
Library E (pre-trained on ChEMBL; reinforcement learning using pharmacophore scoring function): 74 compounds.

mattodd commented 2 years ago

Great @TomkUCL! Yes, @drc007 we are trying to separate these compounds into things that can be trivially bought (i.e. <£100 for 5 mg, ballpark) vs those that would need to be made in a lab by a talented human. If you were able to do that subdivision, that'd be really useful.

TomkUCL commented 2 years ago

@drc007 @edwintse Do you have any recommendations regarding how best to search these lists for commercial availability? I see that DataWarrior has a built-in Search Enamine Building Blocks function, but I'm not sure whether this is suitable for this many compounds. Thanks!

drc007 commented 2 years ago

@TomkUCL If you send me a list of structures (as sdf or SMILES) I can search them for you.

Cheers,

Chris


TomkUCL commented 2 years ago

Thanks Chris, here are the SMILES lists:

Libraries A C E triaged hits SMILES.xlsx

drc007 commented 2 years ago

@TomkUCL Annotated with vendor info. LibraryEvendors.csv LibraryCvendors.csv LibraryAvendors.csv

TomkUCL commented 2 years ago

Is row 1 all suppliers?

drc007 commented 2 years ago

@TomkUCL Row 1 lists all the suppliers that claim to have at least 1 of the compounds in stock. The ZINC ID can also be used to search "make to order" on the ZINC website https://zinc.docking.org/substances/home/

TomkUCL commented 2 years ago

@drc007 Sorry, I'm not familiar with all of these vendors. I have assumed that those SMILES with no vendor codes beyond column B (InChIKey) are commercially unavailable, and I have highlighted them so that the list looks something like this...

Is this correct?

drc007 commented 2 years ago

@TomkUCL I've only included vendor codes for those compounds that are "in stock". For some compounds with no vendor codes there may be a ZINC ID or PubChem ID that can be used to identify sources that can make a compound if ordered. The vendors are all named and detailed here https://zinc.docking.org/catalogs/

Vendors like eMolecules, Mcule and Molport aggregate multiple catalogues and act as an intermediary. If you focus on these vendors it often saves time.
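
To make the "in stock" reading of these files explicit, here is a rough sketch in Python. The column layout is an assumption based on the description in this thread (SMILES in column A, InChIKey in column B, one vendor column after that, supplier names in row 1); the sample rows and vendor codes are invented and the real CSV files may differ.

```python
import csv
import io


def split_by_stock(csv_text):
    """Split SMILES into (in_stock, not_in_stock) by presence of any vendor code."""
    in_stock, not_in_stock = [], []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # row 1 is assumed to hold the supplier names
    for row in reader:
        if not row:
            continue
        smiles = row[0]
        # Anything non-empty beyond column B counts as an "in stock" vendor code.
        vendor_codes = [cell.strip() for cell in row[2:] if cell.strip()]
        (in_stock if vendor_codes else not_in_stock).append(smiles)
    return in_stock, not_in_stock


# Invented example rows; not taken from the real vendor files.
sample = (
    "SMILES,InChIKey,Vendor1,Vendor2\n"
    "CCO,KEY-A,V1-001,\n"
    "CCN,KEY-B,,\n"
    "CCC,KEY-C,,V2-777\n"
)
available, unavailable = split_by_stock(sample)
```

Compounds in the second list are not necessarily unobtainable: as noted above, a ZINC or PubChem ID may still point to a make-to-order source.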

TomkUCL commented 2 years ago

@drc007 Brilliant, thanks!

TomkUCL commented 2 years ago

I have run a quick Mcule search of the 99 SMILES from LibraryAvendors, which threw back a couple of results at $175 + $179 shipping...

TomkUCL commented 2 years ago

@kipUNC Here is the first SMILES list of compounds from library A (C and E to follow here shortly). These are all ChEMBL-trained libraries. Library A chemspace-search.xlsx Note that the SMILES do not seem to specify alkene geometry, so both geometric isomers (E/Z) are covered by each SMILES.

I generated these lists by loading the LibraryAvendors.csv, LibraryCvendors.csv, and LibraryEvendors.csv files from Chris (described above) into DataWarrior as a similarity chart (Library A shown below as an example):

I then selected 30 compounds from LibraryAvendors.csv, choosing both compounds from clusters (those with similar structures) and some outliers to try to maximise the chemical space represented by these fragments. Finally, I ran these SMILES through a commercial search on Chemspace to generate the final list of compounds for docking given below. Those in green are listed as cheap (i.e. <$100 per 1 mg), so you can exclude those if you wish.

Does this work ok with you? If so then I will also upload the same lists for Libraries C and E here shortly.

TomkUCL commented 2 years ago

UPDATE:

Ok, here are the SMILES @kipUNC. I took the Library A, C and E triaged DataWarrior files above, selected 30 structures from each whilst trying to maximise the chemical space explored, and then submitted those through ChemSpace. Some are commercially available, and these are described in the document below.

Triaged commercial vendors similarity chart.docx

mattodd commented 2 years ago

Great work @TomkUCL! If @kipUNC can dock the lot then we can select 10 from each library to make, and maybe some to buy.

kipUNC commented 2 years ago

Thank you! I will provide the scores by Tuesday.

TomkUCL commented 2 years ago

@kipUNC @H-agha What is the PDB code of the apo state protein that you are docking to? I will try to get some scores for these fragments in ICM pro as well.

H-agha commented 2 years ago

I used the PDB for fragment 420 (PDB ID: 5RM6). However, after what we learned from Jo about the electron density (ED) of the soaked crystals, I am not sure whether it would be better to use other crystal structures, such as PDB 5RMJ for fragment 645 or PDB 5RMG for fragment 524. Both of these fragments showed a better ED map.

mattodd commented 2 years ago

Hi @H-agha so wait, you mean for the docking of the potentials (above), or do you mean a more fundamental re-assessment of the pharmacophore/design? Please can you elaborate?

kipUNC commented 2 years ago

I do not think we should worry about that. 420 is still good for docking, at least as good as any other. The electron densities of the ligands are not a concern here.

H-agha commented 2 years ago

Hi @mattodd. I mean for the docking: do we need to use the crystal structures of the fragments that showed better electron density, or will all the crystal structures behave the same so that it makes no difference? @kipUNC, could you please comment on this?

For the pharmacophore, I treated all the fragments equally so that I could extract the important residues within 5 Å that showed interactions with the fragments. In the attached video you can see the alignment of all the fragments and the residues within 5 Å. One important observation is that the side chain of R502 showed some flexibility (a different orientation) with different fragments. All the other residues have almost the same orientation.

H-agha commented 2 years ago

Site 3- Fragm _ residues alignment.zip (shared via OneDrive): https://adminliveunc-my.sharepoint.com/personal/hebaa_ad_unc_edu/Documents/Attachments/Site%203-%20Fragm%20_%20residues%20alignment.zip

I am not able to attach the video on GitHub. So, I am sending it here.


kipUNC commented 2 years ago

It is always a trade-off, so we either dock into all structures or use one and keep the others in mind when we do the final hit evaluation. The electron densities of the fragments do not change anything about binding site selection at this point.

mattodd commented 2 years ago

Okeydoke @TomkUCL - @kipUNC has done some Glide magic and come up with the top 30 per library in these two sheets (one is the code cross-reference, the other has the scores). I think the two questions are 1) Which library is seemingly performing better? (maybe colour code the sheet?) 2) How many of the top 30 are commercial vs need to be made? In an ideal world we'd have a mix of things we can order and things we can make. Can you make a ChemDraw sheet that shows that - maybe start with the top 30 and see what proportion needs to be made/bought? Again, maybe colour code somehow and indicate ones that can be bought with a clearly highlighted price or supplier?

TomkUCL commented 2 years ago

In terms of which library is the best scoring, it doesn't look clear cut to me. In the top 30 scoring compounds we have:

Library A = 11 fragments
Library C = 8 fragments
Library E = 11 fragments
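
A quick way to redo this tally whenever the ranked list changes is sketched below. It assumes each compound code ends with its library letter, as in codes like 15E or 3A used elsewhere in this thread; the function name and the short example list are illustrative only.

```python
from collections import Counter


def tally_by_library(codes):
    """Count compound codes per library, taking the trailing letter as the library."""
    return Counter(code.strip()[-1].upper() for code in codes if code.strip())


# Short illustrative list, not the full top 30.
counts = tally_by_library(["15E", "3A", "2A", "4A"])
```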

Kosta.dock.list.1.xlsx

A quick ChemSpace search is summarised here, I will summarise this in a ChemDraw for tomorrow's meeting. chemspace-search-20220303191749.xlsx

H-agha commented 2 years ago

Attached is the comparison of the docking scores for 3 different trials. Library A-C-E docking scores- comparison.xlsx

TomkUCL commented 2 years ago

@mattodd Here is a list of commercial and non-commercial fragments and some early thoughts on synthetic steps. Sorry took longer than expected as a lot of these were showing up through manual searches but not through Chemspace. I will continue to add to the synthetic routes for non-commercial fragments.

Kosta docked top 30 fragments.docx

mattodd commented 2 years ago

Really good analysis @TomkUCL. How about we work our way through it on Friday to ID the compounds that we should obviously order, those we should obviously make, and then the ones in the grey area. I guess the grey area ones are those where we might reconsider if the modeling is not strongly supportive.

TomkUCL commented 2 years ago

No problem, happy to go through these tomorrow. I was hoping to have these docked today in ICM to back up some of Kosta and Heba's scores, but I am currently still waiting on the license renewal. I have been trying to install AutoDock Vina as a backup, but there seem to be issues installing it on Windows, so I'll likely have to wait until ICM is back up and running.

TomkUCL commented 2 years ago

Starting materials for 15E, 3A, 2A, and 4A have been ordered to arrive this week. According to the literature, the planned synthesis is 1 step, approximately 15 mins reaction time, so hopefully, I should be able to get these 4 compounds made by the following week.

FYI There is a range of benzaldehydes that we can swap in here (https://www.dougdiscovery.com/catalogsearch/result/?q=3-formylbenzonitrile) if the originals hit in the SPR assay and any further analogues score better in docking.

I will compile a list of compounds that will need to be bought shortly.

TomkUCL commented 2 years ago

Hi all,

I am currently trying to get some further docking scores in AutoDock Vina using Chimera for the top 30 fragment hits from Kosta's modelling, to compare with the Glide scores that you have generated so far. As a side benefit, this also gives us a chance to compare the accuracy of commercial vs open-source docking programs, something that has been discussed frequently on our end recently. For further reading see below:

1) Phys. Chem. Chem. Phys., 2016, 18, 12964-12975. https://doi.org/10.1039/C6CP01555G

2) Pagadala, N.S., Syed, K. & Tuszynski, J. Software for molecular docking: a review. Biophys Rev 9, 91–102 (2017). https://doi.org/10.1007/s12551-016-0247-1

3) Fan, J., Fu, A. & Zhang, L. Progress in molecular docking. Quant Biol 7, 83–89 (2019). https://doi.org/10.1007/s40484-019-0172-y

This is a slow process as I am a first-time docker and self-teaching, so please bear with me.

I have started with fragment 15E (E-isomer only) from the list above, as this is one of the fragments that I should be able to synthesise very quickly once the starting materials arrive. That way we can (hopefully) quickly get SPR measurements to compare against these docking scores.

I used PyMol to prep the protein for docking as shown below, removing the original bound fragment from 5RM6 as a separate PDB file. I have highlighted (yellow) the residues discussed in the original pharmacophore that was generated (see https://github.com/StructuralGenomicsConsortium/CNP4-Nsp13-C-terminus-B/issues/2#issue-1027586728).

@mattodd I assume I am correct in thinking that site 3 (yellow) is away from the ligand (orange) binding site in PDB 5RM6 (pictured below)?

For docking, I generated a small-ish grid volume as shown below, with fragment 15E shown in the top right:

I then ran this in Vina. This gave a minimised Vina score of -5.9, compared to @kipUNC's Glide score of -4.978 and @H-agha's scores of -4.149 (small grid) and -4.681 (larger grid).

Next, I re-ran the calculations for the same ligand (15E), except this time using a slightly larger grid volume, which gave some slightly lower scores (down to -6.9), albeit with the ligand now in a translated position.

@H-agha I see 15E ranked 60th and 31st (in terms of Glide score) in your old grid and bigger grid, respectively.

@kipUNC @H-agha I suppose it is hard to compare scores if the grid volumes are all different, so my question is:

what grid size is large enough to encompass the entire Site 3 pocket?

I am currently trying to turn this docking session into a hydrophobicity map in Chimera to see if 15E is actually sitting inside the pocket, which might help to answer this. Sadly, it gets angry at me when trying this, but hopefully, I'll have this figured out soon. I think the key is to transfer the docked ligand file back into PyMol and do it in there, as shown below for ligand 2A, which so far has the lowest Vina score (-7.0) and appears to sit nicely into the binding pocket, but I am yet to check the interactions with the residues.

If you have any feedback then please let me know as I'm still learning!

If you are happy with this approach then I will repeat the Vina docking for the other 29 ligands in the list, using both a larger and a smaller grid, to compare with the Glide scores.

Cheers.

H-agha commented 2 years ago

Hi @TomkUCL, good try. I am sorry, I am not familiar with AutoDock Vina; I have only used Schrodinger applications, so I cannot help with this software.

However, I have one comment. The score of 15E using the bigger grid (-6.9) is more negative, and therefore better, than the score using the smaller grid (-5.9). A more negative score corresponds to stronger predicted binding, and a less negative or even positive score corresponds to weak or non-existent binding.

The smaller grid I used in Glide is the default: the software defines it based on the selected ligand. The bigger grid I used was a suggestion from Kostya (@kipUNC), with dimensions X: 15, Y: 15, Z: 19 Å.
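
To keep the sign convention straight when comparing runs, a small sketch follows. The helper names are illustrative; the numbers are the 15E scores quoted in this thread. Scores are compared only within one program, on the assumption that Vina and Glide scores are not directly interchangeable because they come from different scoring functions.

```python
def best_pose(scores):
    """Return the (label, score) pair with the most negative score."""
    return min(scores.items(), key=lambda item: item[1])


def rank(scores):
    """Labels ordered from best (most negative) to worst."""
    return sorted(scores, key=scores.get)


# 15E scores quoted in this thread, grouped by docking program.
vina_15e = {"smaller grid": -5.9, "larger grid": -6.9}
glide_15e = {"small grid": -4.149, "larger grid": -4.681}

best_vina = best_pose(vina_15e)
glide_order = rank(glide_15e)
```

In both programs the larger grid gives the more negative, i.e. better, score for 15E.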

TomkUCL commented 2 years ago

Thanks @H-agha . I am comfortable with the docking procedure now I think.

So far I have been using a 20x20x20 grid centred around the residues specified in the pharmacophore. Maybe for conformity I will dock on Vina using the same grid dimensions so that we can get a direct comparison.

Can I ask which coordinates your grid is centred at?

H-agha commented 2 years ago

I am not sure what you mean by "which coordinates your grid is centred at". With the default settings, Glide centres the grid on the ligand when we pick its atoms, and the software defines the dimensions. The bigger one (15x15x19) was a suggestion from Kostya; I think he chose these numbers so that the grid can accommodate a ligand bigger than the fragments and include any possible new residues in the binding site. Kostya (@kipUNC), please correct me if I am wrong.
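
One way to put both programs on the same box is to define it explicitly as a centre plus dimensions. The sketch below takes the centroid of a set of atom coordinates (for example the bound fragment), checks whether a box of a given size encloses a set of pocket atoms, and writes the box in the `center_x`/`size_x` form used by AutoDock Vina configuration files. The coordinates are placeholders rather than values from 5RM6, and the helper names are hypothetical.

```python
from statistics import fmean


def centroid(coords):
    """Geometric centre of a list of (x, y, z) coordinates."""
    xs, ys, zs = zip(*coords)
    return (fmean(xs), fmean(ys), fmean(zs))


def encloses(center, size, coords):
    """True if every point lies inside the box of `size` centred at `center`."""
    return all(
        abs(point[i] - center[i]) <= size[i] / 2
        for point in coords
        for i in range(3)
    )


def vina_box_lines(center, size):
    """Format the box as AutoDock Vina config lines (values in angstrom)."""
    lines = [f"center_{axis} = {value:.3f}" for axis, value in zip("xyz", center)]
    lines += [f"size_{axis} = {value:.1f}" for axis, value in zip("xyz", size)]
    return "\n".join(lines)


# Placeholder coordinates, not taken from the crystal structure.
ligand_atoms = [(1.0, 2.0, 3.0), (5.0, 6.0, 7.0)]
box_center = centroid(ligand_atoms)
fits = encloses(box_center, (15, 15, 19), ligand_atoms)
config_text = vina_box_lines(box_center, (15, 15, 19))
```

Running `encloses` on the Site 3 residue atoms with the 15x15x19 and 20x20x20 boxes would give a direct answer to whether each box covers the whole pocket.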

H-agha commented 2 years ago

Hi all, Looking at the binding modes of the best compounds and the nearby residues in the active site of site #3, I have suggested some modifications to these compounds that could add more interactions with other residues in the binding site. 17 compounds showed better docking scores and new interactions. FYI, I still need to check the synthetic feasibility of these modifications. Here are the slides. Modified Structures -Library A, C, E_HA.pptx

kipUNC commented 2 years ago

Colleagues, please take a look at top 100 compounds from screening Enamine 40B library. Please double check for duplicates. nsp13_40bil_top_100.txt

kipUNC commented 2 years ago

Ok here are two DataWarrior files:

clustering_struct_score_1000.dwar -- file in which I clustered the top 1000 compounds, including multiple scores (multiple poses) for the same compounds. This can give us an idea of whether some compounds have multiple poses that score well, which may be an indication of the "goodness" of a compound. Clusters/compounds are colored by score, so we want more green.
cluster_centroids_100.dwar -- top 100 unique compounds with the scores.
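
The relationship between the two files can be sketched as follows: collapse the pose-level records to one best score per compound, keep a count of how many poses each compound contributed, and take the top N unique compounds. This is only an illustration of the idea under those assumptions, not the procedure actually used to build the .dwar files; the IDs and scores are made up.

```python
from collections import defaultdict


def best_per_compound(poses):
    """Collapse (compound_id, score) pose records to (best_score, n_poses) per compound."""
    grouped = defaultdict(list)
    for compound_id, score in poses:
        grouped[compound_id].append(score)
    return {cid: (min(scores), len(scores)) for cid, scores in grouped.items()}


def top_unique(poses, n=100):
    """Top-n unique compounds, most negative best score first."""
    summary = best_per_compound(poses)
    ranked = sorted(summary.items(), key=lambda item: item[1][0])
    return ranked[:n]


# Hypothetical pose records (IDs and scores are made up).
poses = [("cmpd-1", -7.2), ("cmpd-2", -8.1), ("cmpd-1", -7.9), ("cmpd-3", -6.5)]
top = top_unique(poses, n=2)
```

Grouping by compound ID also flags duplicates: any compound with more than one pose shows up with a count above 1.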

Ok doesn't look like I can attach dwar format here. e-mailing...

H-agha commented 2 years ago

Thanks, Kostya!


TomkUCL commented 2 years ago

Thanks @kipUNC - will look over this before Monday's meeting!

StructuralGenomicsConsortium / CNP4-Nsp13-C-terminus-B

Triage of the Generative Suggestions #17