Analysis of Generative Modelling Hit Libraries

TomkUCL commented 2 years ago

This issue contains the lists of small molecules developed by Konstantin at UNC using generative modelling, along with PowerPoint slides of the methods used to generate the lists. By trying several different generative models, we hope to identify molecules common to several of these lists, which may indicate promising molecules to be made in the medicinal chemistry laboratory at UCL.

TomkUCL commented 2 years ago

The initial five generative libraries from @kipUNC are as follows: nsp13_generative_1 .pdf

Library A: Pre-trained on ChEMBL. Hits from a filtered, unbiased model using a pharmacophore scoring function. ~15k structures. Library_A_hits.csv

Library B: Virtual screening hits from Enamine library (20M molecules). ~15k structures. Library_B_hits.csv

Library C: Pre-trained on ChEMBL. Used Library B to fine-tune the generative model, then filtered the generated compounds for hits. ~15k structures. Library_C_hits.csv

Library D: Pre-trained on Enamine library (20M molecules). Hits from a filtered, unbiased model using a pharmacophore scoring function. ~1.5k structures.
Library_D_hits.csv

Library E: Pre-trained on ChEMBL. Reinforcement learning using a pharmacophore scoring function. ~15k structures. Library_E_hits.csv

We can visualize these structures automatically from their SMILES strings; DataWarrior may be useful here (https://openmolecules.org/datawarrior/index.html).

Next, we hope to identify any common motifs within these five libraries. Clustering these molecules by structural similarity in DataWarrior will hopefully allow us to reduce the number of molecules to reasonably sized lists. We can then determine any structures/motifs common to each list, then look at docking and scoring these manually, before synthesizing any promising candidates.

TomkUCL commented 2 years ago

All five generative SMILES string libraries (A-E) were combined into a single Excel sheet, and an 'analyse similarity' search was run on DataWarrior (DW) for all of the small molecules together. See the zip files below in DW. Discussion of the clusters will be posted within this thread. Library_A-E_hits.zip

Below is a screenshot of the DW similarity chart; there are a few clusters in there (like the ureas, which are common motifs amongst antivirals). The current objectives are:
a) To eliminate molecules that don't make chemical sense (like the one shown in the bottom right) to shorten the list.
b) To compile the similarity chart as a list that shows how many times the remaining molecules are repeated amongst the five generative libraries (e.g. n = 5 means that the compound was identified by all five generative models, and n = 1 means that the structure was identified by only one model).
c) To check which of the remaining molecules are novel and can be made by us.

Help would be appreciated for the following tasks:

1) Which software is best suited to identifying compounds that don't make chemical sense? Can this be done in DataWarrior itself? I have seen Filter (v.2.5.1.4; OpenEye Scientific Software, Santa Fe) being used to remove invalid chemical structures from commercial fragment libraries (DOI: 10.1039/d1md00363a).

2) How to convert the similarity chart to a list.

3) How to export the list of compounds in DW as 2D structures for manual inspection.

UPDATE 27/01/2022

It looks like the easier route involves first identifying the duplicate SMILES strings in MS Excel, and then transferring those duplicate strings into DataWarrior for similarity analysis. I have done this using the following method (further details can be found at this link: https://www.ablebits.com/office-addins-blog/2016/03/09/how-to-highlight-duplicates-excel/#:~:text=To%20use%20this%20rule%20in%20your%20worksheets%2C%20perform,To%20apply%20the%20default%20format%2C%20simply%20click%20OK.):

1) Combine all libraries (A-E) into a single Excel spreadsheet.
2) Use the 'highlight duplicates' function in Excel to identify the repeated SMILES strings, as follows:
a) On the Home tab, in the Styles group, click Conditional Formatting > New rule > Use a formula to determine which cells to format.
b) In the 'Format values where this formula is true' box, enter a formula similar to this: =COUNTIF($A$2:$A2,$A2)=5, where A2 is the top-most cell of the selected range.
c) Click the Format… button and select the fill and/or font colour you want.
d) Finally, click OK to save and apply the rule.
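For reference, a formula of the form COUNTIF($A$2:$A2,$A2) is a running count: the range grows row by row, so the rule in step b) fires only on the row where a string occurs for the fifth time, and repeats within a single library count towards that total. The sketch below emulates that behaviour in Python and contrasts it with a whole-column count. The function names and the placeholder strings are illustrative assumptions, not real library entries.

```python
from collections import Counter

def running_count_flags(values, n=5):
    """Emulate =COUNTIF($A$2:$A2,$A2)=n: flag a row only when it is the n-th occurrence so far."""
    seen = Counter()
    flags = []
    for value in values:
        seen[value] += 1
        flags.append(seen[value] == n)
    return flags

def total_count_flags(values, n=5):
    """Emulate a whole-column COUNTIF: flag every row whose value occurs exactly n times overall."""
    totals = Counter(values)
    return [totals[value] == n for value in values]

# Placeholder strings, not real library entries.
column = ["CCO"] * 5 + ["CCN"] * 2
print(running_count_flags(column))
print(total_count_flags(column))
```

With the running count only the fifth "CCO" row is flagged, whereas the whole-column count flags all five. Neither version knows which library a row came from.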

From these repeat structures, I will eliminate invalid chemical structures before performing similarity analysis in DataWarrior. This should give us a significantly smaller list of chemically valid structures that are an output of all five generative libraries (A-E).

mattodd commented 2 years ago

Nice @TomkUCL! So does that spit out the ones that are common to all 5 libraries? And in the DataWarrior screenshot above, do the blobs of colour indicate a bunch of molecules that are similar, i.e. we could aim to make representatives from each blob? With so many molecules to choose from, it's hard to know which to choose, @kipUNC. Given a large selection of predictions, how do we make a representative set, @drc007?

TomkUCL commented 2 years ago

The green blobs represent those with an 80% similarity based on the structure similarity search. I've clicked on one cluster as an example, otherwise they are uniformly green like this...

As for spitting out identical structures amongst all five libraries, it initially seemed to highlight in red only those SMILES that are in all five libraries, so that they are easy to spot. However, when I checked some of these manually by searching for the SMILES string in the Excel search bar, in some cases it showed only two or three instances of that SMILES in the sheet.

I believe there is a formula capable of doing this in Excel, but Jemima and I are still trying to find the correct one. Another possibility, based on how long scrolling through the Excel sheet takes, is that Excel might not be capable of handling a data set this large (i.e. it is trying to apply that 'check duplicate' formula to ~59,000 SMILES strings simultaneously).

@jemimahaque and I are currently trying to find a way around this, but if anyone has suggestions, that would speed things up a lot for us. I think our options are either: a) to compare the libraries two at a time (which would be very slow), b) to find a similar method within Excel, or c) to find software that is capable of finding SMILES that are identical across five different libraries in a data set this large.
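As a rough illustration of option c), the sketch below counts how many libraries each SMILES string appears in using only the Python standard library. It assumes the SMILES have already been read into one list per library. The function names and toy strings are hypothetical. It compares strings exactly, so two different SMILES spellings of the same molecule would be missed.

```python
from collections import defaultdict

def library_membership(libraries):
    """Map each SMILES string to the set of library names that contain it."""
    membership = defaultdict(set)
    for name, smiles_list in libraries.items():
        for smiles in smiles_list:
            membership[smiles.strip()].add(name)
    return membership

def shared_by(membership, n):
    """Return the SMILES strings found in exactly n libraries, sorted for stable output."""
    return sorted(s for s, names in membership.items() if len(names) == n)

# Toy data standing in for the five CSV files (not real library entries).
libraries = {
    "A": ["CCO", "c1ccccc1", "CCN"],
    "B": ["CCO", "CCC"],
    "C": ["CCO", "c1ccccc1"],
    "D": ["CCO"],
    "E": ["CCO", "CCN", "CCN"],
}
membership = library_membership(libraries)
for n in range(5, 0, -1):
    print(n, shared_by(membership, n))
```

Because membership is stored as a set of library names, a string repeated within one library (like "CCN" in E) is still counted once for that library.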

@edwintse Do you know if DataWarrior is capable of doing that? (i.e. can we do a similarity search between five datasets (libraries) set at 100% similarity, so that it will either list or cluster identical structures common to all five libraries?)

drc007 commented 2 years ago

@TomkUCL I've written a script for Vortex in Python that does exactly this (https://www.macinchem.org/reviews/vortex/tut28/scripting_vortex28.php). I've run it on 10 million structures and it takes a few minutes. Unfortunately, I don't think DataWarrior has a scripting interface. If you can share the files I can do this for you. I also have a script that can compare multiple datasets.

TomkUCL commented 2 years ago

That's super helpful - thanks :) I've attached the Excel files for the five libraries here.

A-->E Separate: Library_A_hits.csv Library_B_hits.csv Library_C_hits.csv Library_D_hits.csv Library_E_hits.csv

A->E Combined: Library_A-E_hits.csv

drc007 commented 2 years ago

@TomkUCL It looks like the file Library_A_hits is identical to Library_D_hits

drc007 commented 2 years ago

@TomkUCL I think Library_A_hits.csv is actually a copy of the Library_D_hits file contents. I can probably extract the correct data for Library_A_hits from the A->E combined file. To avoid any future errors it might be better to repost Library_A_hits.csv and edit the comment above.

mattodd commented 2 years ago

Hi @kipUNC - were two of the files accidentally duplicates? (see above).

drc007 commented 2 years ago

@mattodd @TomkUCL @kipUNC Since the correct data is in the A->E combined file I suspect a file has been mislabelled at some point.

drc007 commented 2 years ago

I've exported the Library_A_hits data from the A->E combined file and done a comparison of all libraries, as shown below.

CommonStructures

The numbers with the red background are the number of compounds in each library, so Library_B_hits contains 14095 unique structures. The numbers with the green background are the numbers of identical compounds shared by a pair of libraries. So, looking down the first column (Library_B_hits), there are 0 in common with Library E or A, 1 in common with Library C and 32 in common with Library D.
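A pairwise comparison of this kind can be sketched in plain Python as follows. This is not the Vortex script used here, just a minimal illustration under the assumption that each structure has already been reduced to a unique key (for example an InChIKey). The keys and library contents below are placeholders.

```python
from itertools import combinations

def overlap_matrix(libraries):
    """Return (names, matrix): diagonal = unique keys per library, off-diagonal = keys shared by a pair."""
    unique = {name: set(keys) for name, keys in libraries.items()}
    names = sorted(unique)
    matrix = {a: {b: 0 for b in names} for a in names}
    for name in names:
        matrix[name][name] = len(unique[name])
    for a, b in combinations(names, 2):
        shared = len(unique[a] & unique[b])
        matrix[a][b] = shared
        matrix[b][a] = shared
    return names, matrix

# Placeholder keys, not real structures.
libraries = {
    "A": ["k1", "k2", "k3"],
    "B": ["k3", "k4"],
    "C": ["k1", "k3", "k5", "k5"],
}
names, matrix = overlap_matrix(libraries)
for a in names:
    print(a, [matrix[a][b] for b in names])
```

The diagonal plays the role of the red cells (unique structures per library) and the off-diagonal entries the role of the green cells (structures shared by two libraries).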

TomkUCL commented 2 years ago

@drc007 Libraries A and D are opening as different files to us. These are the same files that I received from @kipUNC. The SMILES strings are different on our end.

TomkUCL commented 2 years ago

That's excellent, thanks. To confirm @drc007, this is telling us that there are no SMILES strings that are common to all five generative libraries? Is there a way to list the common SMILES as 2D structures?

drc007 commented 2 years ago

These are the first few lines from the file downloaded from the link above. As you can see, Library_A_hits.csv clearly has data from Library D. Screenshot 2022-01-28 at 10 16 07

drc007 commented 2 years ago

I have a videoconf now, but will annotate the combined file to flag duplicates later.

TomkUCL commented 2 years ago

Hmm that's strange... the Excel files are different when Jemima and I have opened them on separate desktops, so not sure what the problem is here..

drc007 commented 2 years ago

Also, Library_B_hits.csv only has the SMILES, with no identifier. Screenshot 2022-01-28 at 10 22 13

drc007 commented 2 years ago

Can you check the actual files on GitHub?

TomkUCL commented 2 years ago

The above screenshot is from opening from the GH link - @edwintse can you try opening from the GH links above and tell us what you get? Thanks

drc007 commented 2 years ago

Might be better to download from links above and then compare

drc007 commented 2 years ago

@TomkUCL When I open the combined A->E file in Excel, the identifiers for Library B are missing, as seen below. Screenshot 2022-01-28 at 11 43 51

Do you have a version that includes them?

TomkUCL commented 2 years ago

@kipUNC do you have a copy of Library B with identifiers in column B? (see above).

drc007 commented 2 years ago

@TomkUCL The attached file contains all structures, the identifier (where available) and the InChIKey. It also contains a field containing a duplicate flag. If you open this file in DataWarrior, Vortex or similar and then sort on the InChIKey field, you can browse down and see duplicates flagged. You can also filter by duplicate to see just the molecules that appear in multiple libraries. export.sdf.zip

TomkUCL commented 2 years ago

@drc007 Thanks. Does the duplicate flag indicate those molecules that appear in all five libraries (A through E), or just that they appear in another library other than A?

drc007 commented 2 years ago

@TomkUCL @mattodd @kipUNC An update.

I've added a Name to each Library B entry, calculated from the row number (the first record is row 0) with _Library_B appended.
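A naming step like this is simple to reproduce. The sketch below assumes the name is the zero-based row number followed directly by the suffix (e.g. 0_Library_B); the exact format used in the exported file is not shown in this thread, so treat that as a guess.

```python
def name_library_b(smiles_rows, suffix="_Library_B"):
    """Pair each SMILES with a name built from its zero-based row number (format is an assumption)."""
    return [(f"{row}{suffix}", smiles) for row, smiles in enumerate(smiles_rows)]

# Placeholder SMILES, not real Library B entries.
named = name_library_b(["CCO", "CCN"])
print(named)
```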

There are a few structures where Vortex can't interpret the SMILES string to give a sensible structure. Screenshot 2022-01-28 at 14 17 20

For the rest, I calculated the InChIKey and ran a Vortex script that compares InChIKeys to determine whether structures are duplicates. I don't compare SMILES because they are not unique for a given structure.

For duplicates, it adds a duplicate flag. If you now sort by InChIKey, any duplicate structures should appear in adjacent rows, as shown below.
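The flag-and-sort idea can be sketched without Vortex as follows. The sketch assumes each record already carries a precomputed key (computing a real InChIKey needs a cheminformatics toolkit and is not done here). The field names `name`, `inchikey` and `duplicate` and the key values are hypothetical.

```python
from collections import Counter

def flag_duplicates(records, key="inchikey"):
    """Add a boolean 'duplicate' field and sort so that records sharing a key sit in adjacent rows."""
    counts = Counter(record[key] for record in records)
    flagged = [dict(record, duplicate=counts[record[key]] > 1) for record in records]
    return sorted(flagged, key=lambda record: record[key])

# Placeholder records: names and keys are made up for illustration.
records = [
    {"name": "A_1", "inchikey": "KEY-B"},
    {"name": "E_7", "inchikey": "KEY-A"},
    {"name": "A_2", "inchikey": "KEY-A"},
    {"name": "C_3", "inchikey": "KEY-C"},
]
for row in flag_duplicates(records):
    print(row["inchikey"], row["name"], row["duplicate"])
```

Python's sort is stable, so records with the same key keep their original relative order after sorting.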

Screenshot 2022-01-28 at 14 18 04

In this case the structure appears in both Library A and Library E. There are 641 structures that appear in multiple libraries. No structures appear in all 5 libraries or in 4 libraries. There are 47 compounds that appear in 3 libraries.

drc007 commented 2 years ago

Here is the file containing all the data (with Library B names). AllLibrariesExport.sdf.zip

drc007 commented 2 years ago

@TomkUCL Here is an updated file. I've added annotation flagging PAINS and other functional groups that might be a concern. I've also added columns with PubChem and ChEMBL IDs and some vendor information. AllLibrariesAnnotated.sdf.zip

TomkUCL commented 2 years ago

@drc007 Is it possible to add the file here in which the adjacent Duplicates are colour-coded? I am able to see which ones are duplicates, but it would be nice to colour code by those that appear in 3 libraries and 2 libraries as I have started to do in Excel below. I can do this manually but I was hoping you might know a quicker way of doing this?

DW:

Excel:

drc007 commented 2 years ago

@TomkUCL I'll have a look

drc007 commented 2 years ago

@TomkUCL Colour coding obviously does not transfer between applications so I've added an additional column called "Multiple". In this I've tagged molecules that appear 3 times with the number 3, molecules which appear twice with the number 2. You should be able to add conditional formatting based on the number. AllLibrariesAnnotated2.sdf.zip

TomkUCL commented 2 years ago

@drc007 That's brilliant - thanks very much Chris 👍

The corresponding Excel is here in case we need it Alllibrariesannotated_1.xlsx

@mattodd @kipUNC I can check the 141 structures present in 3 libraries (A, C, D and E) for commercial availability. I assume the most reliable way of searching is through the CAS number via the 2D structure. Manually this is obviously quite slow, so I wondered if you can suggest a quicker method for searching? @edwintse do you have any recommendations? I'm assuming this can be done directly in DataWarrior without needing to go through ChemDraw?

TomkUCL commented 2 years ago

It looks like there are a few compounds in the latest list that don't have an associated ZINC code that are also present in 3 libraries. I will search those compounds by SMILES in ZINC Database and by 2D structure in SciFinder for commercial sources to determine whether these compounds are commercially available.

Quick note for clarification: each molecule is repeated three times in the list as shown, so there are actually 47 molecules (not 141) present in 3 libraries (excluding Library B, the Enamine library).

drc007 commented 2 years ago

@TomkUCL That is correct: 141 entries but 47 unique molecules.
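The entries-versus-molecules distinction is just a grouping step: a structure found in 3 libraries contributes 3 rows, so 141 rows correspond to 141 / 3 = 47 structures. A small sketch with placeholder keys (not the real data) reproduces that arithmetic. The function name is an illustrative assumption.

```python
from collections import Counter

def multiplicity_summary(keys):
    """For each multiplicity n, count unique structures and the total rows they occupy."""
    per_structure = Counter(keys)  # key -> number of rows with that key
    summary = {}
    for n in per_structure.values():
        entry = summary.setdefault(n, {"unique": 0, "entries": 0})
        entry["unique"] += 1
        entry["entries"] += n
    return summary

# 47 placeholder keys, each appearing in three rows.
keys = [f"KEY-{i}" for i in range(47) for _ in range(3)]
summary = multiplicity_summary(keys)
print(summary[3])
```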

TomkUCL commented 2 years ago

Here are the 10 compounds identified from @kipUNC's generative libraries that are present in 3 different libraries and have no associated ZINC code. From a quick SciFinder search, it looks like 4 molecules are not commercially available (N/A), but this is worth double-checking. Of these 4 compounds, 2 have reported routes and 2 do not (further checking to follow). @H-agha Are you happy to dock these and get a Glide score?

The SMILES strings are:
Clc1ccc(cc1)CNC(=O)Nc1ccc(cc1)Br
Fc1ccc2c(c1)/C(=C\c1ccc(cc1)Cl)C(=O)N2
Fc1ccc(cc1)C1CC(=O)N1c1ccc(cc1)F
NC(=O)c1ccc(cc1)NC(=O)Nc1ccc(cc1)Br
[O-][N+](=O)c1ccccc1NC(=O)Nc1ccc(cc1)Cl
Clc1ccc(cc1)NC(=O)CNC(=O)c1ccccc1Cl
CN(C)c1ccc(cc1)/C=C\1/C(=O)Nc2ccc(cc21)F
Clc1ccc(cc1)C1=C(C(=O)NC1=O)c1ccc(cc1)Cl
Fc1ccc(cc1)C1=C(C(=O)NC1=O)c1ccc(cc1)F
COc1ccc(cc1OC)/C=C(\C#N)C(=O)Nc1ccc(cc1)Cl

drc007 commented 2 years ago

@TomkUCL Some of these look trivial to make. It might be worth searching Enamine REAL to see if they are available as "make on demand". https://enamine.net/compound-collections/real-compounds/real-database

TomkUCL commented 2 years ago

@drc007 thanks, I've just sent off a custom request to Enamine.

H-agha commented 2 years ago

Hi @TomkUCL, I am happy to dock them. For easier handling, and to avoid further confusion, I'd like to name these compounds as follows.

N.B.: I'm not sure why Cpd K5 was not recognized in Maestro, so I redrew it in ChemDraw, took the SMILES from there and pasted it into Maestro, which worked.

K5 = O=C(Nc1c([N+]([O-])=O)cccc1)Nc2ccc(Cl)cc2

drc007 commented 2 years ago

@TomkUCL @H-agha All the compounds have a unique ID from the generative model. It might be better to use that to avoid any problems tracing back the source of a compound.

I suspect the problem is GitHub markdown. It would be better to use a plain text file (SMILES or SDF) to exchange structural information.
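One plausible mechanism, offered as an assumption rather than a confirmed diagnosis: Markdown reads `[text](target)` as a link and displays only `text`, and a charge-separated nitro group written as `[N+](=O)` or `[N+]([O-])` happens to have exactly that shape. The sketch below uses a simplified regular expression as a stand-in for a real Markdown renderer (it is not GitHub's actual implementation) to show what happens to such a SMILES string when it is pasted as ordinary text.

```python
import re

# Simplified stand-in for Markdown inline-link rendering: [text](target) -> text.
INLINE_LINK = re.compile(r"\[([^\]]+)\]\(([^)]*)\)")

def render_like_markdown(text):
    """Replace every [text](target) pattern with just its visible text."""
    return INLINE_LINK.sub(r"\1", text)

# Nitro-containing SMILES, assumed to be what was originally pasted.
nitro_urea = "[O-][N+](=O)c1ccccc1NC(=O)Nc1ccc(cc1)Cl"
redrawn = "O=C(Nc1c([N+]([O-])=O)cccc1)Nc2ccc(Cl)cc2"

print(render_like_markdown(nitro_urea))
print(render_like_markdown(redrawn))
```

In both cases the brackets around N+ and the following parenthesised group disappear, leaving a string that is no longer valid SMILES. Posting structures inside a code block or as an attached plain-text file avoids this.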

H-agha commented 2 years ago

@TomkUCL @drc007 Until we get the proper naming of these compounds, here are the scores and the PyMOL session of the 10 compounds identified by @kipUNC, compared to the scores of the original fragments.

K series poses.pse.zip

TomkUCL commented 2 years ago

Thanks @H-agha . I have a couple of questions about this...

What are we considering as a 'good' Glide score for these molecules, considering that these are fragments rather than drug-like molecules?

Do the interaction maps indicate any key residue binding motifs in this K set that we have not considered before?

TomkUCL commented 2 years ago

This is from 2019, so I'm not sure whether it is still a relevant/similar approach for generative modelling from the pharmacophore, but it may be another method for new molecule generation, or at least another perspective: Drug Analogs from Fragment-Based Long Short-Term Memory Generative Neural Networks http://pubs.acs.org/action/showCitFormats?doi=10.1021/acs.jcim.8b00902

TomkUCL commented 2 years ago

@mattodd @kipUNC From today's update meeting (11th Feb 2022), we have agreed that we want to filter the compounds in libraries A, C and E above (excluding B and D) in DataWarrior based on:

A) Molecular weight, i.e. those which are 2-3 heavy atoms (C or N) / 40-60 Da heavier than the pharmacophore fragments highlighted by Heba in issue #2.

B) LogP less than 3 (< 3), calculated in DataWarrior, for improved hydrophilicity.
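The two criteria can be expressed as a simple filter once molecular weight and LogP have been calculated and exported (for example from DataWarrior). The sketch below is only an illustration: the field names `mw` and `logp`, the compound IDs, the property values and the reference fragment weight of 150 Da are all hypothetical.

```python
def passes_filters(compound, fragment_mw, min_delta=40.0, max_delta=60.0, max_logp=3.0):
    """Keep compounds 40-60 Da heavier than the reference fragment with LogP below 3."""
    delta = compound["mw"] - fragment_mw
    return min_delta <= delta <= max_delta and compound["logp"] < max_logp

# Hypothetical property values, not taken from the libraries.
fragment_mw = 150.0
compounds = [
    {"id": "cpd-1", "mw": 195.0, "logp": 2.1},  # 45 Da heavier, LogP OK
    {"id": "cpd-2", "mw": 230.0, "logp": 1.5},  # 80 Da heavier: too heavy
    {"id": "cpd-3", "mw": 200.0, "logp": 3.4},  # weight OK, LogP too high
    {"id": "cpd-4", "mw": 210.0, "logp": 2.9},  # 60 Da heavier (upper bound included), LogP OK
]
kept = [c["id"] for c in compounds if passes_filters(c, fragment_mw)]
print(kept)
```

Whether the 40 and 60 Da bounds should be inclusive, and which fragment each compound is compared against, are choices the thread does not specify.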

This topic will be followed up in issue #17

kipUNC commented 2 years ago

Correct. Let's see how many satisfy those criteria and how diverse those molecules are.

TomkUCL commented 2 years ago

@kipUNC could you upload your latest generative workflow slides below here? thanks

TomkUCL commented 1 year ago

@kipUNC can you please explain where the de-novo N-oxides originated? I'm having a difficult time relating these earlier libraries to the N-oxide targets. Would be good to clarify. Thanks.

StructuralGenomicsConsortium / CNP4-Nsp13-C-terminus-B

Analysis of Generative Modelling Hit Libraries #15