OpenSourceMalaria / Series4_PredictiveModel

Can we Predict Active Compounds in OSM Series 4?

New Series 4 candidates based on generative model - EOSI #34

Open miquelduranfrigola opened 3 years ago

miquelduranfrigola commented 3 years ago

Hello @mattodd @edwintse,

At @ersilia-os we have tried to generate new Series 4 candidates. In short, we provide two tables:

For a first assessment of the results, you can check this dynamic visualization of the selected 1k candidates. If a cluster is of particular interest, please refer to the full results to discover other similar molecules. You can also check a tree map of all molecules.

Our generative model approach is based on Reinvent 2.0. We have implemented several reinforcement-learning agents, aimed at optimizing activity and other desirable properties. This GitHub Repository contains more detailed information and source code.

This is the first time we run a generative model, so please bear with us. We will be more than happy to optimize further runs based on your feedback.

Thanks! @GemmaTuron @miquelduranfrigola

edwintse commented 3 years ago

Hi @miquelduranfrigola, thanks for all these great suggestions!

miquelduranfrigola commented 3 years ago

Hi @edwintse thanks for the feedback.

To answer your points

Suggested way forward

We will:

Now the most important is that we define the constraints for the generative model. These 3 are clear:

  1. Activity
  2. Chemical diversity / distance to known series 4 compounds
  3. SLogP

Anything else? For example:

It will be very useful to know precisely your ideal profile of properties. Please let us know and we will try to implement it.

mattodd commented 3 years ago

Hi @miquelduranfrigola. Very interesting, and a great way to visualise the suggestions.

I'd definitely agree with you re the top three filters. For logP we tend to want to focus on compounds <3.5, very roughly. Synthetic accessibility is a "nice to have". We can of course do this as humans, but it can take a while with so many suggestions, and as we mention in the preprint about the previous competitions, we had to ditch some interesting possibilities since they would have taken too long to do. So that's a soft "yes" as a feature. Molecular weight would always generally be below 500, but that's just typical. For the others, they're not so important, since we can always try to engineer out problematic motifs (like aromatic amines) if they are found to be potent.
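As a minimal sketch of how the two rough thresholds mentioned here (logP below about 3.5, molecular weight below 500) could be applied to a table of candidates: the `Candidate` record, its field names, and the example values are all hypothetical placeholders, and the properties are assumed to have been computed beforehand with a cheminformatics toolkit.

```python
from dataclasses import dataclass

# Thresholds taken from the comment above; both are rough guides, not hard rules.
MAX_SLOGP = 3.5
MAX_MOLWT = 500.0


@dataclass
class Candidate:
    name: str      # hypothetical identifier
    slogp: float   # precomputed SLogP (placeholder value)
    mol_wt: float  # precomputed molecular weight in g/mol (placeholder value)


def passes_filters(c: Candidate) -> bool:
    """Return True if the candidate is below both property thresholds."""
    return c.slogp < MAX_SLOGP and c.mol_wt < MAX_MOLWT


# Placeholder records for illustration only; these are not real compounds.
candidates = [
    Candidate("cand-A", slogp=2.9, mol_wt=410.0),
    Candidate("cand-B", slogp=4.2, mol_wt=455.0),
    Candidate("cand-C", slogp=3.1, mol_wt=530.0),
]

shortlist = [c.name for c in candidates if passes_filters(c)]
print(shortlist)
```

Softer preferences such as synthetic accessibility would probably work better as a ranking term than as a hard cut, since they were described as "nice to have".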

To consider the shortlist, though, it would seem that, with those constraints applied, we (me, @edwintse, other potential contributors) need to simply browse through and isolate structures that, well, take our fancy. I mean, there are a lot of possibles here.

Pinging @drc007 purely because I think you'll get a kick out of this.

drc007 commented 3 years ago

@miquelduranfrigola, This is really interesting. It would also be useful to have a measure of the "confidence" in the predicted activity model.

Can you also identify which new molecules would add the greatest amount of new information to your predicted activity model?

edwintse commented 3 years ago

Hi @miquelduranfrigola, if you're able to incorporate all the filters @mattodd mentioned that'd be great. I suppose we'd be looking at getting the list down to 100 compounds, then we can quickly look through and pick a couple out to make.

Just on the tree map visualisation tool, we were just curious about how the compounds were placed throughout the tree and if there was anything particularly significant about the different red clusters? I guess it's a bit hard to pick out certain compounds within the clusters or to know which are the "best".

miquelduranfrigola commented 3 years ago

Thanks, @mattodd @drc007 and @edwintse for your comments! This will help moving forward. In the following days, @GemmaTuron and I will give it a push.

We will try to:

In addition, I will provide a deeper explanation of the TMAP. I guess we will select the top 100 candidates based on "red regions", so hopefully this will address @edwintse's good point about how to navigate this map.

As for @drc007's suggestion to identify what molecules would add more information to future models... very interesting, didn't think of this! I don't have an immediate answer, but we will try to address the point. Perhaps, to start with, we could see what molecules would expand more efficiently the applicability domain?
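One possible reading of "expanding the applicability domain", sketched under stated assumptions: rank candidates by their Tanimoto similarity to the nearest training compound, so that the least similar ones (those furthest from what the model has seen) come first. The fingerprints below are toy sets of on-bit indices, not real molecules, and the function names are illustrative.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nearest_similarity(query: frozenset, training: list) -> float:
    """Highest Tanimoto similarity between a query and any training compound."""
    return max(tanimoto(query, t) for t in training)


# Toy fingerprints (sets of on-bit indices); placeholders, not real molecules.
training = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 4, 5})]
candidates = {
    "cand-A": frozenset({1, 2, 3, 5}),  # shares most bits with the training set
    "cand-B": frozenset({1, 7, 8, 9}),  # mostly bits the training set lacks
}

# Candidates with the lowest nearest-neighbour similarity come first.
ranked = sorted(
    candidates,
    key=lambda name: nearest_similarity(candidates[name], training),
)
print(ranked)
```

Distance alone ignores predicted activity, so in practice this ranking would likely be combined with a potency or uncertainty estimate.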

edwintse commented 3 years ago

Hi @miquelduranfrigola, we were just wondering whether it's possible to add a substructure search function to the dynamic visualisation tool? I guess we'd want to be focusing on structures that have meta or para substituents on the RHS phenyl ring as the more interesting ones to pursue.

GemmaTuron commented 3 years ago

Hello @edwintse, I was actually just having a closer look at the most desirable substituents according to the information on the wiki and the Series 4 paper. We are trying to refine the molecule generator these days, so it would be great if you could give us some hints about the most desirable substituents, also taking into account what you have observed in terms of HLM and RLM. As for the display, we can try to play a bit with SMARTS patterns and see if we can incorporate this in the visualization tool. Will let you know if it works.

GemmaTuron commented 3 years ago

Hello @edwintse, sorry for the delay! I have updated the app visualization to provide some substructure search capabilities. As you mentioned you are interested in the RHS substituent, I have added the following select-boxes:

Heteroaryl: when selected, displays all RHS substituents composed of an aryl (including phenyl).
Phenyl: when selected, displays RHS substituents containing strictly phenyls (no heteroatoms).
Para/Meta/Orto: these options select compounds with para-, meta- or ortho-substituents on the phenyl.

Let me know if this is useful or if you were thinking of different filters.

edwintse commented 3 years ago

@GemmaTuron Wow, that's amazing and super useful to narrow things down!

edwintse commented 3 years ago

Hi @GemmaTuron, I've been trying to make some compounds suggested by Evariste recently (#29) and was wondering if you guys ever generated any structures containing structures similar to those in this comment with indole/benzimidazole type groups on the RHS (or even any of the other structures that they predicted)? It would be interesting to see if there was any overlap between your suggestions and those from Evariste.

miquelduranfrigola commented 3 years ago

Hi @edwintse @mattodd @drc007

We are preparing a new batch of generated molecules. We will get back to you shortly. Good idea, Edwin, we will check overlap with molecules from Evariste. Thanks!

Meanwhile, @GemmaTuron and I have prepared a small app where you can input your molecules of interest and get activity predictions from a few simple ML models. Perhaps this is useful if you have some candidates from our lists or others, or want to try small modifications on those molecules. Feedback most welcome!

Many thanks!

edwintse commented 3 years ago

The app is amazing! We've just had some new suggestions come through from Evariste (#29) so it's already been very useful for cross-checking between the predictions.

GemmaTuron commented 3 years ago

Hello @mattodd @edwintse,

As mentioned earlier, with @miquelduranfrigola we have done a second round of molecule generation. A detailed description of the process can be found in this repo: https://github.com/ersilia-os/osm-series4-candidates-2.

In summary, we created a list of >400k candidate molecules that have undergone successive rounds of selection based on activity prediction, desirable physicochemical properties and synthetic accessibility scores. Finally, we have selected the best 90 compounds according to their predicted activity against P. falciparum. The molecules can also be visualized in this app.

Exploration vs exploitation

You will probably see that these candidates are considerably different from your known series 4 dataset. This is because we have worked in “exploration” mode, i.e. we explore regions of the chemical space that are distant from the existing compounds. We hope that this collection nicely complements the compounds discovered in issue #29.

Metrics

IC50Pred: the lower the better. It is probably biased towards high values, so hopefully it is a conservative estimate.
DeepActivity: the higher the better. It is a composite z-score between several deep learning scores (chemprop, grover; trained on classification and regression tasks).

Aside from these two metrics, there are a bunch of physchem properties (MolWt, SlogP, Number of Rings, Heavy Atom Count…) and synthetic accessibility (SA, RA and Syba) scores that can be used to refine the search. As in the previous round, we have now included columns to select molecules with RHS radicals including para, ortho or meta substituents. Let us know if any of these molecules look interesting!
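The thread does not say exactly how the individual deep learning scores are combined into DeepActivity. A minimal sketch of one common approach, assuming a plain average of per-model z-scores, is shown below; the model names and score values are placeholders.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardise scores to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def composite_score(score_table):
    """Average the per-model z-scores for each molecule.

    score_table maps a model name to a list of scores, one per molecule,
    with the same molecule order for every model.
    """
    per_model = [z_scores(scores) for scores in score_table.values()]
    return [mean(column) for column in zip(*per_model)]


# Placeholder scores for three molecules; the model names are labels only.
scores = {
    "model_1": [0.2, 0.5, 0.8],
    "model_2": [2.0, 1.0, 3.0],
}
composite = composite_score(scores)

# Index of the molecule with the highest composite score.
best = max(range(len(composite)), key=composite.__getitem__)
print(composite, best)
```

Standardising first puts models with different output ranges (for example a probability and a regression value) on a comparable scale before they are averaged.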

miquelduranfrigola commented 3 years ago

Hi @edwintse, as you can see in the comment above by @GemmaTuron, we have done a second round of generative models. To (sort of) answer your question, here are two quick-and-dirty PCA plots (done with Morgan fingerprints) comparing:

  1. Known inactives (only left plot)
  2. Known actives
  3. Compounds in issue #29 (i.e. done in "exploitation" mode)
  4. Our 90 selected compounds (i.e. done in "exploration" mode).

[Image: two PCA plots of Morgan fingerprints. Left: all four compound sets. Right: the same sets without the known inactives.]

As you can see, we have a couple of compounds that cluster together with Evariste's compounds.

mattodd commented 3 years ago

OK @miquelduranfrigola @GemmaTuron this is most interesting. To make sure I understand:

The "exploitation" compounds are compounds you're predicting to be active that are derived fairly directly from other actives. The "exploration" compounds are those where you're intentionally trying to stay within the clusters of actives, and away from the inactives, yet which are sampling different areas of chemical space. So, in the left hand plot above we see no red Exploration compounds in regions where there are green inactives. In the right hand plot we're seeing exploration compounds peppering the space of known actives but in a much more diverse cloud than the purple Exploitation compounds.

Is the right hand plot meant to look like a zoom in to an area of the left hand plot? I couldn't quite map the two. I'm guessing the axis units are arbitrary, or relative? I was trying to use that as a guide.

If this is all correct (?) then we're going to need to take a look at the Exploration structures more closely. That you've factored in synthetic accessibility is a major plus there.

GemmaTuron commented 3 years ago

Hi @mattodd ,

The exploitation compounds plotted are the ones predicted by Evariste. We have used an "exploratory" generative model and, as you mention, we are trying to stay close to the actives while querying different areas of chemical space. Your interpretation of the right and left graphs is correct. This is a PCA representation, so the axis units are indeed arbitrary. The PCA was computed once with the four datasets (left) and again with the three datasets (right), so the right plot is not exactly a zoom of the left one. What is interesting is that some of our compounds (red) overlap with the chemical space of the Evariste compounds (purple), a good signal that these have potentially strong activity. The rest of our predicted compounds (red) are interesting because they differ a bit from known actives and have been optimized not only for activity but also for solubility, accessibility, etc. You can explore the 90 selected compounds we have produced here, which includes several estimates of synthetic accessibility.
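To illustrate why a PCA refit on a subset is not a zoom of the original plot, here is a toy 2D sketch (the real plots use high-dimensional Morgan fingerprints, so this is only an analogy). The points and the function name are made up for illustration; the closed-form angle applies only to 2D data.

```python
import math


def principal_axis_angle(points):
    """Angle (radians) of the first principal component of 2D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed-form orientation of the leading eigenvector of a 2x2 covariance matrix.
    return 0.5 * math.atan2(2 * sxy, sxx - syy)


# Made-up points: one group sits far away along x, the other is spread along y.
far_group = [(10.0, 1.0), (12.0, 2.0), (14.0, 1.5)]
near_group = [(0.0, 0.0), (0.2, 1.0), (0.0, 2.0), (0.2, 3.0)]

angle_all = principal_axis_angle(far_group + near_group)
angle_subset = principal_axis_angle(near_group)

print(round(math.degrees(angle_all), 1), round(math.degrees(angle_subset), 1))
```

With the distant group included, the first component runs almost along x; once that group is dropped, it runs almost along y, so the same compounds land at different coordinates in the two plots.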

Hope this clarifies a bit more !

edwintse commented 3 years ago

@miquelduranfrigola @GemmaTuron We're a bit curious about the compounds that cluster with the Evariste ones. It seems like there's only a few red dots within the purple cluster. Were you able to give a zoom in on that region and show the exploration structures? I guess those would be the ones we'd prioritise if we were to make any.

miquelduranfrigola commented 3 years ago

Hi @edwintse these are the two molecules that in the PCA plot cluster together with Evariste compounds:

two_molecules-01

A few disclaimers and thoughts:

I hope this helps! Miquel

edwintse commented 2 years ago

Hi @miquelduranfrigola @GemmaTuron, just checking in to see how the compound generation is going? I've finished making and purifying the compounds from Evariste (#29) and will have them tested soonish, but we were hoping to start planning starting materials from your compounds that we might need to make or purchase.

GemmaTuron commented 2 years ago

Hi @edwintse we have started working on it, we hope by the end of next week to be able to share some news!

GemmaTuron commented 2 years ago

Hello @edwintse and @mattodd !

We have a final list of candidates (35 molecules) + an extended list of alternatives (1200 molecules). They all have high predicted potency, so perhaps now we can choose the ones with easier synthetic route and other interesting characteristics like solubility. In the files we provide a list of smiles and their predicted IC50, probability of being active with a cut-off of 1uM and probability of being active with a cut-off of 2.5uM.

All data and code is available in this repository. In short, we have:

We provide the 35 molecules with the highest predicted activity from the list of 90 as putative candidates for synthesis, but we can also try to refine the search and enrich the list with candidates from the list of 1295 molecules that are also predicted to be highly active.

Let us know your thoughts on these molecules and if there is any extra filter you would like to add before choosing the ones to be synthesized.

mattodd commented 2 years ago

OK, great @GemmaTuron. So, @edwintse (or Gemma) can you parse into a picture so that we can see roughly what starting materials we might be looking at in the most general sense? e.g. if there's a gram needed of the core Series 4 scaffold?

GemmaTuron commented 2 years ago

Hi @mattodd I created a small .html file showing the molecules as well as their smiles and predicted activity. To browse the list of the 35 selected ones, download and unzip this folder and open the .html in a browser. Hope it helps!

edwintse commented 2 years ago

Thanks for all the new compounds @GemmaTuron! @mattodd I've drawn out all the compounds in order of predicted IC50 (left to right, top to bottom). I did a quick availability search for everything

Ersilia2022

Ersilia2022 chemdraw.zip

edwintse commented 2 years ago

Hi @GemmaTuron, based on the analysis above, we were thinking that since the majority of compounds have the 4-OCHF2 substitution on the RHS phenyl ring, that would be the most straightforward to work with. Are you able to put the following compounds through your model to see how they fare? (They are combinations of the green alcohols that are readily purchasable with the OCHF2 core.)

Untitled Wiley-7

ClC(S1)=CC=C1COC2=CN=CC3=NN=C(C4=CC=C(OC(F)F)C=C4)N32
FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCCC(F)(F)F)N32
FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCC4=CC=C(C(F)(F)F)C=C4)N32
FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCC4=CC=CC(C(F)(F)F)=C4)N32
FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCC4=CC=CC=C4C(F)(F)F)N32
ClC1=CC2=C(OCO2)C=C1COC3=CN=CC4=NN=C(C5=CC=C(OC(F)F)C=C5)N43

GemmaTuron commented 2 years ago

Hi @edwintse! Thanks for checking the availability of purchasable compounds. Your suggestions seem very interesting: the second is the "least" active (still, predicted IC50 between 1 and 2.5 uM) and the rest are all < 1 uM. I am attaching a .csv file containing the probability of a molecule being active (1) when:

You will see that the second molecule is the lowest scoring in the model that uses the 2.5 uM cutoff but is still predicted active, whereas the more restrictive cut-off predicts it as inactive. The rest all look highly active. I am also adding the predicted activity in uM.
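The two classifiers differ in the IC50 cut-off used to call a compound active. As a small sketch of why one compound can be active under one and inactive under the other: the cut-off values come from the thread, while the "at or below" boundary convention, the function names, and the example IC50 values are assumptions for illustration.

```python
# Cut-offs in uM, as quoted in the thread.
CUTOFFS_UM = (1.0, 2.5)


def is_active(ic50_um: float, cutoff_um: float) -> bool:
    """Label a compound active if its IC50 (uM) is at or below the cut-off.

    Whether the boundary itself counts as active is an assumption here.
    """
    return ic50_um <= cutoff_um


def labels_for(ic50_um: float) -> dict:
    """Return the active/inactive label under each cut-off."""
    return {cutoff: is_active(ic50_um, cutoff) for cutoff in CUTOFFS_UM}


# Placeholder IC50 between the two cut-offs, mirroring the case described above.
example = labels_for(1.8)
print(example)
```

A compound with an IC50 between 1 and 2.5 uM is labelled active by the looser cut-off and inactive by the stricter one, which matches the behaviour described for the second molecule.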

osm_sugg.csv

mattodd commented 2 years ago

Very interesting - that obviously makes some of the synthesis more straightforward. @edwintse is also looking into whether some of the interesting building blocks might be available through synthesis/purchase. I like some of the (sadly red) "constrained" ones like EOS-12 and EOS-65. EOS-19 is more accessible though, so we can address that suggestion in that way perhaps.

edwintse commented 2 years ago

@mattodd Here are the predicted IC50s of the combination compounds in increasing order. The prices of the respective alcohols are also shown.

Underneath that are the synthetic routes to get the orange alcohols. I've gotten prices from Chemspace which are shown under the final alcohol. Only the last 2 had quotes from Mcule but I'm currently waiting for a reply. The prices of the starting materials are also shown. None of the chemistry looks particularly difficult.

Untitled Wiley-8

miquelduranfrigola commented 2 years ago

This looks great @edwintse and @mattodd ! Many thanks. Please let us know if you need further feedback on our side. In particular, if you come up with easier-to-synthesize/cheaper analogues, we will be happy to run predictions for them as @GemmaTuron did a few days ago.

mattodd commented 2 years ago

Great stuff @edwintse.

Top row - order the lot, except the last one.
Second row - don't know. Quite a lot of effort for a phenol that I bet won't be active. If the reagent in the first step is not nasty, then maybe it's worth making.
Third row - nice, make.
Fourth row - nice, make, but we do lose some of the logP advantage of benzylic OHs etc (like EOS-12), but still a nice idea.
Fifth row - nice, make.
Alkyne - would be nice to buy. Any cheaper with F rather than Cl (available from Biosynth - get if predicted to be OK by @GemmaTuron)? Or smaller bottle size?
Final thiophene - buy!

GemmaTuron commented 2 years ago

Hi! Just wanted to post here that we have been awarded a small grant by the Rosetrees Trust to test and optimize some of the compounds proposed here, so looking forward to the first experimental results! @edwintse @mattodd any other test you want us to run before that?

edwintse commented 2 years ago

Hi @GemmaTuron @miquelduranfrigola @mattodd, just updating on where I'm at currently with all the synthesis.

Ersilia update

Ersilia update chemdraw.zip

GemmaTuron commented 2 years ago

Hi @edwintse, sorry for the delay in getting back to you. The molecule you suggest, FC(C(C)=C1)=CC=C1C#CCOC2=CN=CC3=NN=C(C4=CC=C(OC(F)F)C=C4)N32, also has a good prediction (0.61 probability of being active and predicted IC50 0.47). Is there any other molecule you want me to run? I can do it today or tomorrow.

MFernflower commented 2 years ago

@GemmaTuron this was an idea I had related to the compound above:

C(C1C=CC(C#CCOC2=CN=CC3=NN=C(C4=CC=C(OC(F)F)C=C4)N23)=CC=1)#N

GemmaTuron commented 2 years ago

Hi @MFernflower Thanks for the suggestion. The activity prediction is a bit lower:

edwintse commented 2 years ago

Hi @GemmaTuron @miquelduranfrigola - I have just sent the finished compounds for evaluation today. There were 3 remaining compounds below the horizontal line that I didn't manage to finish off but could be sent with the next round of compounds

edwintse commented 1 year ago

Hi @GemmaTuron @miquelduranfrigola @mattodd. We've received the results from the compounds that I sent for testing as you can see below. These are from duplicate experiments with the two values on the 4th line, and the average potency at the bottom. There's some quite nice results, especially with EGT 580-1 at 77 nM

Ersilia Results Ersilia Results Chemdraw.zip

mattodd commented 1 year ago

Wow, interesting data @edwintse @GemmaTuron. The 77 nM compound is indeed good. I also like the 75-70-74 CF3 sequence. OSM-LO-73 is pretty interesting, too, in that the ether linkage would, I expect, remove the benzylic metabolic liability, but it's still around 1 uM, and has what we thought was a suboptimal linker length.

drc007 commented 1 year ago

@edwintse Is MMV1964890 racemic?

edwintse commented 1 year ago

@drc007 Yes, it is racemic. Unfortunately I don't have enough of it to do any chiral HPLC testing.

mattodd commented 1 year ago

@edwintse @drc007 OK, but let's think about that. It might be a compound we should get mic clearance data on, no? i.e. could we, or should we, make some more to look at it in a little more depth (including possibly enantiomer separation)? I suspect predicted solubility is low, though.

edwintse commented 1 year ago

@mattodd I can make more. The alcohol is fine. It was just the SNAr that was a bit low yielding after purification. Datawarrior gives a clogp of 2.7 for this

drc007 commented 1 year ago

@edwintse would it be worth resolving the alcohol first?

edwintse commented 1 year ago

Possibly, although sometimes I don't completely purify the alcohols. I can see what I have left from when I made it.

GemmaTuron commented 1 year ago

hi @edwintse and all!

This is great news! Very excited about these results, thanks! Would it be possible to have a short meeting for us to understand what would be more interesting to explore (for example, more compounds very close to this space, another space that we haven't looked into, revisiting some of the predictions we made...)?

mattodd commented 1 year ago

@GemmaTuron Yes, let's. This coming Thursday pm would work at e.g. 3 UK time? Or 4pm UK time Friday? Happy to have it an open meeting so others can join/suggest if want?

GemmaTuron commented 1 year ago

Hi @mattodd !

This week is complicated on our side, can we do NEXT Thursday (10th) at 15:00 UK time? Of course happy to have it open.

mattodd commented 1 year ago

No good - Friday 11th at 1, 3 or 4 UK? Otherwise I fear we may have a looming Doodle Poll 🤕

MFernflower commented 1 year ago

With regard to EGT614 would it be possible to make the analog with an extra carbon between the benzocyclobutane and the core? Seems like truncating the alkyl ether chain to anything other than ethyl can drop potency @edwintse

On Tue, Nov 1, 2022 at 11:19 AM Mat Todd @.***> wrote:

No good - Friday 11th at 1, 3 or 4 UK? Otherwise I fear we may have a looming Doodle Poll 🤕
