OpenSourceMalaria / Series4_PredictiveModel

Can we Predict Active Compounds in OSM Series 4?

Evariste Technologies compounds #29

Open abrennan5 opened 3 years ago

abrennan5 commented 3 years ago

Hi all,

We’ve already caught up with @mattodd about this but will just give a quick intro for everybody else. Evariste Technologies is a start-up focusing on a probabilistic approach to drug discovery. Briefly, the platform we’ve built, Frobenius, takes an existing dataset and identifies the most promising starting point/s, then designs a bunch of new compounds and scores them according to the likelihood of achieving a set of pre-specified endpoints.

We were really interested in the recent publication detailing the open competition run by the OSM team and thought we’d have a crack at the problem ourselves. The compounds attached are the output generated by Frobenius when it’s presented with the series 4 data. More specifically, we’ve taken the two most promising starting points, applied the various compound designers and selected a subset of the highest scoring compounds (filtered by a medicinal chemist for synthetic feasibility etc). The number associated with each compound is the probability of it achieving a pIC50 of 8.

As we mentioned to Mat, we’re keen to get some of these synthesised and are able to contribute towards the cost of synthesis. We’re also more than happy for anyone interested in these compounds to use them as inspiration for similar structures. If you do, we’d really appreciate being kept in the loop, as we can very readily score the idea in Frobenius.

If anyone’s interested in knowing more about the modelling than the (very) brief overview I’ve given here we’re happy to discuss in detail.

Best wishes, Alfie

https://www.evaristetechnologies.com https://www.linkedin.com/in/alfie-brennan-746ba6b1

Evariste Suggestions

Series4_EVT_Suggestions.xlsx

Malaria suggestions pIC50 8 .pdf

mattodd commented 3 years ago

Hi @abrennan5 - this is very interesting, thanks for posting. It'd be really great to make some of these, yes.

Could someone please take a look and generate a picture of the molecules that we can quickly post into an issue, just to make the ideas easier to digest? (I know there's a PDF, I'd just like a cdx. Alfie, if you have one, you can drag it here, but you may have to zip it up for GitHub to accept it.)

I'm intrigued by the meta-subst rings in the northeast, which is something we played with a bit in @maratsydney 's work, but have not explored a lot. The accessibility of those will depend on availability of materials, I suspect.

For EVT-004 there's a CF3 there. I thought we'd done that?

Also of interest is the EVT-007, EVT-008, EVT-009. Didn't we take that N away (in EVT-009) already? Replacement with C-F and C-Cl is something we've not done.

abrennan5 commented 3 years ago

Hi Mat,

Compressed cdx below. Each compound designer should drop suggestions that are already in the dataset but if any of these have cropped up before let me know and we can fix that.

Malaria suggestions pIC50 8 .zip

edwintse commented 3 years ago

I've updated the original post with the structures of the suggestions. Just some quick comments:

"For EVT-004 there's a CF3 there. I thought we'd done that?"

We have 1 compound with an OCF3 group as in EVT-004 but it has an amide on the LHS (MMV675963/OSM-S-271/TM 55-1).

"Didn't we take that N away (in EVT-009) already?"

Yes, we have 4 compounds with X = H instead of that N atom and a 4-CN on the RHS (from Patrick Thomson's work). Examples include both ether and amide linkers on the LHS but all are inactive.

I'm not sure if the synthetic accessibility of the benzylic OCHF2 group has changed but that was one of the things that we weren't able to remake from the inherited compounds. If we can't find a way to make that side-chain then all the compounds based on Starting Point 2 may not be accessible.

I'll have a look at whether the aldehydes needed for the different cores are available in the meantime and update later.

abrennan5 commented 3 years ago

Great to meet you Edwin, thanks for uploading the image. Re the compounds containing the benzylic OCF2H, the OCH3 analogue was also one of the top ranked hits, see below for a set of suggestions from that starting point. Perhaps unsurprisingly, the model scores modifications of the methoxy group relatively highly as it views changes here as likely to increase potency. There were a significant number of designs in this region, all with around the same 4 - 7% chance of having pIC50 > 8. I selected the smaller changes to avoid having too much of an impact on solubility and logD. I've also (hopefully) fixed the filtering such that there shouldn't be any compounds already present in the dataset included here but please let me know if that isn't the case.

Let me know your thoughts around this set of molecules as well as just replacing the OCF2H with OCH3 in EVT-011/021.

OMe_suggestions.zip

abrennan5 commented 3 years ago

Hi @mattodd and @edwintse,

I hope you're both well. I just wanted to follow up with our most recent work here. We're planning to publish the process of generating these ideas in a blog post at some point soon, but I wanted to get them uploaded here first as it builds on so much of your hard work. You'll note the compounds suggested here are slightly different to those above. There are a couple of reasons for this. Most importantly, instead of just optimising for improved potency, we asked the model to also aim for molecules with improved solubility and logD in the range of 0 - 4 (essentially 1 - 3 plus prediction error). We have also used an updated version of our platform which predicts our error more accurately.

Finally, when selecting compounds we used a 'pessimistic design' algorithm which picks molecules for synthesis predicated on the failure of all the preceding molecules. This filters our list of 30 designs (based on three different starting points) down to 10. You'll note that in the PDF there are two predicted potency values, black is independent of the result for the earlier molecules, red is presuming that the preceding molecules failed to hit the desired endpoint (pIC50 > 8).

None of these feature the troublesome benzylic OCF2H, due to the prioritisation of molecules which were predicted to be more soluble.

Happy to field any questions and share a wider selection of designs if that would be helpful.

Best wishes, Alfie

compound_picker_suggestions.pdf compound_picker_suggestions.zip

edwintse commented 3 years ago

Hi @abrennan5, I've finally had a look into potential synthesis of these compounds and it doesn't look like we'll be able to make any from your last post. I think the main issue is with the difluorophenyl ring - these substrates are typically more expensive to get a hold of compared to phenyl ring analogues, and this is especially the case when there are additional substituents on the ring. Just based on gut feeling, the compounds in your original post from Starting Point 1 are also less desirable as changes to the NE portion tend to decrease potency. I'm starting to think that the Starting Point 1 compound isn't the most ideal in terms of synthesising analogues. Any chance you'd be able to redesign some new compounds with this in mind?

abrennan5 commented 3 years ago

Hi @edwintse, thanks for taking the time to review the molecules and potential routes, we really appreciate your input here. We can absolutely go from a different starting point. I'll remove both fluorines from that ring and see where we end up - most of the suggestions will design something back in at those positions, but the overall complexity should be lower.

I'll also look at building a virtual library using the published route, this sort of forward synthesis is something we're introducing more and more often on our projects. We can also take into account cost when doing this (very much a beta version of this feature at the minute and it is highly dependent on where the building block comes from).

I'll get something back to you later this week.

abrennan5 commented 3 years ago

Hi @edwintse,

I ran the modelling again and selected some starting points which should (hopefully) be somewhat easier/cheaper to access. The two most promising analogues without any of the complex/expensive SMs discussed previously were:

I applied our design algorithms but I also built virtual libraries using in stock alcohols and aldehydes available from Enamine so at least some of the compounds attached should be easy to plug into the existing route. The building block price filter is still work in progress so I wasn't able to apply it here.

The modelling was the same as the previous post so the designs have been biased for improved solubility and then selected based on predicted potency. Changes to the NE portion still crop up and I suspect this is the case because the modelling views regions of steep SAR as relatively promising for finding highly potent molecules. The reason for this is that they might well be rubbish, but they could also be great if they pick up the right interactions close to the surface of the protein. There are a few anilines present which I would normally filter out, but in this case I chose to leave them in as they are also likely to be easy to make.

Happy to hear your thoughts on these, have a great weekend! Alfie

new_starting_points.zip

edwintse commented 3 years ago

Hi @abrennan5, thanks for the new suggestions! We might have a go at making the higher scoring ones like the cyclic urea (0.07). Interesting that the 0.14 compound is predicted to be active. We've made the same compound but with the NH2CH2 at the 3-position and it ended up being inactive.

abrennan5 commented 3 years ago

@edwintse Glad to hear a few of these might make a target list! I initially couldn't find the molecule you mention so I've had a look and realised that there were a small number (about 25) that had a potency value in the Dundee column but not in the collated 'PfaI EC50 uMol (Mean)' column we'd been using.

I've updated the modelling and the indole predictions are broadly similar with slightly (2-fold) lower success estimates. The drop off is broadly consistent with the other suggestions, with the exception of the 0.14 compound, which is now 0.03 and probably better reflects your understanding of the SAR. No other starting points jump up the list, most of the missing data was for inactive molecules.

Hope that helps clear things up!

edwintse commented 3 years ago

Just to update on where we're at with these compounds:

Evariste progress

abrennan5 commented 3 years ago

That's a great effort, thanks @edwintse! The monobrominated free alcohol looks like a really useful intermediate if the chemistry works on a reasonable scale.

Interesting that nothing happens with the oxidative cyclisation. Do you see any sort of pattern with more/less reactive substrates? Seems like the sort of thing that might give a pretty straight line on a Hammett plot.

edwintse commented 3 years ago

Yes, it would be super useful (hopefully the reaction is a bit cleaner with the recrystallised NBS).

Both hydrazones were essentially insoluble in CH2Cl2 so that could be the main factor (though I would've expected at least a trace amount of the cyclised product with some heating which I didn't see).

abrennan5 commented 3 years ago

Ahh, that makes sense. Hope the scale up is going well!

edwintse commented 3 years ago

Another update on where we're at:

Untitled Wiley-3

I'm less sure about going back and making the actual predicted target compounds. The benzimidazole compound (EGT 541-1) was actually already made by @maratsydney and it ended up being inactive (>25 uM). I don't expect a significant difference in potency between the two ether side-chains.

With that in mind, are you able to put these 2 compounds (EGT 540-4 and EGT 541-1) through your model to see if they still get predicted as active?
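Measured potencies in this thread are quoted in uM, while the model scores are quoted as pIC50, so a small conversion helper can make the two easier to compare. A minimal sketch (the function names are illustrative and not part of any tool mentioned here; it assumes the assay value can be treated as an IC50):

```python
import math


def ic50_um_to_pic50(ic50_um: float) -> float:
    """Convert an IC50 in micromolar to pIC50 (-log10 of the molar IC50)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    # 1 uM = 1e-6 M, so -log10(ic50_um * 1e-6) = 6 - log10(ic50_um)
    return 6.0 - math.log10(ic50_um)


def pic50_to_ic50_um(pic50: float) -> float:
    """Convert a pIC50 back to an IC50 in micromolar."""
    return 10.0 ** (6.0 - pic50)


# The ">25 uM" inactivity cut-off and the pIC50 = 8 target, each in the other unit
print(round(ic50_um_to_pic50(25.0), 2))
print(pic50_to_ic50_um(8.0))
```

On this scale a compound at 1 uM sits at pIC50 6, and the pIC50 > 7 and pIC50 > 8 endpoints correspond to potencies better than 100 nM and 10 nM respectively.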

abrennan5 commented 3 years ago

Hi Ed,

That's great work, thanks for grinding through the chemistry! As you suggest, the model doesn't expect great things of EGT 541-1: it predicts a pIC50 of 5 ± 0.6 at roughly a 70% CI, which is pretty much the bottom end of the available data.

It predicts slightly better things for EGT 540-4, 5.6 ± 0.5, which doesn't sound like a huge improvement but it's certainly a step up on the inactive prediction.

I'll re-run the modelling tomorrow using a virtual library constructed based on the chemistry above and update the top suggestions for that ether substituent.
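For readers wondering how a predicted pIC50 with an error bar relates to a probability of clearing a potency threshold, one simple reading is to treat the prediction as a Gaussian. This is an assumption made for illustration and not a description of how Frobenius works; it also assumes the quoted half-width (0.6 or 0.5) is roughly one standard deviation, since a ~70% interval is close to plus or minus one sigma. The function name is hypothetical.

```python
import math


def prob_above(mu: float, sigma: float, threshold: float) -> float:
    """P(X > threshold) for X ~ Normal(mu, sigma), via the complementary error function."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (threshold - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


# Predictions quoted above: EGT 541-1 (5, error 0.6) and EGT 540-4 (5.6, error 0.5)
for name, mu, sigma in [("EGT 541-1", 5.0, 0.6), ("EGT 540-4", 5.6, 0.5)]:
    print(name, f"P(pIC50 > 7) = {prob_above(mu, sigma, 7.0):.4f}")
```

Under these assumptions both compounds have well under a 1% chance of reaching pIC50 > 7, with EGT 540-4 the more likely of the two, which matches the qualitative reading in the comment.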

mattodd commented 3 years ago

Yes, at this stage we really want to find any compounds with the same black structure as 540-4 or 541-1 but with variations in orange - i.e. meta- and para-subst rings - where that subunit is on a commercially available boronic acid. i.e. what can we access using these Suzuki couplings.

abrennan5 commented 3 years ago

Hi both, we've built a virtual library based on a Suzuki coupling of the above bromide with all Enamine in-stock boronic acids, then scored the compounds with our models. I've attached the top 100 (sorted by probability of having pIC50 > 7 and logD 1 - 4). The pdf/cdx are the top 10.

You can sort for predicted potency (potency_mu), or only the probability of achieving pIC50 > 7 (potency_prob_success) if you were more interested in that than the logD.

Let me know if there's anything else we can do to help, happy to provide the whole library or score for solubility/clearance as well if that would be of interest

enamine_boronics_virtual_lib.csv enamine_boronics_virtual_lib_10.pdf enamine_boronics_virtual_lib_10.zip
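Re-sorting the attached library by either column can be done with a few lines of standard-library Python. This is only a sketch: `potency_mu` and `potency_prob_success` are the column names given above, but the `compound` column and every value in the example rows are made up, and the real file may be laid out differently.

```python
import csv
import io

# Hypothetical rows standing in for enamine_boronics_virtual_lib.csv.
# Only potency_mu and potency_prob_success are names taken from the thread.
EXAMPLE_CSV = """compound,potency_mu,potency_prob_success
cpd_a,6.4,0.12
cpd_b,6.9,0.09
cpd_c,6.1,0.20
"""


def sort_library(csv_text, column, top_n=None):
    """Return rows sorted by a numeric column, highest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row[column]), reverse=True)
    return rows if top_n is None else rows[:top_n]


for row in sort_library(EXAMPLE_CSV, "potency_mu"):
    print(row["compound"], row["potency_mu"], row["potency_prob_success"])
```

To work on the real file, read it with `open(path, newline="")` and pass the text (or adapt the function to take a file object). Note that the two columns can rank compounds differently: a high mean prediction with a tight error bar may have a lower chance of clearing the threshold than a lower mean with a wide one.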

mattodd commented 3 years ago

That's really great. @edwintse what do you think? We absolutely are interested in potency, but also logD. We need, what, 3-5 compound suggestions with a sweet spot combination of the two. That they can (likely) be made from the same core is a real bonus here.

edwintse commented 3 years ago

The top 10 look good. Some of the boronic esters/acids are a little pricy but about half are pretty cheap. Will have a think which to go for

edwintse commented 3 years ago

@mattodd Ok, so I've had a look through the top 10 and summarised the details below. I also put the compounds through Ersilia's prediction app as a cross-check and added the probabilities of being active below each compound. The prices of the boronic acid/ester from Enamine and Fluorochem are listed as well. I've deprioritised the compounds in red as we're more interested in 3,4-substituted phenyl rings. The top middle compound in green seems like an obvious one to go for as it's predicted active by both models and the reagent is cheap. The pyrrolopyridine in the top left could also be one to go for? Thoughts on the others?

enamine_boronics_virtual_lib_10

mattodd commented 3 years ago

Nice! OK, so if we number 1-10 from top left then I'd go for

1 - yes
2 - definitely
3 - maybe, expensive
4 - no, not 3,4 disubst
5 - yes
6 - no major advantage over 1
7 - no major advantage over 2
8 - expensive but nice logD
9 - no, not 3,4 disubst
10 - expensive but quite nice

So 1, 2, 5 and maybe one more? Any thoughts on these from @abrennan5 @GemmaTuron @miquelduranfrigola @drc007 @jonjoncardoso ?

abrennan5 commented 3 years ago

Thanks for annotating @edwintse! Apologies for not adding compound codes, next update to the platform we're going to do this automatically.

Mat, I agree with your summary - 1, 2, and 5 look good. The one thing I would point out is that I think the calculated logD has struggled with either 2 or 7, as I would expect the CF2H to be 0.5 - 1 unit lower than the corresponding CF3. Benzylic C-F bonds are weird though, so it's hard to say without measuring them both. I'd be tempted to include 7 on that basis, but it might also be worth investigating a more distinct analogue (i.e. 8) if you were only going to purchase one expensive precursor.

abrennan5 commented 3 years ago

Discussed with the team and we're happy to purchase the boronic acids for 1, 2, 3, 5, and 8 if that looks like a good list to you both. Let me know if a call would help sort out the details of where to get them delivered etc.

jonjoncardoso commented 3 years ago

Interesting stuff! I can run those over our modSAR model to see how and if the predicted pIC50 matches those!

mattodd commented 3 years ago

That's extraordinarily generous. I'll discuss with Ed (typical yields, hence how much we might need) and reply ASAP.

jonjoncardoso commented 3 years ago

Here are the predictions made by our algorithm for this set of molecules: 2021_07_21_modSAR_predictions.zip

Predictions didn't seem to correlate much but our model also endorses the green compound (marked original_id=1600) on this visualisation:

image

I have posted the step-by-step on how to reproduce these predictions on this Jupyter notebook.

mattodd commented 3 years ago

Most interesting, thanks @jonjoncardoso. Would you (or @edwintse ) be able to correlate the ID numbers here with the structures of the most interesting? i.e. 921 is the CH3 equivalent of our top scorer? And 1958 and 1501 are already in the set we're considering? What about those four in the quadrant above 5.7 Evariste and above 6.1 modSAR? Do they share anything?

While I think of it @edwintse let's be sure that none of these predicted compounds have already been made!

Re buying the boronics/boronates: Thank you again @abrennan5 for your very kind offer. I think 100 mg is likely to be enough of each, given how the Suzukis are going (above) and that it might take two attempts. @edwintse would you be able to put together a shopping list for at least 100 mg of the reagents for 1, 2, 3, 5 and 8 that maximises delivery speed while minimising any associated delivery costs?

abrennan5 commented 3 years ago

In terms of an optimised price list, I don't know if you've used MCule before @edwintse but they provide a service that does this (if it can find the building blocks) https://mcule.com/search/

Given how difficult this dataset is to predict on, I'd say that's a reasonable correlation between the two models. The range of ours is slightly lower (probably due to differences in data cleaning). 921 is the CF3 -> tBu analogue of 1600 and is lower down our list due to the higher logD prediction. 1958 is compound 10 in the above pdf (cyclopropyl ether) and 1501 is compound 7 (CF2H). Seems that both models like 4-Cl, 3-small lipophilic group!
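Whether two models "correlate" can be put on a number by computing a correlation coefficient over the compounds both have scored. The sketch below uses a plain Pearson coefficient; the prediction values are invented for illustration and are not taken from either model's output.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up predicted pIC50 values for the same five compounds from two models
model_a = [5.2, 5.7, 5.9, 5.4, 6.0]
model_b = [5.9, 6.1, 6.4, 5.8, 6.2]
print(f"Pearson r = {pearson(model_a, model_b):.2f}")
```

A rank-based coefficient (Spearman) may be the better choice if only the ordering of compounds matters, since the two models can sit on offset ranges, as noted above.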

miquelduranfrigola commented 3 years ago

Hi @mattodd @edwintse @abrennan5 @drc007 @jonjoncardoso

Really great results! We have looked into the 99 candidates that @abrennan5 shared and scored them according to 2 of our metrics (IC50Pred and DeepActivity). In this file you can find our selection (intersection of the best quartile of the two metrics = 17 candidates):

evariste_eosi_filtered.csv
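The "intersection of the best quartile of the two metrics" selection can be sketched as below. This is an illustration under stated assumptions, not Ersilia's code: it assumes higher scores are better for both metrics, and the compound names and scores are invented.

```python
import math


def top_quartile(scores):
    """Names in the best 25% of a {name: score} mapping (higher = better)."""
    n_keep = max(1, math.ceil(len(scores) / 4))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_keep])


def quartile_intersection(metric_a, metric_b):
    """Candidates that fall in the best quartile of both metrics."""
    return top_quartile(metric_a) & top_quartile(metric_b)


# Hypothetical scores for eight candidates under two metrics
ic50pred = {"c1": 0.90, "c2": 0.80, "c3": 0.40, "c4": 0.30,
            "c5": 0.20, "c6": 0.10, "c7": 0.05, "c8": 0.01}
deepactivity = {"c1": 0.70, "c2": 0.20, "c3": 0.95, "c4": 0.10,
                "c5": 0.30, "c6": 0.15, "c7": 0.25, "c8": 0.05}

print(sorted(quartile_intersection(ic50pred, deepactivity)))
```

Because the filter is an intersection, the number of survivors is at most a quarter of the list and shrinks as the two metrics disagree.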

In brief:

About the metrics:

Below is a quick visualisation of the 17 candidates (Rank = order in the original list):

evariste_eosi_selection

edwintse commented 3 years ago

@abrennan5 I've had a look at the prices on MCule and have listed them below. The prices below the short line are from Enamine or Fluorochem. I've had a look at a few of the more interesting ones from Ersilia's latest post as well. Based on the quote I got from MCule, the shipping fees are a bit too expensive... I think the best bet would be to get everything from Enamine to make the most of the €60 shipping fee (everything is also in stock with them). How does this look to you?

Untitled Wiley-12

abrennan5 commented 3 years ago

@edwintse Thanks very much for doing all the research, shame about the MCule shipping fees! Once we've got this first set tested we can have another look at the modelling and design/rescore analogues based on that. I'll have a quick check, but I reckon getting 4 from Enamine and the CF3 indole from Fluorochem is probably the most cost-effective option.

What address should we use when placing the order?

abrennan5 commented 3 years ago

Starting materials should be with you in the next couple of weeks! Enamine very kindly waived the delivery fee as it was for an anti-malarial project

edwintse commented 3 years ago

@abrennan5 Amazing!! Thanks so much for organising this!

mattodd commented 3 years ago

Yeah, that's awesome @abrennan5, thank you, really exciting.

edwintse commented 2 years ago

@abrennan5 I've finally finished making the 5 compounds! I was having a bit of trouble with the 3-CF3,4-Cl coupling and ended up isolating the dechlorination product as well (EGT 552-1). We'll try and organise a time to get these compounds tested soon.

Evariste 2

abrennan5 commented 2 years ago

Thanks Edwin! That's absolutely great, love a bonus compound. Really appreciate all your work and looking forward to seeing the results.

edwintse commented 2 years ago

@abrennan5 @mattodd Potency results just back in. EGT 92-1 is a positive control that we also included in the assay. Three compounds (92, 553, 554) need repeating as their initial repeats differed by more than 3-fold. Potencies are in uM.

Untitled Wiley-8
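The "more than 3-fold between repeats" criterion is easy to apply programmatically. A minimal sketch, with hypothetical function names and made-up potency values (none of these numbers are from the assay):

```python
def fold_difference(values_um):
    """Ratio of the largest to the smallest repeat (potencies in uM)."""
    if len(values_um) < 2:
        raise ValueError("need at least two repeats")
    if min(values_um) <= 0:
        raise ValueError("potencies must be positive")
    return max(values_um) / min(values_um)


def needs_repeat(values_um, max_fold=3.0):
    """True when the repeats disagree by more than max_fold."""
    return fold_difference(values_um) > max_fold


# Made-up duplicate measurements for two compounds
print(needs_repeat([0.5, 2.0]))
print(needs_repeat([1.0, 2.5]))
```

Using a ratio and not an absolute difference keeps the check meaningful across the whole potency range, since dose-response errors tend to scale with the value.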

abrennan5 commented 2 years ago

Hi @edwintse, thanks for the update and for organising the testing! Some interesting if not super exciting results. I think it's fair to say that the SAR really does not track in this series. I'll update the modelling and get back to you on Monday.

Thanks again!

abrennan5 commented 2 years ago

@mattodd @edwintse Hope you're both well, here are some of my thoughts on the above data:

Really appreciate all your help on this project, it's great to have some data we can discuss openly.

mattodd commented 2 years ago

That's great @abrennan5. Interesting to get your thoughts here. I'd love to engage in another round and will check in with @edwintse. Are you able to post structures or SMILES for the new ones so that we can consider cost and synthetic difficulty?

abrennan5 commented 2 years ago

@mattodd @edwintse

The structures here are the 10 things we would make next and the csv file contains the full library scored using the updated models. The values below the structures are: predicted pIC50, the error bar around that prediction, and the conditioned probability of achieving pIC50 > 7.

By conditioned probability I mean that, having selected the 'best' molecule, we assume that it fails to hit the endpoints we want, and rebuild the model before selecting the second one. This helps remove very similar molecules that have clustered at the top of the list. The full library contains the raw probability of success scores so you can look further down the list if you want.
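The pick-assume-failure-rescore loop can be illustrated with a toy version. To be clear about what is swapped in: the real approach rebuilds the model after each assumed failure, whereas this sketch stands in for that step with a simple similarity penalty (candidates similar to a "failed" pick have their probability scaled down). The names, probabilities and similarity values are all hypothetical.

```python
def pick_pessimistically(raw_prob, similarity, n_picks):
    """Greedy selection where each pick is assumed to fail before the next is chosen.

    raw_prob:   {name: raw probability of success}
    similarity: {frozenset({a, b}): similarity in [0, 1]}; missing pairs count as 0
    Returns a list of (name, conditioned probability) in pick order.
    """
    current = dict(raw_prob)
    picks = []
    while current and len(picks) < n_picks:
        best = max(current, key=current.get)
        picks.append((best, current.pop(best)))
        # Stand-in for refitting the model: discount close analogues of the failed pick
        for name in current:
            sim = similarity.get(frozenset((best, name)), 0.0)
            current[name] *= 1.0 - sim
    return picks


raw = {"A": 0.10, "B": 0.09, "C": 0.05}
sim = {frozenset(("A", "B")): 0.9}  # A and B are near-identical analogues

for name, prob in pick_pessimistically(raw, sim, n_picks=3):
    print(name, round(prob, 3))
```

In this toy case B has the second-highest raw score but is picked last, because it is a close analogue of A and would probably fail for the same reason. That is the clustering effect described above.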

Our predicted sigma is much higher than we normally see on datasets with 400ish compounds. This isn't necessarily a bad thing as it hopefully reflects the relatively steep SAR you seem to encounter in this part of the pocket. It also takes into account that the SAR between the NE and NW groups is not at all additive.

If we were to look into other chemistry, it would be to take a few of the more potent groups we have already found or those that show up in the next round and try them with different ethers. Perhaps those that include the benzylic substitution found in some of the more potent compounds.

Let me know your thoughts!

top10_conditioned boronic_library_rescore.csv.zip

Minor edit: Added compound numbers. Also, we didn't filter out the aryl chlorides in the list below as they are semi-compatible with this chemistry but obviously it might lead to some useful but potentially annoying side products.

edwintse commented 2 years ago

Hi @abrennan5, thanks for generating some new compounds! I've had a look for the commercial availability of the corresponding boronic esters and they are shown below.

Untitled Wiley-5

I was looking at the csv file and the list doesn't quite match the figure. The compound numbered 620 (2-OMe; 3-OH) in the figure isn't in the list, and the compound numbered 250 in the list (3-Me; 5-NO2) isn't in the figure?

The pyrrole might be interesting (closest we have already are imidazole and pyrazole, both were inactive). The triazolopyridine might also be interesting. The 3,5-Cl;4-OCHF2 one might be good too.

abrennan5 commented 2 years ago

Hi @edwintse,

Sorry that's my mistake! The list is supposed to be the full library but I've accidentally taken a subset instead. The full library is now attached to this message, although the compounds are numbered differently. See if you're interested in any of the others in there.

You can also sort the compounds by "potency_mu" which is our mean prediction. You'll see that the top scorers here are more similar to the last round - number 10 and number 7 in the original picture (your post, July 21st) are still predicted to be about 1 uM but unlikely to be substantially more active than that. It might be interesting to include a couple of the higher confidence/exploitation compounds and a few of the riskier bets.

Maybe a list that looks something like 7 + 10 from the first list, as well as 1, 2, 5, and 8 (all in stock and fairly cheap) from the second round of designs. Does that sound reasonable to you? If we're lucky, 7 and 10 will also provide the de-chlorinated analogues which are predicted to be modestly potent as well.

boronic_library_rescore.csv.zip

edwintse commented 2 years ago

Hi @abrennan5, here's the list of the 6 compounds and their prices and codes from Enamine. I'm making more of the brominated core so it'll be ready to Suzuki when these arrive. Thanks again!

Untitled Wiley-5

edwintse commented 2 years ago

@mattodd I've uploaded the data curves from Mark for the latest compounds we got back. MMV1903416 and MMV1903417 were run 4 times each. The last repeats for those were inconsistent so Mark said to take the values that are highlighted in red (2 each which would be averaged).

Ed Report.xlsx

edwintse commented 2 years ago

Hi @abrennan5, sorry for the delay. I've just finished purifying all the compounds and just need to NMR check them but hopefully they're all ok. I managed to get a number of bonus compounds which were the result of dechlorination or double Suzuki coupling.

Evariste Set 3

@mattodd In total there should be 12 compounds for testing (11 of these + 1 positive control)

abrennan5 commented 2 years ago

Thanks @edwintse, this is brilliant! Great job getting everything separated - if those double Suzuki bonus compounds are any good we're going to have a job getting the logP down.

I'll run all the compounds through the model again, although the double suzukis will fall outside the domain of applicability I'd imagine. Looking forward to seeing the results!

MFernflower commented 2 years ago

@edwintse @abrennan5 Now I'm curious as to what a naked or mono-chlorinated biphenyl would do

CLogP goes straight into the skip but curious nonetheless

Para-biphenyl-boronic acid is not that costly:

image

edwintse commented 2 years ago

@mattodd @abrennan5 @miquelduranfrigola @GemmaTuron Results are in!! This round looks particularly promising. Quite a few of them are <1 uM.

Evariste Set 3