Open edwintse opened 5 years ago
Interesting, our team started reviewing the previous runs, datasets etc. We hope to have some promising models.
Very interested in participating, and iterating on the last competition. Is there a formal definition for the core of Series 4? Curious to know where we can enumerate and where we can't.
@spadavec I think it would be best to stick to the triazolopyrazine core with substituents in the northwest and northeast positions (e.g. MMV897698 as a simple example) considering the better potencies that we typically get with those.
Was there ever a full formal writeup for the first round? I see at #538 that it was delayed due to data embargoes and such, hopefully those have passed.
@jsilter There hasn't been yet, but I am in the process of writing it up on the wiki in this repo so check back there soon. At the same time I am also drafting up this info for the paper (I'll create a new issue about this shortly).
Not clear exactly what we should leave in the submitted models folder. Would a prediction for missing values of each compound already in the sheet suffice? Or does it have to be a binary capable of taking a new compound SMILES and outputting the predicted activity?
By working openly, does this mean I can just place my data etc. in a repository e.g. https://github.com/BenedictIrwin/OSM and update that as I make progress?
@BenedictIrwin Hi, I have added some details about what will be required for submission to the original post above, but it is more the latter. In short, all entrants will be provided with the molecular identifiers (e.g. SMILES) for a set of existing Series 4 compounds (where we have not revealed the experimental potencies) and you will be required to predict the potencies for these compounds.
Yes, working openly means that at any stage, if someone wants to see the progress you've made, they can easily look at your work on an ELN or on GitHub. Feel free to place your data/working in a repository (either this one or your own) and update/provide links as you make progress.
Hi, I try to get my head around the provided activity data:
Re the data in Sheet 'Ion Regulation Data for OSM Competition':
Willem
@wvanhoorn I'll try to answer these as best as I can.
1) The compounds in the "Ion Regulation Data for OSM Competition" sheet have associated PfATP4 data (i.e. whether or not they have ion regulation activity). However, this list contains non-OSM compounds as well. The "Master Chemical List" is the complete list of OSM compounds from Series 1, 3 and 4 with in vitro potencies (n.b. any compound from Series 1 is also known to be inactive against PfATP4). Round 1 of the competition was more focussed on the prediction of active compounds against PfATP4 (not limited to Series 4). For this round, we are looking for predictions for the activities of Series 4 compounds specifically, so you can use the Master Chemical List to train your models.

2) Yes, those entries can be ignored.

3) Ion regulation activity indicates whether or not a compound is active in the PfATP4 assay (1 means the compound shows ion regulation activity, 0 means it doesn't). In the case of Series 4, we see correlation between PfATP4 activity and in vitro potency, so any OSM compound in the list should be relatively potent. Any OSM compound without a number in this column can be found in the Master List and can be used for training the predictions.

4) The compounds in these rows are from the MMV Malaria Box and Pathogen Box and haven't been evaluated against the parasite. Considering that these compounds are all structurally different from Series 4 compounds, I'm not sure how helpful they will be for developing a model to predict the activities of Series 4 compounds specifically, so perhaps it's better to ignore them?

5) No difference. The data in rows 836-959 were just added more recently and haven't been sorted.

6) Generally, our compounds are classified as active if they are <1 uM, weakly active between 1-2.5 uM, and inactive >2.5 uM.

7) Yes, the final test set will be provided at a later date. It's understandable that the model itself won't be able to be shared. We are not focused as much on the actual method, but the accuracy of the prediction.
Each submission will need to provide the predicted potencies for this test set. By comparing these predictions with the experimental data for the test set, we can determine which models perform the best. The best model(s) will then be asked to generate new active compounds that will then be synthesised and tested.
Let me know if you have any further questions
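As a rough illustration of the thresholds in answer 6 above, here is a minimal Python sketch. The function name `classify_potency` is hypothetical, and the handling of the exact boundary values (1 uM and 2.5 uM) is an assumption, since the thread does not say which class they belong to.

```python
def classify_potency(ic50_um: float) -> str:
    """Map an in vitro potency (uM) to the activity classes quoted above."""
    if ic50_um < 1.0:
        return "active"
    if ic50_um <= 2.5:
        # Assumption: the boundary values 1 uM and 2.5 uM count as weakly active.
        return "weakly active"
    return "inactive"


for value in (0.0255, 1.8, 8.586):
    print(value, classify_potency(value))
```

Anyone who prefers a different convention at the boundaries only needs to change the two comparisons.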
Hi,
I'm in the process of creating a dataset containing two fields, "SMILES" and "Active/Inactive" status. If I ignore all records where "Smiles" is missing or "Ion Regulation Activity" is neither 0 nor 1, I get 576 "clean" compounds (510 inactive and 66 active).
Taking into consideration @edwintse comments, may I apply the following rule to records where "Ion Regulation Activity" is missing but "Potency vs Parasite (uMol)" is available?
Rule:

    if "Potency vs Parasite (uMol)" < 1:
        "Ion Regulation Activity" = 1
    else:
        "Ion Regulation Activity" = 0

Nick
@mmgalushka This rule could only be applied to OSM Series 4 compounds since we know there is correlation between ion regulation activity and in vitro potency. I don't think this could be accurately applied to the other compounds from the Ion Regulation sheet since there are lots of different structural classes of compounds for which we don't know if there is any correlation.
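A minimal sketch of how that restriction might look in code, assuming each record carries a series label. The function name `infer_ion_regulation` and the `series` argument are hypothetical and not taken from the spreadsheets. Compounds outside Series 4 get no inferred label at all.

```python
from typing import Optional


def infer_ion_regulation(potency_um: float, series: str) -> Optional[int]:
    """Infer ion regulation activity from potency, for Series 4 compounds only.

    Returns 1 (active) or 0 (inactive) for Series 4, and None for any other
    series, because the potency/ion-regulation correlation is only known to
    hold for Series 4.
    """
    if series != "4":
        return None
    return 1 if potency_um < 1.0 else 0


print(infer_ion_regulation(0.2, "4"))
print(infer_ion_regulation(8.586, "4"))
print(infer_ion_regulation(0.0255, "Malaria Box"))
```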
Will the activity of the test molecules be measured as their activity in the ion regulation assay, the Pfal EC50 assay, or both? Do we have known tolerances or errors for either assay?
@spadavec The test compounds will have been measured in the Pfal IC50 assay only. I'm actually not sure about the specifics of the assay tolerances/errors, but I know that the Pfal IC50 assay uses Mefloquine as the standard control with an acceptable pIC50 range of 7.5-7 if that helps.
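For anyone unfamiliar with pIC50: it is the negative base-10 logarithm of the IC50 expressed in mol/L, so for a value in uM it works out to 6 - log10(IC50). The sketch below converts in both directions; the function names are hypothetical. Under this definition, the Mefloquine control range of pIC50 7 to 7.5 corresponds to roughly 0.1 uM down to about 0.032 uM.

```python
import math


def pic50_from_um(ic50_um: float) -> float:
    """Convert an IC50 in uM to pIC50 (-log10 of the molar IC50)."""
    return 6.0 - math.log10(ic50_um)


def um_from_pic50(pic50: float) -> float:
    """Convert a pIC50 back to an IC50 in uM."""
    return 10.0 ** (6.0 - pic50)


for p in (7.0, 7.5):
    print(f"pIC50 {p} -> {um_from_pic50(p):.4f} uM")
```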
I'm not from a biochemistry background and am a little bit lost in the domain-specific terminology. I'm posting the script which I'm using to clean the dataset.
I exported the "Ion Regulation Data for OSM Competition" file in TSV format and applied the following script:

    # Column indices in the exported TSV (0-based).
    Ion_Regulation_Activity = 2
    Smiles = 4

    with open('datasets/Ion Regulation Data for OSM Competition.csv', mode='w') as w:
        with open('datasets/Ion Regulation Data for OSM Competition.tsv') as r:
            for record in r:
                fields = record.split('\t')
                if len(fields) <= Smiles:
                    continue  # skip short or blank rows
                activity_value = fields[Ion_Regulation_Activity].strip()
                smiles_value = fields[Smiles].strip()
                # Keep only rows with a SMILES string and a 0/1 activity flag.
                if len(smiles_value) > 0 and activity_value in ['0', '1']:
                    w.write(smiles_value + ',' + activity_value + '\n')

The output is a CSV file with two columns, "SMILES" and "ACTIVITY". I got 851 compounds in total, of which 66 are active.
Am I on the right track? Do I need to consider something else?
I am still confused and it seems that I am not the only one. This runs the risk of becoming a data interpretation/cleaning instead of data modeling competition. Could we therefore settle first on a single file with all relevant data without any irrelevant data (for instance only series 4 compounds if the aim is to only predict series 4 compounds) so that we all depart from the same starting point? And provide a specific description what needs to be modeled. I initially thought the aim was to model 'Potency vs Parasite (uMol)', now it seems it should be 'Ion Regulation Activity' but I am still not sure.
I 100% agree with @wvanhoorn. It would be beneficial for all teams to have a single file with only the samples relevant to this competition, containing the input feature(s) and a target feature.
Hi all, Apologies if there has been any confusion. To clarify, the aim of this competition is to predict the Pfal IC50 potencies of Series 4 compounds that are active against PfATP4. This is slightly different to the aim of Round 1 where the aim was more broadly to predict any active compounds against PfATP4.
Both spreadsheets are supposed to be complementary. The idea behind the two is as follows:
Ion regulation spreadsheet: We highly suspect PfATP4 to be the target for the Series 4 compounds (potent Series 4 compounds show activity in the ion regulation assay; this is indicated by a 1 in the ion regulation activity column) but the structure of the target protein has not been solved. This means that we don't know what key interactions our compounds are making with the target. The ion regulation spreadsheet contains all known compounds (from many different chemotypes) that have been experimentally evaluated against PfATP4. All of this structural information (along with the provided homology model and relevant mutations) can be used to aid in discerning any key interactions that might be taking place, and therefore be used to predict new potent Series 4 compounds that exploit these interactions.
Master Chemical List: This list contains all OSM compounds from Series 1-4 with in vitro potencies, with the additional knowledge that Series 1 does not target PfATP4 (i.e. 0 for ion regulation activity). As we are specifically looking for predictions on compounds with a triazolopyrazine core, the changes in structural features between Series 4 compounds and their associated in vitro potencies can be used to develop and refine your models.
n.b. The models will be evaluated for their ability to predict the potencies of a test set that consists of Series 4 compounds only.
With that in mind, you are free to use as much or as little of the provided data that you think will best achieve this goal. I believe that by providing all the data, all aspects can be considered when developing the models.
I would like to check the following statements regarding my model specifically.
My model takes only one feature (SMILES) as an input. According to your comments, am I right to say that we are trying to predict the potencies, which are defined in the "Ion Regulation Data for OSM Competition" file under the field "Potency vs Parasite (uMol)"? If this is true, our model should predict real-valued outputs.
To summarize the above, we need to build a regression model which predicts "Potency vs Parasite (uMol)" from the compound "SMILES". Is that the right conclusion?
PS: I understand that some potency values can be sourced from "Master Chemical List" file, but at this stage, I just want to concentrate on "Ion Regulation Data for OSM Competition" file.
@mmgalushka Yes, that's correct. Totally fine to just concentrate on the one file at this stage.
Thanks a lot @edwintse!
I used the following Python script to extract records:

    # Column indices in the exported TSV (0-based).
    Potency_vs_Parasite = 1
    Smiles = 4

    with open('datasets/Ion Regulation Data for OSM Competition - Malaria Molecules.csv', mode='w') as w:
        with open('datasets/Ion Regulation Data for OSM Competition - Malaria Molecules.tsv') as r:
            for record in r:
                fields = record.split('\t')
                if len(fields) <= Smiles:
                    continue  # skip short or blank rows
                potency_value = fields[Potency_vs_Parasite].strip()
                smiles_value = fields[Smiles].strip()
                if len(smiles_value) > 0 and len(potency_value) > 0:
                    try:
                        float(potency_value)  # make sure this is a real value
                    except ValueError:
                        continue  # e.g. values with a '>' qualifier
                    w.write(smiles_value + ',' + potency_value + '\n')

Got the following file;
There are many records where the potency is exactly 10 or 50, whereas the majority of records are between 0 and 8.0. Are these values of 10 and 50 correct?
Compounds with potency values of 10/50 OR have a potency qualifier of '>' can be treated as inactive. It means that the IC50 values were greater than the max concentration that was tested in the assay.
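One way this guidance could be applied when reading the spreadsheet is sketched below. It assumes the qualifier appears as a leading '>' in the same text cell as the number, which may not match the actual column layout, and the function name `parse_potency` is hypothetical.

```python
from typing import Tuple


def parse_potency(raw: str) -> Tuple[float, bool]:
    """Return (value_um, censored) for a potency cell such as '0.453' or '>10'.

    censored is True when the value is only a lower bound: either the cell
    carries a '>' qualifier, or the value is exactly 10 or 50, which the
    thread says mark the assay's maximum tested concentration.
    """
    text = raw.strip()
    censored = text.startswith(">")
    value = float(text.lstrip(">").strip())
    if value in (10.0, 50.0):
        censored = True
    return value, censored


for cell in ("0.453", ">25", "10", "50"):
    print(cell, parse_potency(cell))
```

Censored rows can then be labelled inactive for classification, or handled separately in a regression model rather than treated as exact measurements.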
@edwintse Thanks for all of the clarification! Just as a follow up, if you consider only S4 compounds that have enough data to contribute to a regression model (e.g. have potency and SMILES strings) there are only ~130 compounds, which is definitely on the low side for an accurate model (typically this number needs to be closer to ~500 for pIC50 values to have an error rate of ~1, which is getting close to on-par with errors in wet measurements of IC50/EC50 values). If we expand the criterion for acceptance to be over/under 1uM (e.g. just a classification job), the accuracy and results should be much better across the board--has that been considered at all for this?
@spadavec Are these ~130 S4 compounds from the ion regulation spreadsheet that have both potency vs parasite and ion regulation activity? or just potency vs parasite and not ion regulation activity? There should be close to 350 S4 compounds that have potency vs parasite data (all on the Master Chemical List but you have to filter out intermediate structures). Still, this is lower than the desired number of compounds.
It sounds reasonable to me to expand the criterion if that will provide better accuracy/results for the model.
In one of the previous posts, @edwintse wrote:
Generally, our compounds are classified as active if they are <1 uM, weakly active between 1-2.5 uM, and inactive >2.5 uM.
This was regarding the "Potency vs Parasite" field in the "Ion Regulation Data for OSM Competition" file.
At the same time, another quote:
Ion regulation activity indicates whether or not it is active in the PfATP4 assay (1 means the compound shows ion regulation activity, 0 means it doesn't).
This was regarding the "Ion Regulation Activity" field in the "Ion Regulation Data for OSM Competition" file.
I selected 2 examples (however there are a number of similar samples in dataset) from the "Ion Regulation Data for OSM Competition" file:
SMILES | Potency vs Parasite (uMol) | Ion Regulation Activity |
---|---|---|
CC1CN(CC(C)O1)C(=O)c2sc3ccccc3c2Cl | 0.0255 | 0 |
FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OC(C)C4=CC=C(F)C(F)=C4)N32 | 8.586 | 1 |
Question:
I'm sorry, maybe I completely misunderstand these values. I initially thought there was some relation between "Potency vs Parasite (uMol)" and "Ion Regulation Activity"...
@mmgalushka Not all compounds in the ion regulation data spreadsheet will show correlation between the two assays.
The first compound CC1CN(CC(C)O1)C(=O)c2sc3ccccc3c2Cl was part of the MMV Malaria Box (indicated by M in the Ion Regulation Test Set Column). This box contained 400 potent antimalarial compounds with many different structures that could be used for further development. Only 28 of the 400 compounds were found to have any ion regulation activity. The remaining 372 can be thought as having a different MoA to PfATP4.
A similar thing can be said for the MMV Pathogen Box compounds (indicated by P in the Ion Regulation Test Set Column). Of the 400 compounds (~120 antimalarials; the rest are for other indications), only 11 were found to have ion regulation activity.
Regarding the S4 compound FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OC(C)C4=CC=C(F)C(F)=C4)N32, there is high confidence that there is correlation between potency vs parasite and ion regulation activity, however there were 3 outlier compounds when this batch was evaluated where this relationship didn't hold true (the 3 compounds are shown here).
So essentially, for OSM S4 compounds in general, you can consider this relationship to be correct.
Thanks a lot, @edwintse for this clarification!
I am still trying to understand the relation between "Potency vs Parasite" and "Ion Regulation Activity". I collected the following stats based on the "Ion Regulation Data for OSM Competition" file:
Potency vs Parasite (uMol) | Ion Regulation "Inactive" | Ion Regulation "Active" |
---|---|---|
<= 1 | 278 | 40 |
> 1 and <= 2.5 | 124 | 17 |
> 2.5 | 27 | 2 |
PS: I only considered records where both values are available and valid: "Potency vs Parasite" and "Ion Regulation Activity".
I'm struggling to understand how, even if we can create the "perfect model" to predict "Potency vs Parasite", it can help us to predict "Ion Regulation Activity".
What I'm trying to say is that even in the interval <= 1 uM we have the challenge of classifying Ion Regulation Activity correctly...
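To put numbers on that concern, this short sketch computes the fraction of ion-regulation actives within each potency bin, using only the counts from the table above. The bin labels and the function name `active_fraction` are just for illustration.

```python
# (inactive, active) counts per potency bin, copied from the table above.
counts = {
    "<= 1 uM": (278, 40),
    "1-2.5 uM": (124, 17),
    "> 2.5 uM": (27, 2),
}


def active_fraction(inactive: int, active: int) -> float:
    """Fraction of compounds in a bin that show ion regulation activity."""
    return active / (inactive + active)


for label, (inactive, active) in counts.items():
    print(f"{label}: {active_fraction(inactive, active):.1%} ion-regulation active")
```

Across the whole sheet, the share of ion-regulation actives is low in every bin, including the most potent one, which is the mismatch being asked about here.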
Keep in mind, the predictions that we are seeking are targeted for OSM Series 4 compounds. We are not so much interested in predicting ion regulation activity for Series 4, rather, we want to predict the potency vs parasite since we can make the assumption that active Series 4 compounds against the parasite will be active in the ion regulation assay as well.
Thanks a lot @edwintse! You are absolutely right. I just forgot again that the dataset contains different experiments. I get it now!
@mmgalushka @edwintse I have taken the file earlier created by @mmgalushka, calculated the InChiKey from the Smiles, joined the OSM master list on InChiKey and kept records where series = 4. This leaves 194 compounds from Series 4 that have a potency value. Do we now finally have the training set for the competition?
https://docs.google.com/spreadsheets/d/1ReZz-_I90YYtiyEJucgj_i_6ckMQ3Rr1ocDvfPsgkOw/edit?usp=sharing
The posts below seem to contain the prediction set (assuming they are all series 4). Any more to follow? https://github.com/OpenSourceMalaria/Series4/issues/73#issue-481444589 https://github.com/OpenSourceMalaria/Series4/issues/71#issue-469138536
Yes, those compounds will be used as the prediction set. It seems unlikely that any more will be added considering they would need to be synthesised and sent for testing before the judging occurs. I'll keep you updated if that changes.
@edwintse Could you please also confirm the training set I assembled earlier is correct (with the proviso it only contains series 4 compounds, if people want to include other series the set from @mmgalushka should be used)?
Yes, that list looks good to me
Thanks, @wvanhoorn for sharing the dataset!
I found the following duplicates:
Question 1: What potency is correct for the first compound?
FC1=C(F)C=CC(C(OC)COC2=CN=CC3=NN=C(C4=CC=C(C#N)C=C4)N32)=C1 -> 0.1105 or 0.207
Question 2: Are these the same compounds?
The only differences are in upper/lower case, but that might be significant from a chemistry point of view.
OC(COc2cncc3nnc(C1CCCCC1)n23)c4ccccc4 OC(COc2cncc3nnc(c1ccccc1)n23)c4ccccc4
c1ncc2n(c1Oc1cc3c(cc1)CCCC3)c(nn2)c1ccc(cc1)OC(F)F c1ncc2n(c1Oc1cc3c(cc1)cccc3)c(nn2)c1ccc(cc1)OC(F)F
c1(ccccc1)CCCc1cncc2n1c(nn2)c1ccc(cc1)OC(F)F c1(ccccc1)CCCC1CNCc2n1c(nn2)c1ccc(cc1)OC(F)F
C1NCc2n(C1CCO)c(nn2)c1ccc(cc1)OC(F)F c1ncc2n(c1CCO)c(nn2)c1ccc(cc1)OC(F)F
@mmgalushka
Oops. Yes the duplicates were generated by me during the join, I assumed the master list was deduplicated but it's not. I will redo it a bit more carefully and share.
Re 1: a factor of two in measured IC50 or EC50 is normal. In the master list you can find the raw data and see that values easily differ a factor of two or more. The single data points in the competition list are actually averages of multiple measurements in different labs.
@edwintse: this opens another question: it seems that IC50 and Ki data have been averaged even if they are from the same lab. For instance, the first entry, OSM-A-1:
Pfal IC50 (Guy) = 3.05
Pfal (K1) IC50 (Guy) = 4.379
PfaI EC50 uMol (Mean) = 3.7145 (which is the average of the above two numbers)
In my understanding, IC50 and Ki are two different measurements (relative vs absolute) that can be derived from a single experiment. Averaging these two does not make sense, since it is averaging the same single observation interpreted differently.
And another issue: OSM-S-35 has duplicate entries for a single assay, at least that is how I interpret the semi-colons. However, in the spreadsheet this equates to a string, not a numerical value, and these are all ignored in the final average:
Pfal IC50 (GSK) = 0.036; 0.012
Pfal IC50 (Avery) = 0.026; 0.038
Pfal IC50 (Ralph) = 0.011
PfaI EC50 uMol (Mean) = 0.011 (average of the above 5 numbers = 0.0246)
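A small sketch of how semicolon-separated cells like these could be split into numbers before averaging. The helper name `parse_measurements` is hypothetical, and the three cell strings are the OSM-S-35 values quoted above. Averaging all five values gives the 0.0246 figure mentioned, rather than the 0.011 stored in the sheet.

```python
from statistics import mean
from typing import List


def parse_measurements(cell: str) -> List[float]:
    """Split a cell such as '0.036; 0.012' into its individual numeric values."""
    return [float(part) for part in cell.split(";") if part.strip()]


# OSM-S-35 cells as quoted above (GSK, Avery, Ralph).
cells = ["0.036; 0.012", "0.026; 0.038", "0.011"]
values = [v for cell in cells for v in parse_measurements(cell)]

print(len(values), "measurements, mean =", round(mean(values), 4))
```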
Looks like there is some more data cleaning to do (which normally is 90% of the effort of building a model so not too bad so far).
Re 2: these structures are different! Lower case represents aromatic atoms, upper case aliphatic.
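As a quick sanity check for this kind of near-duplicate, the sketch below flags SMILES strings that differ only by letter case. Lowercasing is a purely textual trick to find candidates for manual review and is not a chemical comparison; the function name `case_only_collisions` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def case_only_collisions(smiles_list: List[str]) -> List[List[str]]:
    """Group distinct SMILES strings that become identical when lowercased."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for smi in smiles_list:
        key = smi.lower()
        if smi not in groups[key]:
            groups[key].append(smi)
    # Only groups with more than one distinct spelling are suspicious.
    return [group for group in groups.values() if len(group) > 1]


examples = [
    "OC(COc2cncc3nnc(C1CCCCC1)n23)c4ccccc4",  # cyclohexyl (aliphatic ring)
    "OC(COc2cncc3nnc(c1ccccc1)n23)c4ccccc4",  # phenyl (aromatic ring)
    "c1ncc2n(c1CCO)c(nn2)c1ccc(cc1)OC(F)F",
]
for group in case_only_collisions(examples):
    print(group)
```

Any group it reports should be kept as separate compounds, since aromatic and aliphatic atoms describe different structures.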
My hopefully last attempt: I have taken the master list as starting point since that seems to contain the original data. All work was done on a snapshot downloaded today (20 Aug 2019).
Result is Master Chemical List - annotated
@wvanhoorn To answer your previous question, both are IC50 measurements but are of different strains of the parasite. Most IC50 values are against the NF54 or 3D7 strain. This Pfal (K1) IC50 (Guy) is against the K1 strain which is multi-drug resistant. A description of the column titles can be found here. You would be correct in that averaging these two measurements is not accurate since they are from different strains. I think the PfaI EC50 uMol (Mean) column was just meant to the relevant potencies.
Yes, those are duplicate entries from the same assay which should be included in the average but aren't.
@wvanhoorn Thanks a lot for cleaning this data! It is much easier to use them.
@edwintse This is a general question about building the regression model. I understand that, ideally, we would like to have a model which correctly predicts a potency value in any range. However, if the predicted potency value is greater than 1 uM, does it have any practical meaning in your research? For example, does it make any difference if the predicted potency of a compound is >10, >25 or >50?
I guess it would be a lot less useful to be able to predict inactive compounds accurately so we wouldn't really distinguish between >10, >25 and >50 as being any different
So would it be correct to say that accuracy of the regression model above a certain threshold (let say 10) does not matter?
Yes, that sounds reasonable
@edwintse In the provided dataset there are many records where the potency values are ">10" or ">25". How are you going to evaluate the predictions of the submitted models if your experimental results show ">10" or ">25" outcomes?
For example, one (submitted) model predicted potency value 12, another model predicted 20 and your experimental result showed ">10". Which model is more accurate?
Maybe we should introduce a potency threshold after which the predicted results should be treated the same...
@mmgalushka As you mentioned before, the ability to differentiate between inactive compounds is not entirely necessary. For instance, if one model predicted a compound to have a potency of 12 uM while another predicted it to be 20 uM, the actual experimental results would depend on the max concentration that the assay was run in (i.e. if the compound gets made and tested, it would return a result of >10 uM regardless).
The accuracy of the models will therefore be determined based on the ability to predict active compounds, say <2.5 uM.
The assay that we currently use has a max concentration of >25 uM, however I would say that much of this upper range is not terribly useful. So perhaps an upper threshold of >10 uM would suffice.
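To make the effect of such an upper threshold concrete, here is one possible way to compare a prediction with a measurement once both are capped at 10 uM. This is only an illustrative sketch and not the organisers' scoring method. The log10 error, the function names, and representing a ">10" result by any value at or above the cap are all assumptions.

```python
import math

CAP_UM = 10.0  # upper threshold suggested above


def capped(value_um: float, cap: float = CAP_UM) -> float:
    """Treat every potency at or above the cap as equal to the cap."""
    return min(value_um, cap)


def capped_log_error(predicted_um: float, measured_um: float, cap: float = CAP_UM) -> float:
    """Absolute log10 error after capping both values."""
    return abs(math.log10(capped(predicted_um, cap)) - math.log10(capped(measured_um, cap)))


# Predictions of 12 uM and 20 uM against a ">10 uM" result score the same.
print(capped_log_error(12.0, 10.0), capped_log_error(20.0, 10.0))
```

Under this convention, models are only distinguished by how well they predict compounds below the cap.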
Thanks, @edwintse for the clarification! Am I right to say that submission result would look like this:
Compound | Potency |
---|---|
c1ncc2n(c1Oc1cc3c(cc1)cccc3)c(nn2)c1ccc(cc1)OC(F)F | 0.023 |
CCOc1ccc(cc1OCC)c2nonc2NC(=O)c3cccc(C)c3 | 0.453 |
CCOC(=O)c1ccc2nc(cc(O)c2c1)c3ccccc3 | >2.5 |
CCOC(=O)C1=CN(CC)c2cc(N3CCCCC3)c(F)cc2C1=O | 1.23 |
[O-][N+](=O)c1ccc(C=NNc2nc3ccccc3[nH]2)cc1 | >2.5 |
... | ... |
Note: These are dummy potency values, used only as an example.
So we indicate the potency value up to 2.5 and everything above it just indicated as ">2.5".
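A minimal sketch of how predictions could be written in that form, assuming the 2.5 uM cutoff proposed here is the one finally adopted. The function name `format_potency` is hypothetical, and the potencies are the dummy values from the table above.

```python
def format_potency(predicted_um: float, cutoff: float = 2.5) -> str:
    """Report the value itself up to the cutoff, and '>cutoff' above it."""
    if predicted_um > cutoff:
        return f">{cutoff:g}"
    return f"{predicted_um:g}"


# Dummy predictions, matching the example table above.
predictions = [
    ("CCOc1ccc(cc1OCC)c2nonc2NC(=O)c3cccc(C)c3", 0.453),
    ("CCOC(=O)c1ccc2nc(cc(O)c2c1)c3ccccc3", 7.9),
]

print("Compound | Potency |")
print("---|---|")
for smiles, potency in predictions:
    print(f"{smiles} | {format_potency(potency)} |")
```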
@edwintse You replied in this post that the compounds for prediction will most likely come from two sources.
Do you have the final set of compounds, which we need to predict? Or it will be provided later?
I will finalise the test set of compounds for the competition early next week and will post it here.
Maybe this data would be useful to someone. I tried to visualize "training" compounds in Series 4 together with "target" compounds announced in #71 and #73 for my research (see visualizations below).
Note: In order to do this visualization, I standardized each SMILES (using MolVS) and converted it into a "fingerprint" (using a variational autoencoder trained on ChEMBL v23).
Visualization for compounds S4 with compounds 71 is in here
# | Smiles |
---|---|
0 | Fc1ccc(CCOc2cncc3nnc(-c4ccc(C(F)(F)F)nc4)n23)cc1F |
1 | Clc1ccccc1CCOc1cncc2nnc(-c3ccncc3)n12 |
2 | FC(F)Oc1ccc(-c2nnc3cncc(OCCC4COC4)n23)cc1 |
3 | OCC(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1OCc1ccccc1 |
4 | OCc1ccc(COc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1 |
5 | c1ccc(CCOc2cncc3nnc(C4CCNCC4)n23)cc1 |
6 | COc1ccc(CCOc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1 |
7 | COc1ccc(CCNc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1OC |
8 | OCC(COc1cncc2nnc(-c3ccc4cc[nH]c4c3)n12)c1ccccc1 |
9 | OC@Hcc3)n12)c1ccccc1 |
10 | FC(F)Oc1ccc(-c2nnc3cncc(SCCc4ccccc4)n23)cc1 |
11 | O=C(O)C(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1 |
12 | FC(F)Oc1ccc(-c2nnc3cncc(OCCOc4ccccc4)n23)cc1 |
13 | O=C(Nc1cccc(Cl)c1)c1cncc2nnc(-c3cccnc3)n12 |
14 | COc1ccc(-c2nnc3cncc(OCCc4ccccc4)n23)cc1 |
Visualization for compounds S4 with compounds 73 is in here
# | Smiles |
---|---|
0 | Fc1ccc(CCOc2cncc3nnc(-c4ccc(C(F)(F)F)nc4)n23)cc1F |
1 | Clc1ccccc1CCOc1cncc2nnc(-c3ccncc3)n12 |
2 | FC(F)Oc1ccc(-c2nnc3cncc(OCCC4COC4)n23)cc1 |
3 | OCC(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1OCc1ccccc1 |
4 | OCc1ccc(COc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1 |
5 | c1ccc(CCOc2cncc3nnc(C4CCNCC4)n23)cc1 |
6 | COc1ccc(CCOc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1 |
7 | COc1ccc(CCNc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1OC |
8 | OCC(COc1cncc2nnc(-c3ccc4cc[nH]c4c3)n12)c1ccccc1 |
9 | OC@Hcc3)n12)c1ccccc1 |
10 | FC(F)Oc1ccc(-c2nnc3cncc(SCCc4ccccc4)n23)cc1 |
11 | O=C(O)C(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1 |
12 | FC(F)Oc1ccc(-c2nnc3cncc(OCCOc4ccccc4)n23)cc1 |
13 | O=C(Nc1cccc(Cl)c1)c1cncc2nnc(-c3cccnc3)n12 |
14 | COc1ccc(-c2nnc3cncc(OCCc4ccccc4)n23)cc1 |
15 | Fc1ccc(-c2nnc3cncc(OCCc4ccccc4)n23)cc1 |
16 | O=C(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1 |
17 | CCN(CC)C(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccc(F)c(F)c1 |
18 | O=C(Nc1ccnc(C(F)(F)F)c1)c1cncc2nnc(C34C5C6C3C3C4C5C63I)n12 |
19 | O=C(c1ccc(-c2nnc3cncc(OCCc4ccc(F)c(F)c4)n23)cc1)N1CCOCC1 |
20 | CN(C)c1ccc(C(O)(CO)COc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1 |
21 | OCC(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccc(O)cc1 |
22 | Nc1ccc(C(CO)COc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1 |
I have a question re compounds OSM-S-418, OSM-S-424 and OSM-S-564. I think they all contain a carborane but the connectivity is odd since it consists of a single large ring with Boron/Carbon atoms. In contrast to this, MMV1794644 in the prediction set contains a carborane in the expected cluster form. The carborane of OSM-S-564 may be the same as the one in MMV1794644? If they are the same the representation should be the same.
Apologies for the late reply. They are all carboranes and are most accurately represented in their cluster forms. The appropriate SMILES for these compounds are as follows:
OSM-S-418: FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCCC4567[BH]89%10[CH]%11%124[BH]8%13%14[BH]%11%15%16[BH]%13%17%18[BH]%149%19[BH]%105%20[BH]%21%226[BH]%17%15([BH]%22%12%167)[BH]%18%19%20%21)N32
OSM-S-424: FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCCC456[BH]78[BH]49([BH]%10%118[BH]%129%13%14)[BH]%15%145[BH]%16%17%13[BH]%18%10%12[BH]7%19%11[H-][BH]%19%18%16[CH]%17%156)N32.[Cs+]
OSM-S-564: FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCCC4567[BH]89%10[BH]%11%124[BH]8%13%14[CH]%11%15%16[BH]%13%17%18[BH]%149%19[BH]%105%20[BH]%21%226[BH]%17%15([BH]%22%12%167)[BH]%18%19%20%21)N32
UPDATE: Round 2 has now concluded. Thanks to all who participated! The results announcement can be found here.
OSM will be launching the second round of the predictive modelling competition on August 1st. This will build upon the first round which was run in 2016 (results here). All relevant background can be found in the previous two links and on the Wiki (tab above). Submissions will be allowed up to the end of the day on September 11th.
The aim of the competition is to develop a computational model that predicts new, potent molecules in OSM Series 4.
The target of these molecules is strongly suspected to be PfATP4, since there has so far been essentially a perfect correlation between activity of molecules in this series vs the parasite and in an assay that measured ion regulation, used as a proxy for activity vs PfATP4. PfATP4 is an important target for the development of new drugs for malaria.
We are providing a dataset of actives and inactives. The challenge is to use the data to develop a model that allows us to (better) design compounds in Series 4 that will be active against that target. This competition is part of Open Source Malaria, meaning that everything needs to adhere to the Six Laws.
This round of the competition is funded by the AI3SD+ network. Details of the submitted proposal can be found here (#2). The funding allows us to actually make the molecules that are proposed to be active.
Competition Timeline
The Competition OSM will provide:
Submission Rules:
How will entries be assessed? There is a relatively high confidence level that PfATP4 is the molecular target for Series 4 (i.e. compounds that are potent in vitro show disruption of ion regulation in the PfATP4 assay). Therefore, for this round of the competition, we will be focussing on the prediction of active Series 4 compounds (rather than the prediction of any active compounds vs PfATP4) since the two should correlate.
What's the prize?
Two prizes will be awarded, one for a private sector entry and one for a public sector entry.
...also the opportunity to contribute to our understanding of a new class of antimalarials
...and authorship on a resulting peer-reviewed publication arising from the OSM consortium
*A 'valid' entry is one that stands up to the rigour expected from published in silico models. Judges are entitled to use discretion in the case of unconventional entrants, for example those from people with no formal training such as high school students.
Comments and questions can go below. The above rules/guidance will be periodically updated.