COMPETITION ROUND 2: A Predictive Model for Series 4

edwintse commented 5 years ago

UPDATE: Round 2 has now concluded. Thanks to all who participated! The results announcement can be found here.

OSM will be launching the second round of the predictive modelling competition on August 1st. This will build upon the first round which was run in 2016 (results here). All relevant background can be found in the previous two links and on the Wiki (tab above). Submissions will be allowed up to the end of the day on September 11th.

The aim of the competition is to develop a computational model that predicts new, potent molecules in OSM Series 4.

The target of these molecules is strongly suspected to be PfATP4, since there has so far been essentially a perfect correlation between activity of molecules in this series vs the parasite and in an assay that measured ion regulation, used as a proxy for activity vs PfATP4. PfATP4 is an important target for the development of new drugs for malaria.

We are providing a dataset of actives and inactives. The challenge is to use the data to develop a model that allows us to (better) design compounds in Series 4 that will be active against that target. This competition is part of Open Source Malaria, meaning that everything needs to adhere to the Six Laws.

This round of the competition is funded by the AI3SD+ network. Details of the submitted proposal can be found here (#2). The funding allows us to actually make the molecules that are proposed to be active.

Competition Timeline

Competition launch: The competition will run from 01/08/19 to 11/09/19.
Paper write-up: This will happen as the competition is being run and will be submitted to the forthcoming special issue of the Beilstein Journal of Organic Chemistry.
Judging and results: A panel (to be announced) will evaluate the models against an undisclosed test set to determine the model(s) best able to predict activity of knowns.
Synthesis of top compounds: With the best performing model(s) as judged above, the relevant submitters will be asked to suggest new potent Series 4 compounds. These will be synthesised and biologically evaluated to determine the predictive capabilities of the models.

The Competition

OSM will provide:

A dataset containing active and inactive compounds against PfATP4 along with their in vitro potencies (here). This list has been updated to include the more recent Pathogen Box results from the Kirk lab that were used as the test set in the last competition.
The Master Chemical List which contains activity data for all OSM compounds from Series 1-4.
Jeremy Horst's Homology Model built from crystal structures of the closest mammalian homolog (SERCA) PfATP4-PNAS2014.pdb.txt
Details of the relevant mutations known to be associated with resistance.

Submission Rules:

Entries may either be submitted directly to GitHub (uploaded in the Submitted Models folder in the Code tab above) or be uploaded onto an ELN and a link posted in this repository.
Entrants can work individually or in teams (no limit to team size).
Entrants must work openly during the competition. This doesn't necessarily mean that inputs have to be logged in real time (although that is strongly encouraged), but entries that have not openly deposited working data on a regular basis prior to the deadline will not be accepted. Open Electronic Notebooks (ELN) such as Labtrove or LabArchives can be useful places to post data and work collaboratively. For example, Ho Leung Ng's ELN can be viewed and commented on here. Please note that LabTrove authors are not alerted when a comment is added to an entry so GitHub is a useful place to tag others.
Entrants must agree to their work's incorporation into a future OSM journal publication(s).
Competition winner(s) will be authors on any relevant future paper(s).
Any valid* entries will at least be acknowledged on any relevant future paper(s) and if the contribution is significant may lead to authorship.

How will entries be assessed?

There is a relatively high confidence level that PfATP4 is the molecular target for Series 4 (i.e. compounds that are potent in vitro show disruption of ion regulation in the PfATP4 assay). Therefore, for this round of the competition, we will be focussing on the prediction of active Series 4 compounds (rather than the prediction of any active compounds vs PfATP4) since the two should correlate.

For the final submission, entrants will predict the potencies of an undisclosed set of Series 4 compounds (to be provided at a later date)
A judging panel (to be announced) will evaluate these predictions in comparison with experimental data to determine the winner(s)

What's the prize?

Two prizes will be awarded, one for a private sector entry and one for a public sector entry.
...also the opportunity to contribute to our understanding of a new class of antimalarials
...and authorship on a resulting peer-reviewed publication arising from the OSM consortium

*A 'valid' entry is one that stands up to the rigour expected from published in silico models. Judges are entitled to use discretion in the case of unconventional entrants, for example those from people with no formal training such as high school students.

Comments and questions can go below. The above rules/guidance will be periodically updated.

giribio commented 5 years ago

Interesting, our team started reviewing the previous runs, datasets etc. We hope to have some promising models.

spadavec commented 5 years ago

Very interested in participating, and iterating on the last competition. Is there a formal definition for the core of Series 4? Curious to know where we can enumerate and where we can't.

edwintse commented 5 years ago

@spadavec I think it would be best to stick to the triazolopyrazine core with substituents in the northwest and northeast positions (e.g. MMV897698 as a simple example) considering the better potencies that we typically get with those.

jsilter commented 5 years ago

Was there ever a full formal writeup for the first round? I see at #538 that it was delayed due to data embargoes and such, hopefully those have passed.

edwintse commented 5 years ago

@jsilter There hasn't been yet, but I am in the process of writing it up on the wiki in this repo so check back there soon. At the same time I am also drafting up this info for the paper (I'll create a new issue about this shortly).

BenedictIrwin commented 5 years ago

Not clear exactly what we should leave in the submitted models folder. Would a prediction for missing values of each compound already in the sheet suffice? Or does it have to be a binary capable of taking a new compound SMILES and outputting the predicted activity?

By working openly, does this mean I can just place my data etc. in a repository e.g. https://github.com/BenedictIrwin/OSM and update that as I make progress?

edwintse commented 5 years ago

@BenedictIrwin Hi, I have added some details about what will be required for submission to the original post above, but it is more the latter. In short, all entrants will be provided with the molecular identifiers (e.g. SMILES) for a set of existing Series 4 compounds (where we have not revealed the experimental potencies) and you will be required to predict the potencies for these compounds.

Yes, working openly means that at any stage, if someone wants to see the progress you've made, they can easily look at your work on an ELN or on GitHub. Feel free to place your data/working in a repository (either this one or your own) and update/provide links as you make progress.

wvanhoorn commented 5 years ago

Hi, I am trying to get my head around the provided activity data:

All data is in Google Sheet 'Ion Regulation Data for OSM Competition' (http://tinyurl.com/OSM-Series4CompData)? If this is the case what is the relevance of the Master Chemical List (https://docs.google.com/spreadsheets/d/1Rvy6OiM291d1GN_cyT6eSw_C3lSuJ1jaR7AJa8hgGsc/edit#gid=510297618)?

Re the data in Sheet 'Ion Regulation Data for OSM Competition':

The red/brown highlights indicate missing data and/or structures, i.e. entries that can be ignored?
What is the relevance of the column 'Ion Regulation Activity'? If relevant, what to do with missing data?
Rows 608-835 and 960-1278 do not contain activity data, should these be ignored, treated as prediction set, other?
Is the data in row 836-959 any different from the data in row 2-607? Why is it separate since the first block is sorted by activity?
Do you have a cut-off when to classify a compound as 'active', something like Potency vs Parasite (uMol) <= 1 uM?
There is no test set provided as yet? If we generate the models that we can't share since they run on a proprietary platform (which will most likely be our case) how is model performance compared between entries?

Willem

edwintse commented 5 years ago

@wvanhoorn I'll try to answer these as best as I can.

1) The compounds in the "Ion Regulation Data for OSM Competition" sheet have associated PfATP4 data (i.e. do they have ion regulation activity or not). However, this list contains non-OSM compounds as well. The "Master Chemical List" is the complete list of OSM compounds from Series 1, 3 and 4 with in vitro potencies (n.b. any compound from Series 1 is also known to be inactive against PfATP4). Round 1 of the competition was more focussed on the prediction of active compounds against PfATP4 (not limited to Series 4). For this round, we are looking for predictions for the activities of Series 4 compounds specifically so you can use the Master Chemical List to train your models.
2) Yes, those entries can be ignored.
3) Ion regulation activity indicates whether or not it is active in the PfATP4 assay (1 means the compound shows ion regulation activity, 0 means it doesn't). In the case of Series 4, we see correlation between PfATP4 activity and in vitro potency, so any OSM compound in the list should be relatively potent. Any OSM compound without a number in this column can be found in the Master List and can be used for training the predictions.
4) The compounds in these rows are from the MMV Malaria Box and Pathogen Box and haven't been evaluated against the parasite. Considering that these compounds are all structurally different from Series 4 compounds, I'm not sure how helpful they will be for developing a model to predict the activities of Series 4 compounds specifically, so perhaps it's better to ignore them?
5) No difference. The data in rows 836-959 were just added more recently and haven't been sorted.
6) Generally, our compounds are classified as active if they are <1 uM, weakly active between 1-2.5 uM, and inactive >2.5 uM.
7) Yes, the final test set will be provided at a later date. It's understandable that the model itself won't be able to be shared. We are not focused as much on the actual method, but the accuracy of the prediction. Each submission will need to provide the predicted potencies for this test set. By comparing these predictions with the experimental data for the test set, we can determine which models perform the best. The submitters of the best model(s) will then be asked to generate new active compounds that will then be synthesised and tested.

Let me know if you have any further questions
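For anyone scripting against the cut-offs in answer 6 above, a minimal Python sketch follows. The function name is hypothetical, and the handling of the boundary values (exactly 1 uM and exactly 2.5 uM are treated as weakly active) is an assumption, since the thread only gives "<1", "1-2.5" and ">2.5".

```python
def classify_potency(ic50_um: float) -> str:
    """Map a Pfal IC50 in micromolar to the activity classes quoted above."""
    if ic50_um < 1.0:
        return "active"
    if ic50_um <= 2.5:
        # Assumption: the boundary values 1 and 2.5 uM count as weakly active.
        return "weakly active"
    return "inactive"


# Arbitrary example values, for illustration only.
for value in (0.0255, 1.8, 8.586):
    print(value, classify_potency(value))
```

Keeping the thresholds in one function makes it easy to switch to a plain active/inactive split later if a classification model is preferred.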

mmgalushka commented 5 years ago

Hi,

I'm in the process of creating a dataset containing two fields, "SMILES" and "Active/Inactive" status. If I ignore all records where "Smiles" is missing or "Ion Regulation Activity" is neither 0 nor 1, I get 576 "clean" compounds (510 inactive and 66 active).

Taking into consideration @edwintse comments, may I apply the following rule to records where "Ion Regulation Activity" is missing but "Potency vs Parasite (uMol)" is available?

Rule:

if "Potency vs Parasite (uMol)" < 1:
    "Ion Regulation Activity" = 1
else:
    "Ion Regulation Activity" = 0

Nick

edwintse commented 5 years ago

@mmgalushka This rule could only be applied to OSM Series 4 compounds since we know there is correlation between ion regulation activity and in vitro potency. I don't think this could be accurately applied to the other compounds from the Ion Regulation sheet since there are lots of different structural classes of compounds for which we don't know if there is any correlation.
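The restriction described here can be sketched as a small Python helper. This is only an illustration: the function name, the argument names and the use of the string "4" as the series label are assumptions, not the layout of the actual spreadsheets.

```python
from typing import Optional


def impute_ion_regulation(series: str,
                          potency_um: Optional[float],
                          measured: Optional[int]) -> Optional[int]:
    """Return a 0/1 ion regulation label, imputing it only for Series 4."""
    # A measured label always wins over an imputed one.
    if measured in (0, 1):
        return measured
    # Outside Series 4 the potency/ion-regulation correlation is unknown,
    # so the label is left missing rather than guessed.
    if series != "4" or potency_um is None:
        return None
    # Proposed rule from the thread: potency below 1 uM implies activity.
    return 1 if potency_um < 1.0 else 0
```

Returning None for non-Series-4 compounds keeps the imputed labels separate from the experimentally measured ones.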

spadavec commented 5 years ago

Will the activity of the test molecules be measured as their activity in the ion regulation assay, the Pfal EC50 assay, or both? Do we have known tolerances or errors for either assay?

edwintse commented 5 years ago

@spadavec The test compounds will have been measured in the Pfal IC50 assay only. I'm actually not sure about the specifics of the assay tolerances/errors, but I know that the Pfal IC50 assay uses Mefloquine as the standard control with an acceptable pIC50 range of 7.5-7 if that helps.
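Since the control range is quoted in pIC50 while the spreadsheets use micromolar, the standard conversion may be useful: pIC50 = -log10(IC50 in mol/L), which for a value in uM is 6 - log10(IC50). On that basis a pIC50 of 7 to 7.5 corresponds to roughly 0.1 uM down to about 0.032 uM. The function names in this sketch are hypothetical.

```python
import math


def pic50_from_um(ic50_um: float) -> float:
    """Convert an IC50 in micromolar to pIC50 (-log10 of the molar IC50)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return 6.0 - math.log10(ic50_um)


def um_from_pic50(pic50: float) -> float:
    """Convert a pIC50 back to an IC50 in micromolar."""
    return 10 ** (6.0 - pic50)


print(pic50_from_um(1.0))   # the 1 uM activity cut-off on the pIC50 scale
print(um_from_pic50(7.5))   # upper end of the quoted control range, in uM
```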

mmgalushka commented 5 years ago

I'm not from a BioChem background and am a little bit lost in the domain-specific terminology. I'm posting the script that I'm using to clean the dataset.

I exported "Ion Regulation Data for OSM Competition" file in TSV format and applied the following script:

# Zero-based column positions in the exported TSV.
Ion_Regulation_Activity = 2
Smiles = 4

with open('datasets/Ion Regulation Data for OSM Competition.tsv') as r, \
        open('datasets/Ion Regulation Data for OSM Competition.csv', mode='w') as w:
    for record in r:
        fields = record.rstrip('\n').split('\t')
        # Skip short rows that do not reach the SMILES column.
        if len(fields) <= Smiles:
            continue

        activity_value = fields[Ion_Regulation_Activity].strip()
        smiles_value = fields[Smiles].strip()

        # Keep only rows with a structure and a binary activity label.
        if smiles_value and activity_value in ('0', '1'):
            w.write(smiles_value + ',' + activity_value + '\n')

The output is a CSV file with two columns, "SMILES" and "ACTIVITY". I got 851 compounds in total, of which 66 are active.

Am I on the right track? Do I need to consider something else?

wvanhoorn commented 5 years ago

I am still confused and it seems that I am not the only one. This runs the risk of becoming a data interpretation/cleaning instead of data modeling competition. Could we therefore settle first on a single file with all relevant data without any irrelevant data (for instance only series 4 compounds if the aim is to only predict series 4 compounds) so that we all depart from the same starting point? And provide a specific description what needs to be modeled. I initially thought the aim was to model 'Potency vs Parasite (uMol)', now it seems it should be 'Ion Regulation Activity' but I am still not sure.

mmgalushka commented 5 years ago

I 100% agree with @wvanhoorn. It would be beneficial for all teams to have a single file with samples only relevant to this competition, containing the input feature(s) and a target feature.

edwintse commented 5 years ago

Hi all, Apologies if there has been any confusion. To clarify, the aim of this competition is to predict the Pfal IC50 potencies of Series 4 compounds that are active against PfATP4. This is slightly different to the aim of Round 1 where the aim was more broadly to predict any active compounds against PfATP4.

Both spreadsheets are supposed to be complementary. The ideas behind the two are as follows:

Ion regulation spreadsheet: We highly suspect PfATP4 to be the target for the Series 4 compounds (potent Series 4 compounds show activity in the ion regulation assay; this is indicated by a 1 in the ion regulation activity column) but the structure of the target protein has not been solved. This means that we don't know what key interactions our compounds are making with the target. The ion regulation spreadsheet contains all known compounds (from many different chemotypes) that have been experimentally evaluated against PfATP4. All of this structural information (along with the provided homology model and relevant mutations) can be used to aid in discerning any key interactions that might be taking place, and therefore be used to predict new potent Series 4 compounds that exploit these interactions.

Master Chemical List: This list contains all OSM compounds from Series 1-4 with in vitro potencies with the additional knowledge that Series 1 does not target PfATP4 (i.e. 0 for ion regulation activity). As we are specifically looking for predictions on compounds with a triazolopyrazine core, the changes in structural features between Series 4 compounds and their associated in vitro potencies can be used to develop and refine your models.

n.b. The models will be evaluated for their ability to predict the potencies of a test set that consists of Series 4 compounds only.

With that in mind, you are free to use as much or as little of the provided data that you think will best achieve this goal. I believe that by providing all the data, all aspects can be considered when developing the models.

mmgalushka commented 5 years ago

I would like to check the following statements about my model specifically.

My model takes only one feature (SMILES) as an input. According to your comments, am I right to say that we are trying to predict the potencies, which are defined in the "Ion Regulation Data for OSM Competition" file under the field "Potency vs Parasite (uMol)"? If this is true, our model should predict "real" values.

To summarize the above, we need to build a regression model which predicts "Potency vs Parasite (uMol)" from a compound's "SMILES". Is that the right conclusion?

PS: I understand that some potency values can be sourced from "Master Chemical List" file, but at this stage, I just want to concentrate on "Ion Regulation Data for OSM Competition" file.

edwintse commented 5 years ago

@mmgalushka Yes, that's correct. Totally fine to just concentrate on the one file at this stage.

mmgalushka commented 5 years ago

Thanks a lot @edwintse!

I used the following Python script to extract records:

# Zero-based column positions in the exported TSV.
Potency_vs_Parasite = 1
Smiles = 4

with open('datasets/Ion Regulation Data for OSM Competition - Malaria Molecules.tsv') as r, \
        open('datasets/Ion Regulation Data for OSM Competition - Malaria Molecules.csv', mode='w') as w:
    for record in r:
        fields = record.rstrip('\n').split('\t')
        # Skip short rows that do not reach the SMILES column.
        if len(fields) <= Smiles:
            continue

        potency_value = fields[Potency_vs_Parasite].strip()
        smiles_value = fields[Smiles].strip()

        if smiles_value and potency_value:
            try:
                float(potency_value)  # make sure this is a real value
            except ValueError:
                continue  # skip non-numeric entries such as the header row
            w.write(smiles_value + ',' + potency_value + '\n')

Got the following file;

There are many records with a potency of exactly 10 or 50, whereas the majority of records are between 0 and 8.0. Are these "10" and "50" values correct?

edwintse commented 5 years ago

Compounds that have potency values of 10 or 50, or that have a potency qualifier of '>', can be treated as inactive. It means that the IC50 values were greater than the max concentration that was tested in the assay.
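One way to encode this rule when cleaning the data is sketched below in Python. The function names are hypothetical, and it is an assumption that a qualifier may appear either in its own column or as a prefix in the value cell (e.g. '>10').

```python
from typing import Tuple


def parse_potency(text: str) -> Tuple[str, float]:
    """Split a cell such as '>10' or '0.45' into (qualifier, value)."""
    text = text.strip()
    qualifier = ""
    if text and text[0] in "<>":
        qualifier, text = text[0], text[1:].strip()
    return qualifier, float(text)


def is_reported_inactive(qualifier: str, value: float) -> bool:
    """Apply the rule above: a '>' qualifier, or a value of exactly 10 or 50."""
    return qualifier == ">" or value in (10.0, 50.0)


# Arbitrary example cells, for illustration only.
for cell in (">10", "50", "0.45"):
    q, v = parse_potency(cell)
    print(cell, is_reported_inactive(q, v))
```

Flagging these rows explicitly, rather than training on 10 and 50 as if they were measured values, avoids treating an assay ceiling as a real potency.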

spadavec commented 5 years ago

@edwintse Thanks for all of the clarification! Just as a follow up, if you consider only S4 compounds that have enough data to contribute to a regression model (e.g. have potency and SMILES strings) there are only ~130 compounds, which is definitely on the low side for an accurate model (typically this number needs to be closer to ~500 for pIC50 values to have an error rate of ~1, which is getting close to on-par with errors in wet measurements of IC50/EC50 values). If we expand the criterion for acceptance to be over/under 1uM (e.g. just a classification job), the accuracy and results should be much better across the board--has that been considered at all for this?

edwintse commented 5 years ago

@spadavec Are these ~130 S4 compounds from the ion regulation spreadsheet that have both potency vs parasite and ion regulation activity? or just potency vs parasite and not ion regulation activity? There should be close to 350 S4 compounds that have potency vs parasite data (all on the Master Chemical List but you have to filter out intermediate structures). Still, this is lower than the desired number of compounds.

It sounds reasonable to me to expand the criterion if that will provide better accuracy/results for the model.

mmgalushka commented 5 years ago

In one of the previous posts, @edwintse wrote:

Generally, our compounds are classified as active if they are <1 uM, weakly active between 1-2.5 uM, and inactive >2.5 uM.

It was regarding the "Potency vs Parasite" field in the "Ion Regulation Data for OSM Competition" file.

At the same time, another quote:

Ion regulation activity indicates whether or not it is active in the PfATP4 assay (1 means the compound shows ion regulation activity, 0 means it doesn't).

It was regarding the "Ion Regulation Activity" field in the "Ion Regulation Data for OSM Competition" file.

I selected 2 examples (however there are a number of similar samples in dataset) from the "Ion Regulation Data for OSM Competition" file:

SMILES	Potency vs Parasite (uMol)	Ion Regulation Activity
CC1CN(CC(C)O1)C(=O)c2sc3ccccc3c2Cl	0.0255	0
FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OC(C)C4=CC=C(F)C(F)=C4)N32	8.586	1

Question:

Why is the first compound, with "Potency vs Parasite (uMol)" << 1, inactive according to "Ion Regulation Activity"?
Why is the second compound, with "Potency vs Parasite (uMol)" >> 1, active according to "Ion Regulation Activity"?

I'm sorry, maybe I completely misunderstand these values. I initially thought there was some relation between "Potency vs Parasite (uMol)" and "Ion Regulation Activity"...

edwintse commented 5 years ago

@mmgalushka Not all compounds in the ion regulation data spreadsheet will show correlation between the two assays.

The first compound CC1CN(CC(C)O1)C(=O)c2sc3ccccc3c2Cl was part of the MMV Malaria Box (indicated by M in the Ion Regulation Test Set Column). This box contained 400 potent antimalarial compounds with many different structures that could be used for further development. Only 28 of the 400 compounds were found to have any ion regulation activity. The remaining 372 can be thought of as having a different MoA to PfATP4.

A similar thing can be said for the MMV Pathogen Box compounds (indicated by P in the Ion Regulation Test Set Column). Of the 400 compounds (~120 antimalarials; the rest are for other indications), only 11 were found to have ion regulation activity.

Regarding the S4 compound FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OC(C)C4=CC=C(F)C(F)=C4)N32, there is high confidence that there is correlation between potency vs parasite and ion regulation activity, however there were 3 outlier compounds when this batch was evaluated where this relationship didn't hold true (the 3 compounds are shown here).

So essentially, for OSM S4 compounds in general, you can consider this relationship to be correct.

mmgalushka commented 5 years ago

Thanks a lot, @edwintse for this clarification!

I am still trying to understand the relation between "Potency vs Parasite" and "Ion Regulation Activity". I collected the following stats based on the "Ion Regulation Data for OSM Competition" file:

Potency vs Parasite (uM)	Ion Regulation "Inactive"	Ion Regulation "Active"
<= 1 uM	278	40
> 1 uM and <= 2.5 uM	124	17
> 2.5 uM	27	2

PS: I only considered records where both values, "Potency vs Parasite" and "Ion Regulation Activity", are available and valid.
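A table like the one above can be reproduced with a short Python sketch. The function names and the demo records are invented for illustration and do not come from the spreadsheet; only the three potency bins follow the table.

```python
from collections import Counter
from typing import Iterable, Tuple


def potency_bin(ic50_um: float) -> str:
    """Assign a potency in uM to one of the three bins used in the table."""
    if ic50_um <= 1.0:
        return "<= 1 uM"
    if ic50_um <= 2.5:
        return "> 1 and <= 2.5 uM"
    return "> 2.5 uM"


def cross_tabulate(records: Iterable[Tuple[float, int]]) -> Counter:
    """Count (potency bin, ion regulation label) pairs."""
    return Counter((potency_bin(p), label) for p, label in records)


# Invented records of (potency in uM, ion regulation label).
demo = [(0.0255, 0), (0.3, 1), (1.7, 0), (8.586, 1)]
table = cross_tabulate(demo)
for key in sorted(table):
    print(key, table[key])
```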

I'm struggling to understand how, even if we can create the "perfect model" to predict "Potency vs Parasite", it can help us to predict "Ion Regulation Activity".

What I'm trying to say, is that even in the interval (uMol) <= 1 uM we have the challenge to classify Ion Regulation Activity correctly...

edwintse commented 5 years ago

Keep in mind, the predictions that we are seeking are targeted for OSM Series 4 compounds. We are not so much interested in predicting ion regulation activity for Series 4, rather, we want to predict the potency vs parasite since we can make the assumption that active Series 4 compounds against the parasite will be active in the ion regulation assay as well.

mmgalushka commented 5 years ago

Thanks a lot @edwintse! You are absolutely right. I just forgot again that the dataset contains different experiments. I got it now!

wvanhoorn commented 5 years ago

@mmgalushka @edwintse I have taken the file earlier created by @mmgalushka, calculated the InChiKey from the Smiles, joined the OSM master list on InChiKey and kept records where series = 4. This leaves 194 compounds from Series 4 that have a potency value. Do we now finally have the training set for the competition?

https://docs.google.com/spreadsheets/d/1ReZz-_I90YYtiyEJucgj_i_6ckMQ3Rr1ocDvfPsgkOw/edit?usp=sharing

wvanhoorn commented 5 years ago

The posts below seem to contain the prediction set (assuming they are all series 4). Any more to follow? https://github.com/OpenSourceMalaria/Series4/issues/73#issue-481444589 https://github.com/OpenSourceMalaria/Series4/issues/71#issue-469138536

edwintse commented 5 years ago

Yes, those compounds will be used as the prediction set. It seems unlikely that any more will be added considering they would need to be synthesised and sent for testing before the judging occurs. I'll keep you updated if that changes.

wvanhoorn commented 5 years ago

@edwintse Could you please also confirm the training set I assembled earlier is correct (with the proviso it only contains series 4 compounds, if people want to include other series the set from @mmgalushka should be used)?

edwintse commented 5 years ago

Yes, that list looks good to me

mmgalushka commented 5 years ago

Thanks, @wvanhoorn for sharing the dataset!

I found the following duplicates:

FC1=C(F)C=CC(C(OC)COC2=CN=CC3=NN=C(C4=CC=C(C#N)C=C4)N32)=C1 (9 duplicates)
FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCC(C4=CC=CC=C4)CO)N32 (2 duplicates)

Question 1: What potency is correct for the first compound?

FC1=C(F)C=CC(C(OC)COC2=CN=CC3=NN=C(C4=CC=C(C#N)C=C4)N32)=C1 -> 0.1105 or 0.207

Question 2: Are these the same compounds?

The differences are only in upper/lower case, but they might be significant from the chemistry point of view.

OC(COc2cncc3nnc(C1CCCCC1)n23)c4ccccc4 OC(COc2cncc3nnc(c1ccccc1)n23)c4ccccc4

c1ncc2n(c1Oc1cc3c(cc1)CCCC3)c(nn2)c1ccc(cc1)OC(F)F c1ncc2n(c1Oc1cc3c(cc1)cccc3)c(nn2)c1ccc(cc1)OC(F)F

c1(ccccc1)CCCc1cncc2n1c(nn2)c1ccc(cc1)OC(F)F c1(ccccc1)CCCC1CNCc2n1c(nn2)c1ccc(cc1)OC(F)F

C1NCc2n(C1CCO)c(nn2)c1ccc(cc1)OC(F)F c1ncc2n(c1CCO)c(nn2)c1ccc(cc1)OC(F)F

wvanhoorn commented 5 years ago

@mmgalushka
Oops. Yes the duplicates were generated by me during the join, I assumed the master list was deduplicated but it's not. I will redo it a bit more carefully and share.

Re 1: a factor of two in measured IC50 or EC50 is normal. In the master list you can find the raw data and see that values easily differ a factor of two or more. The single data points in the competition list are actually averages of multiple measurements in different labs.

@edwintse: this opens another question: it seems that IC50 and Ki data have been averaged even if they are from the same lab. For instance the first entry, OSM-A-1:

Pfal IC50 (Guy) = 3.05
Pfal (K1) IC50 (Guy) = 4.379
PfaI EC50 uMol (Mean) = 3.7145 (which is the average of the above two numbers)

In my understanding IC50 and Ki are two different measurements (relative vs absolute) that can be derived from a single experiment, averaging these two does not make sense since it is averaging the same single observation interpreted differently.

And another issue: OSM-S-35 has duplicate entries for a single assay, at least that is how I interpret the semi-colons. However, in the spreadsheet this equates to a string, not a numerical value, and these are all ignored in the final average:

Pfal IC50 (GSK) = 0.036; 0.012
Pfal IC50 (Avery) = 0.026; 0.038
Pfal IC50 (Ralph) = 0.011
PfaI EC50 uMol (Mean) = 0.011 (average of the above 5 numbers = 0.0246)
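The OSM-S-35 arithmetic can be checked with a few lines of Python. This sketch splits each cell on semi-colons before averaging, which reproduces the 0.0246 figure from the five values quoted above; the function names are hypothetical, and a plain arithmetic mean across labs is used only because that is the calculation being checked.

```python
from statistics import mean
from typing import Iterable, List


def split_measurements(cell: str) -> List[float]:
    """Turn a cell such as '0.036; 0.012' into a list of floats."""
    return [float(part) for part in cell.split(";") if part.strip()]


def mean_over_cells(cells: Iterable[str]) -> float:
    """Average every individual measurement found in the given cells."""
    values = [v for cell in cells for v in split_measurements(cell)]
    return mean(values)


# The three OSM-S-35 cells quoted in the comment above.
osm_s_35 = ["0.036; 0.012", "0.026; 0.038", "0.011"]
print(round(mean_over_cells(osm_s_35), 4))
```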

Looks like there is some more data cleaning to do (which normally is 90% of the effort of building a model so not too bad so far).

Re 2: these structures are different! Lower case represents aromatic atoms, upper case aliphatic.
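To find such look-alike pairs in a list, a rough string-level check in Python is sketched below. It only flags SMILES that become identical when lower-cased; this is a heuristic for spotting candidates, not a chemical comparison, and the function name is hypothetical. The first two demo strings are one of the pairs quoted above.

```python
from itertools import combinations
from typing import Iterable, List, Tuple


def case_only_pairs(smiles_list: Iterable[str]) -> List[Tuple[str, str]]:
    """Return pairs of distinct SMILES strings that differ only in letter case."""
    unique = sorted(set(smiles_list))
    return [(a, b) for a, b in combinations(unique, 2)
            if a.lower() == b.lower()]


demo = [
    "OC(COc2cncc3nnc(C1CCCCC1)n23)c4ccccc4",  # cyclohexyl (aliphatic ring)
    "OC(COc2cncc3nnc(c1ccccc1)n23)c4ccccc4",  # phenyl (aromatic ring)
    "C1NCc2n(C1CCO)c(nn2)c1ccc(cc1)OC(F)F",
]
pairs = case_only_pairs(demo)
print(pairs)
```

Any pair returned here should be kept as two separate compounds, since the case difference changes the structure.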

wvanhoorn commented 5 years ago

My hopefully last attempt: I have taken the master list as starting point since that seems to contain the original data. All work was done on a snapshot downloaded today (20 Aug 2019).

The columns 'PfaI EC50 uMol (Mean) Qualifier' and 'PfaI EC50 uMol (Mean)' were removed
Rows without Smiles were removed as well as rows without Pfal data. The latter means that at least one of the remaining columns starting with 'Pfal' had to contain a value.
The molecular structures were normalised: salts stripped, canonical tautomer calculated, charges normalised, etc.
Rows were merged by (recalculated) InChiKey.
Activity data was pivoted into columns 'Assay', 'Value' and 'Qualifier'. Activity values that were not IC50, like '100% at 40 micromolar', were removed, as well as values that did not make sense, like '0'. The original Pfal columns were left in place so that it can be seen where each data point comes from. The file was split on the three new columns so that 1 row = 1 value. During this process all other columns were copied so there is redundancy. I leave it to each individual if and how they want to average multiple values for a single compound.
Series annotation was done again since not all compounds claimed to be from series 4 contained the 'triazolopyrazine core with substituents in the northwest and northeast positions' mentioned before, see https://github.com/OpenSourceMalaria/Series4_PredictiveModel/issues/1#issuecomment-518211204. When the original series annotation was '4' but the compound contains another core (or does not have two substituents in the right position) the Series annotation is overwritten as 'not4'. Note that all series are still there, leaving it open whether or not to include data from other series.

Result is Master Chemical List - annotated
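The pivot step in the list above can be sketched in plain Python as follows. This is not the script that produced the annotated list: the function name, the splitting of semi-colon separated cells, and the example row (with ethanol, CCO, as a placeholder structure) are all assumptions made for illustration.

```python
from typing import Dict, Iterator


def pivot_assays(row: Dict[str, str], prefix: str = "Pfal") -> Iterator[Dict[str, object]]:
    """Yield one record per numeric measurement found in the 'Pfal' columns."""
    for column, cell in row.items():
        if not column.startswith(prefix):
            continue
        for part in cell.split(";"):
            text = part.strip()
            qualifier = ""
            if text[:1] in ("<", ">"):
                qualifier, text = text[0], text[1:].strip()
            try:
                value = float(text)
            except ValueError:
                continue  # drops text such as '100% at 40 micromolar' and empty cells
            if value <= 0:
                continue  # drops values such as '0' that are not meaningful IC50s
            yield {"Smiles": row.get("Smiles", ""), "Assay": column,
                   "Value": value, "Qualifier": qualifier}


# Hypothetical row; the structure and numbers are placeholders.
example_row = {
    "Smiles": "CCO",
    "Pfal IC50 (GSK)": "0.036; 0.012",
    "Pfal IC50 (Ralph)": ">10",
    "Pfal IC50 (Avery)": "100% at 40 micromolar",
    "Pfal (K1) IC50 (Guy)": "0",
}
rows = list(pivot_assays(example_row))
for r in rows:
    print(r)
```

Keeping one measurement per record leaves the choice of averaging, or of excluding particular strains, to whoever consumes the file.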

edwintse commented 5 years ago

@wvanhoorn To answer your previous question, both are IC50 measurements but are of different strains of the parasite. Most IC50 values are against the NF54 or 3D7 strain. This Pfal (K1) IC50 (Guy) is against the K1 strain which is multi-drug resistant. A description of the column titles can be found here. You would be correct in that averaging these two measurements is not accurate since they are from different strains. I think the PfaI EC50 uMol (Mean) column was just meant to average the relevant potencies.

Yes, those are duplicate entries from the same assay which should be included in the average but aren't.

mmgalushka commented 5 years ago

@wvanhoorn Thanks a lot for cleaning this data! It is much easier to use them.

@edwintse This is a general question about building the regression model. I understand that ideally, we would like to have a model which correctly predicts a potency value in any range. However, if the predicted potency value is greater than 1 uM, does it have any practical meaning in your research? For example, does it make any difference if the predicted potency of a compound is >10 or >25 or >50...?

edwintse commented 5 years ago

I guess it would be a lot less useful to be able to predict inactive compounds accurately so we wouldn't really distinguish between >10, >25 and >50 as being any different

mmgalushka commented 5 years ago

So would it be correct to say that the accuracy of the regression model above a certain threshold (let's say 10) does not matter?

edwintse commented 5 years ago

Yes, that sounds reasonable

mmgalushka commented 5 years ago

@edwintse In the provided dataset there are many records where potency values are ">10" or ">25"... How are you going to evaluate predictions of the submitted models if your experimental results show ">10" or ">25"... outcomes?

For example, one (submitted) model predicted potency value 12, another model predicted 20 and your experimental result showed ">10". Which model is more accurate?

Maybe we should introduce a potency threshold after which the predicted results should be treated the same...

edwintse commented 5 years ago

@mmgalushka As you mentioned before, the ability to differentiate between inactive compounds is not entirely necessary. For instance, if one model predicted a compound to have a potency of 12 uM while another predicted it to be 20 uM, the actual experimental results would depend on the max concentration that the assay was run in (i.e. if the compound gets made and tested, it would return a result of >10 uM regardless).

The accuracy of the models will therefore be determined based on the ability to predict active compounds, say <2.5 uM.

The assay that we currently use has a max concentration of >25 uM; however, I would say that much of this upper range is not terribly useful. So perhaps an upper threshold of >10 uM would suffice.
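As a rough illustration of judging models on their ability to predict actives, the sketch below treats anything under 2.5 uM as active and reports precision and recall for that class. The cutoff is the "say <2.5 uM" figure from above, and the choice of precision/recall is an assumption; the judging panel may use a different measure.

```python
ACTIVE_CUTOFF_UM = 2.5  # illustrative cutoff taken from "say <2.5 uM"

def is_active(potency_um, cutoff=ACTIVE_CUTOFF_UM):
    """A compound counts as active when its potency is below the cutoff."""
    return potency_um < cutoff

def active_precision_recall(predicted, measured, cutoff=ACTIVE_CUTOFF_UM):
    """Precision and recall for the 'active' class.

    `predicted` and `measured` are equal-length lists of potencies in uM.
    Censored measurements such as '>10' can be passed as their limit (10.0),
    since any limit at or above the cutoff is inactive either way.
    """
    tp = fp = fn = 0
    for pred, true in zip(predicted, measured):
        pred_active = is_active(pred, cutoff)
        true_active = is_active(true, cutoff)
        if pred_active and true_active:
            tp += 1
        elif pred_active:
            fp += 1
        elif true_active:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up numbers: one hit, one missed active, one false alarm, one true inactive.
precision, recall = active_precision_recall(
    [0.1, 3.0, 1.0, 12.0],
    [0.5, 0.2, 10.0, 25.0],
)
```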

mmgalushka commented 5 years ago

Thanks, @edwintse, for the clarification! Am I right to say that a submission result would look like this:

Compound	Potency
c1ncc2n(c1Oc1cc3c(cc1)cccc3)c(nn2)c1ccc(cc1)OC(F)F	0.023
CCOc1ccc(cc1OCC)c2nonc2NC(=O)c3cccc(C)c3	0.453
CCOC(=O)c1ccc2nc(cc(O)c2c1)c3ccccc3	>2.5
CCOC(=O)C1=CN(CC)c2cc(N3CCCCC3)c(F)cc2C1=O	1.23
[O-][N+](=O)c1ccc(C=NNc2nc3ccccc3[nH]2)cc1	>2.5
...	...

Note: These are dummy potency values, used only as an example.

So we indicate the potency value up to 2.5, and everything above it is just indicated as ">2.5".
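A minimal sketch of how such a file could be written, assuming the two-column, tab-separated layout shown above with a 2.5 uM cap. The helper names are hypothetical and the layout is only the one proposed in this comment, not a confirmed submission format.

```python
def format_potency(value_um, cap=2.5):
    """Report values up to the cap as numbers and anything above it as '>cap'."""
    if value_um > cap:
        return f">{cap:g}"
    return f"{value_um:g}"

def submission_lines(predictions, cap=2.5):
    """Build tab-separated lines from (SMILES, predicted potency in uM) pairs."""
    lines = ["Compound\tPotency"]
    for smiles, value in predictions:
        lines.append(f"{smiles}\t{format_potency(value, cap)}")
    return lines

# Dummy predictions; benzene and ethanol stand in for real compounds.
example = submission_lines([("c1ccccc1", 1.23), ("CCO", 7.9)])
```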

mmgalushka commented 5 years ago

@edwintse You replied on this post that the compounds for prediction will most likely come from two sources.

Do you have the final set of compounds which we need to predict? Or will it be provided later?

edwintse commented 5 years ago

I will finalise the test set of compounds for the competition early next week and will post it here.

mmgalushka commented 5 years ago

Maybe this data would be useful to someone. I tried to visualize "training" compounds in Series 4 together with "target" compounds announced in #71 and #73 for my research (see visualizations below).

Note: In order to do this visualization, I standardized each SMILES (using MolVS) and converted it into a "fingerprint" (using a variational autoencoder trained on ChEMBL v23).

The visualization of the S4 compounds together with the compounds from #71 is here.

#	Smiles
0	Fc1ccc(CCOc2cncc3nnc(-c4ccc(C(F)(F)F)nc4)n23)cc1F
1	Clc1ccccc1CCOc1cncc2nnc(-c3ccncc3)n12
2	FC(F)Oc1ccc(-c2nnc3cncc(OCCC4COC4)n23)cc1
3	OCC(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1OCc1ccccc1
4	OCc1ccc(COc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1
5	c1ccc(CCOc2cncc3nnc(C4CCNCC4)n23)cc1
6	COc1ccc(CCOc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1
7	COc1ccc(CCNc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1OC
8	OCC(COc1cncc2nnc(-c3ccc4cc[nH]c4c3)n12)c1ccccc1
9	OC@Hcc3)n12)c1ccccc1
10	FC(F)Oc1ccc(-c2nnc3cncc(SCCc4ccccc4)n23)cc1
11	O=C(O)C(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1
12	FC(F)Oc1ccc(-c2nnc3cncc(OCCOc4ccccc4)n23)cc1
13	O=C(Nc1cccc(Cl)c1)c1cncc2nnc(-c3cccnc3)n12
14	COc1ccc(-c2nnc3cncc(OCCc4ccccc4)n23)cc1

The visualization of the S4 compounds together with the compounds from #73 is here.

#	Smiles
0	Fc1ccc(CCOc2cncc3nnc(-c4ccc(C(F)(F)F)nc4)n23)cc1F
1	Clc1ccccc1CCOc1cncc2nnc(-c3ccncc3)n12
2	FC(F)Oc1ccc(-c2nnc3cncc(OCCC4COC4)n23)cc1
3	OCC(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1OCc1ccccc1
4	OCc1ccc(COc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1
5	c1ccc(CCOc2cncc3nnc(C4CCNCC4)n23)cc1
6	COc1ccc(CCOc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1
7	COc1ccc(CCNc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1OC
8	OCC(COc1cncc2nnc(-c3ccc4cc[nH]c4c3)n12)c1ccccc1
9	OC@Hcc3)n12)c1ccccc1
10	FC(F)Oc1ccc(-c2nnc3cncc(SCCc4ccccc4)n23)cc1
11	O=C(O)C(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1
12	FC(F)Oc1ccc(-c2nnc3cncc(OCCOc4ccccc4)n23)cc1
13	O=C(Nc1cccc(Cl)c1)c1cncc2nnc(-c3cccnc3)n12
14	COc1ccc(-c2nnc3cncc(OCCc4ccccc4)n23)cc1
15	Fc1ccc(-c2nnc3cncc(OCCc4ccccc4)n23)cc1
16	O=C(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1
17	CCN(CC)C(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccc(F)c(F)c1
18	O=C(Nc1ccnc(C(F)(F)F)c1)c1cncc2nnc(C34C5C6C3C3C4C5C63I)n12
19	O=C(c1ccc(-c2nnc3cncc(OCCc4ccc(F)c(F)c4)n23)cc1)N1CCOCC1
20	CN(C)c1ccc(C(O)(CO)COc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1
21	OCC(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccc(O)cc1
22	Nc1ccc(C(CO)COc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1

wvanhoorn commented 5 years ago

I have a question re compounds OSM-S-418, OSM-S-424 and OSM-S-564. I think they all contain a carborane but the connectivity is odd since it consists of a single large ring with Boron/Carbon atoms. In contrast to this, MMV1794644 in the prediction set contains a carborane in the expected cluster form. The carborane of OSM-S-564 may be the same as the one in MMV1794644? If they are the same the representation should be the same.

edwintse commented 5 years ago

Apologies for the late reply. They are all carboranes and should be represented most accurately in their cluster forms. The appropriate SMILES for these compounds are as follows:

OSM-S-418: FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCCC4567[BH]89%10[CH]%11%124[BH]8%13%14[BH]%11%15%16[BH]%13%17%18[BH]%149%19[BH]%105%20[BH]%21%226[BH]%17%15([BH]%22%12%167)[BH]%18%19%20%21)N32

OSM-S-424: FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCCC456[BH]78[BH]49([BH]%10%118[BH]%129%13%14)[BH]%15%145[BH]%16%17%13[BH]%18%10%12[BH]7%19%11[H-][BH]%19%18%16[CH]%17%156)N32.[Cs+]

OSM-S-564: FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCCC4567[BH]89%10[BH]%11%124[BH]8%13%14[CH]%11%15%16[BH]%13%17%18[BH]%149%19[BH]%105%20[BH]%21%226[BH]%17%15([BH]%22%12%167)[BH]%18%19%20%21)N32
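Cluster SMILES like these lean heavily on two-digit ring-closure labels (%10, %11, ...), which are easy to damage when copying from a rendered page. The sketch below is a plain-Python sanity check, not a substitute for a real parser such as RDKit: it only reports ring-closure labels that are opened but never closed, and says nothing about whether the chemistry is valid.

```python
def unpaired_ring_closures(smiles):
    """Return ring-closure labels that appear an odd number of times.

    Digits inside bracket atoms (isotopes, hydrogen counts, charges) are
    skipped; outside brackets a single digit or '%' plus two digits is a
    ring-closure label. A label may be reused once it has been closed.
    """
    open_labels = set()
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Jump past the whole bracket atom, e.g. [BH], [Cs+] or [13CH4].
            i = smiles.index("]", i) + 1
            continue
        if ch == "%":
            label = int(smiles[i + 1:i + 3])  # two-digit label such as %10
            i += 3
        elif ch.isdigit():
            label = int(ch)
            i += 1
        else:
            i += 1
            continue
        open_labels ^= {label}  # toggle: first sight opens, second closes
    return sorted(open_labels)

# Benzene is balanced; the truncated string leaves ring 1 open.
balanced = unpaired_ring_closures("c1ccccc1")
truncated = unpaired_ring_closures("c1ccccc")
```

An empty list means every ring bond is paired; a non-empty list suggests that part of the string was lost or altered in transit.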

edwintse commented 5 years ago

I have just posted the final test set compounds for the competition in a new issue (here).

HOWEVER, please be aware that there is a high chance that the competition deadline will be extended past September 11th.

I will update as soon as I find out the exact details.
