OpenSourceMalaria / OSM_To_Do_List

Action Items in the Open Source Malaria Consortium

Master list consolidation #533

cdsouthan opened 7 years ago

cdsouthan commented 7 years ago

Updating, checking and optimising the Master List (ML)

https://docs.google.com/spreadsheets/d/1Rvy6OiM291d1GN_cyT6eSw_C3lSuJ1jaR7AJa8hgGsc/edit#gid=510297618

This is a general task but crucial for the upcoming Series 4 paper in particular. Ideally this ML should be migrated/transformed into a small open database, but that's a task for the future. I hope the suggestions below do not come across as pedantically over-prescriptive (a.k.a. a counsel of perfection), but they are based on the many quirks/foibles/gotchas I have had fun (mostly, but with some exasperation also) ferreting out, divining and grappling with over the years in both databases and papers (see https://cdsouthan.blogspot.se/). Many of us will need to pitch in here, but I have assigned it to our esteemed first author since it is crucial not only for the nascent paper but also any subsequent ones.

JFTR I do not want to take the responsibility for actually editing the sheet. This is better done by those directly engaged with making the structures and generating the data (even as inputs from collaborators). I also suggest team members do such editing in pairs to cross-check inputs and changes (cup of coffee job?) and that senior authors keep abreast of how things are going on the ML front.

Aspects of the ML have come up in previous posts concerning Google indexing https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/511 and direct visualisation https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/515

cdsouthan commented 7 years ago

The order below is my ad hoc ranking, which may not correspond to the likelihood of, or time needed for, fixes. I'm kicking off with just outlines that can be expanded if clarifying details are needed.

  1. Assays: the biggest scientific issue is whole-parasite assay heterogeneity across the series. The fact that we get activity results within roughly the same order of magnitude between different strains (e.g. K1 vs 3D7) and different labs across the globe is actually reassuring. However, we cannot generate reproducible SAR by inter-assay comparison or averaging. This is obvious from comparing the figures, which show 2- to 5-fold inter-assay differences (that are probably systematic). I thus suggest the auto-average column be removed, since it is frankly perilous. This serious methodological problem is likely to be jumped on by referees. We thus need at least the front runners all to be re-run in the same lab (ask the Avery group for those solid 21-point IC50s?). This is likely to result in some re-shuffling of the SAR conclusions. Other aspects to consider include the variable threshold cut-offs for "inactive", viz. 10, 40 or 50 uM. It would also be good if our collaborators provided more data on intra-assay variance via the technical replicates (these can go into the sheet even though we probably can't extend to +/- ranges throughout the paper). We should also ask those folk nicely if they could round down to at most 3 significant figures at their end (although the realistic experimental variance is way higher than 3 figures imply) rather than us doing it at this end, since mismatched rounding can result in "fuzzy duplicates" in the public domain (a small rounding sketch follows this list). The issues above need to be resolved before we make new submissions to PubChem BioAssay or ChEMBL.

Addendum 15/08: While the arguments above still stand (but are open to discussion as ever), I just re-read our esteemed S1 paper https://www.ncbi.nlm.nih.gov/pubmed/27800551. This revealed that not only did the drafters of the supp. data sections do a good job of partitioning the inter-assay SAR, but they also included some honest notes on variance issues. Ipso facto, while I don't remember the details of the referees' comments, we did make it in, despite a certain amount of "ducking and weaving" through the heterogeneous assays. Notwithstanding, being a couple of years further on, we now know it would be more rigorous to re-run all the S4 SAR in one robust, standardised assay (NOBA our large PubChem BioAssay submission would be based on this normalised data). This is also likely to concomitantly clamp down the intra-assay variance. Note that this in no way precludes the contributors from the different assays so admirably supporting the S4 work being on the author list, even if only the re-run data ends up being included.

  2. Gaps: There are holes and ambiguities of various types for some of the structure records, e.g. what to do about record 50, the missing IK in 155, etc. Resolving these should be a priority. This should include referential-integrity checking of the whole sheet (i.e. finding any gaps where we expect complete occupancy). NOBA, I have seen various types of corruption when I make a local MS Excel copy followed by sorting (e.g. by series), but maybe these are my local Windoze/MS word-wrap problems related to InChI extensions. Note also, as we move things around between master-sheet mirrorings, wikis and other instantiations, we need to be vigilant about gaps or other changes generated during such propagations. Examples have already been spotted, as in https://github.com/OpenSourceMalaria/Series4/issues/1

  3. Round tripping: Designers/synthesisers of new (or old) structures should do their own "round tripping" checks (and document them in their ELNs at least) to ensure that the SMILES and InChI strings interconvert and spawn the same InChIKey (an RDKit sketch follows this list). Then Google the inner layer of the Key (https://www.ncbi.nlm.nih.gov/pubmed/23399051) and, on a good day, you should hit your own open ELN (full-key searches can give false +ves). Check also for unambiguous renderings in ChemDraw as well as Marvin. NOBA when they are in, or get to, PubChem they should look the same in their CID renderings (not just our SIDs). Note also that any isomers will need isomeric SMILES, not the canonical (flat) ones. JFTR an issue of SMILES not converting has already been picked up in a recent batch of compounds whose results have come back: https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/517#issuecomment-315574384. We also need to think carefully about ID assignment for purified (or at least highly enriched) isomeric splits, but see item 9 below. I'm not sure where we are with SD files and IUPAC strings back in the ELNs (for InChI/SMILES round-trip checking Chemicalize, the free ChemAxon tool, is useful since it converts the sheet as a webpage, but it's glitching for me right now).

  4. Intermediates: someone should please tag reagents and purified intermediates in a new column (just "int" would do) so they can be cleanly subsetted. Notwithstanding, to be prudent, all stable, purified and soluble intermediates should be run through the assay, just in case (we wouldn't want to miss out on FBD opportunities, would we?)

  5. PubChem CIDs: (see https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/528) For the paper we need a full house for the S4 leads at least (note some of our published S1s are still missing). I can help with this, but I could do with someone assisting me with (externally first) list joining inside the sheet for what we already have (a look-up sketch follows this list). As of 9th Aug we only have 167 CID matches out of 320 S4 rows in the sheet. Making a rough cut of the intermediates below 270-300, it looks like only ~130 are leads. We thus have a big shortfall.

  6. Synonyms: We need to keep on top of internal "synonym spaghetti", or ID <> structure mapping errors can creep in and propagate (not just inside the project but across the globe, in fact). These can be compounded by external database name-handling rules and become very difficult to unravel and retro-fix. I guess we are defaulting to OSM numbers as the primary IDs, but these need to be completed for the gaps, including the new MMVs. It might have been better to standardise the whole OSM series to 3 digits (i.e. 001 rather than 1), but unless this can be transitively propagated backwards into the originating ELNs we may have to live with it.

  7. Reproducibility: This is more an add-on than a core requirement. As we know, there is a big palaver these days (correctly so) about reproducibility across the experimental biological sciences. As an open project we have a unique advantage in being able to directly test various aspects of this and surface the results, even if some of what we find may not be such good news. There are some cases in the project of re-synthesis and re-testing of a compound (e.g. https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/525). This type of approach would go down well as adjunct data in our paper, especially since it is hardly ever addressed (any known examples in print?)

  8. Virtual vs real: I don't know if/where such a rule is explicit, but I assume all the ML compounds are a) in a "pot" somewhere and b) unequivocally supported by the analytical data for the specified structure. I also notice there is a lot of designing going on, which is groovy of course, with I guess at least an eye on synthetic feasibility/tractability. As ever, it's not a bad idea to pop these structures and substructures against PubChem, just in case they are close to something. I have mentioned elsewhere that there are good arguments for submitting these virtuals to PubChem in advance of anyone actually getting them into a pot (and thence to testing). The big advantage is the automatic 2D/3D clustering that all new CIDs undergo. This means you can perceive/visualise the designs within the chemical space of our own and other extant molecules. NOBA, when the results come in the project already has the CIDs to put in the ELN, ML, progress reports and of course papers (n.b. I can advise on the SID submissions, which can be clearly tagged as virtuals in the first instance). The small number of virtuals we are talking about will be no problem for PubChem, which contains millions of virtuals/MODs anyway (long story).

  9. Isomerism: We also need to think carefully about assigning distinct new IDs for purified (or at least highly enriched) isomeric splits (e.g. potentially three entries: R, S and flat). Note that we could consequently detect three different activities, if the potency measurements are accurate enough to separate these outside the error ranges.
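
As promised in item 1, here is a minimal rounding sketch in plain Python (no dependencies). The idea is to trim assay values to at most 3 significant figures once, at source, so that differently rounded copies of the same number cannot become "fuzzy duplicates" downstream; the IC50 values below are made up for illustration.

```python
# Round assay values to a fixed number of significant figures so that
# all copies of the Master List carry identical numbers.
from math import floor, log10

def round_sig(value, figures=3):
    """Round a number to the given number of significant figures."""
    if value == 0:
        return 0.0
    return round(value, -int(floor(log10(abs(value)))) + (figures - 1))

ic50_um = [0.4236718, 12.34567, 103.4912]   # hypothetical IC50s in uM
print([round_sig(v) for v in ic50_um])      # -> [0.424, 12.3, 103.0]
```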
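For item 3, a minimal round-trip sketch. I have assumed RDKit as the toolkit (the same check can of course be run through Marvin/Chemicalize or any other InChI-aware software); the SMILES is a chiral alanine as a stand-in, not an OSM compound.

```python
# Check that a SMILES survives SMILES -> mol -> InChI -> mol and that
# both directions spawn the same InChIKey (the robust equivalence test).
from rdkit import Chem

def round_trip_ok(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False                 # SMILES did not parse at all
    inchi = Chem.MolToInchi(mol)
    mol2 = Chem.MolFromInchi(inchi)
    if mol2 is None:
        return False                 # InChI did not convert back
    # Isomeric SMILES is RDKit's default, so stereocentres are kept and
    # will be reflected in the InChIKey comparison below.
    return Chem.MolToInchiKey(mol) == Chem.MolToInchiKey(mol2)

print(round_trip_ok("C[C@H](N)C(=O)O"))   # chiral alanine; expect True
```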
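And for item 5, a hedged sketch of the CID look-up, using PubChem's PUG REST service keyed on InChIKey (the endpoint pattern is the documented PUG REST one; the example key is caffeine's, which should return CID 2519).

```python
# Look up the PubChem CID for a compound via its InChIKey.
import requests

def cid_for_inchikey(inchikey):
    url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/"
           f"inchikey/{inchikey}/cids/TXT")
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        return None                  # no match (or a service error)
    return resp.text.split()[0]      # first CID if several map

print(cid_for_inchikey("RYYVLZVUVIJVGH-UHFFFAOYSA-N"))  # caffeine -> 2519
```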

cdsouthan commented 7 years ago

It has been brought to my attention (in a friendly way of course) that, in order to get some engagement on the big themes above, I should split the objectives into bite-size chunks. While I don't think the esteemed collaborators around here really need things to be Mickey-Moused, we can try some chunking and see if that gets any traction.

1) Could someone (@david1597 perhaps) please fill in the missing OSM numbers in the sheet?

david1597 commented 7 years ago

Sure. Breaking this down will be the easiest way to go.

I'll start where you suggest and get the OSM numbers in the sheet.

mattodd commented 7 years ago

Agreed. To contain the task I'd recommend focussing on the Series 4 compounds for now. Also @cdsouthan, just re the distinction between "final" compounds and synthetic intermediates: we have this already (fully, I think, for Series 4, less so for others) in that all biologically evaluated compounds have MMV numbers. Can we just use that as the filter?

david1597 commented 7 years ago

OK, everything in the sheet now has its OSM number. There were a couple of Sydney compounds, around 20 Edinburgh compounds and around 100 inherited compounds. The inherited compounds were designated 'X', as we're not sure where they were synthesised.

cdsouthan commented 7 years ago

Good. While I woke up in the night wondering if adding a synonym was such a good idea after all (seeing that, as a general principle, we want as few as possible), I hope it helps in the end that "we" have our full set of primary-identifier ducks lined up. OK, so mystery compounds seem somewhat paradoxical in open drug discovery (but let's hope the referees overlook that...). Moving swiftly on, then:

Re 2 above: please do a referential integrity check that should include, but is not restricted to, a) absence of gaps for all the molecular specifications, b) no corruptions, c) no duplicates in any columns, d) checking that the sheet behaves itself for common operations, e.g. text <> CSV <> Excel <> LibreOffice (or whatever), and that column sorting and table operations also work OK (a rough pandas sketch of such checks follows below). Also archive a copy, just in case... As ever @david1597, try to find an OSM friend to do this tedious but crucial x-checking with you (@mattodd will buy them a beer, and maybe a pizza for you too).
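
A minimal sketch of what such a pass could look like, assuming the sheet has been exported to CSV and that pandas is available; the file and column names are hypothetical stand-ins for the Master List's real headers.

```python
# Referential-integrity pass: gaps in the molecular specification
# columns, duplicates where values must be unique, and a CSV round-trip.
import pandas as pd

ml = pd.read_csv("master_list.csv")        # exported copy of the sheet

spec_cols = ["OSM_number", "SMILES", "InChI", "InChIKey"]   # hypothetical
for col in spec_cols:
    gaps = ml[ml[col].isna()]
    if not gaps.empty:
        print(f"{col}: {len(gaps)} missing, rows {list(gaps.index)}")

for col in ["OSM_number", "InChIKey"]:     # IDs that must be unique
    dupes = ml[ml.duplicated(col, keep=False)].sort_values(col)
    if not dupes.empty:
        print(f"{col}: duplicated values\n{dupes[[col]]}")

# Round-trip: write out and re-read; any mangling (word wrap, dtype
# drift, truncation) shows up as a failed comparison.
ml.to_csv("master_list_check.csv", index=False)
assert ml.equals(pd.read_csv("master_list_check.csv"))
```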

n.b. 01 Adding fresh synonyms raises the back-propagation issue. At some point it would be good to arrange for a named MMV person to have a copy of our optimised sheet, so they are on the record as being in possession of these new name-to-structure (n2s) mappings for their compounds.

n.b. 02 I take @mattodd's point, if the MMV filter gives a clean cut from isolated intermediates, but why not run them in the assay anyway? We can then simply push the entire S4 structure set to PubChem at some point.

david1597 commented 7 years ago

The integrity checks will likely come as we write up the experimental section for the Series 4 paper. @edwintse and I should hopefully be getting through these in the near future.