OpenSourceMalaria / OSM_To_Do_List

Action Items in the Open Source Malaria Consortium

Assigning OSM Codes to Chiral Compounds #172

Open mattodd opened 10 years ago

mattodd commented 10 years ago

We need to assign OSM codes to inherited compounds that currently only have MMV codes, since compounds in the project need to have their data collected together. I just did the first:

http://malaria.ourexperiment.org/osm_procedures/9557/Preparation_of_OSMS175.html

But in doing the second (MMV669844) I hit a snag. Raising these questions:

1) Should enantiomers have different compound IDs? Assuming yes.
2) Should a racemate have a further, different ID? Assuming yes.
3) What about the continuum of scalemic possibilities - how do we number these? Do we worry?
4) What do we do for samples which are supposedly enantioenriched (i.e. prepared with a stereoselective reaction) but for which the enantiomeric excess has not been measured?

These questions are relevant here since MMV669844 was prepared (by a CRO) with an asymmetric reaction, so is shown with the expected stereocentre, but the ee was not measured. How do we number this? I'm assuming the answer is that we give unique numbers to enantiomers and racemates, and simply include asterisks when we're not sure of the ee, but I'd be interested to hear from @cdsouthan @madgpap and @murrayfold who will have dealt with this in the past.

This paper (http://www.jcheminf.com/content/4/1/11) adopts the molecule-substance-batch approach, so we could use OSM-S-XXX-Y-Z if we had to, but we've not been doing this to date.

murrayfold commented 10 years ago

1) Should enantiomers have different compound IDs?

I'd say yes.

2) Should a racemate have a further, different ID?

Again, I'd say yes

3) What about the continuum of scalemic possibilities

No, I think we decide on an ee threshold; anything below this is considered racemic.

4) What do we do for samples which are supposedly enantioenriched (i.e. prepared with a stereoselective reaction) but for which the enantiomeric excess has not been measured?

Same as 3, I think; otherwise it just becomes a minefield of numbers. And, as has been shown in the PZQ project, ee is not directly proportional to activity.

cdsouthan commented 10 years ago

Agreed, this is analogous to the PubChem rules. If you read our MIABE recommendations (http://www.ncbi.nlm.nih.gov/pubmed/21878981) there was a comment published on (our omission of) enantiomeric purity. I agree the ee experimental results should be ELN captured. Here again the project runs into the classic med chem issues for which at least approximate solutions have been evolved over the years. This needs not only sample management but also physical tracking (barcoding?) of ee batches that could vary in activity. The problem of splitting the MMV numbers remains, for which suffixes could be a pragmatic solution (e.g. A and B).

mattodd commented 10 years ago

So we use distinct IDs for enantiopure and rac. If a compound is suspected of being enantioenriched (because it has been prepared with an enantioselective reaction) we assign the same number as the enantiopure material? If the enantiomeric excess has been measured we append with a number, e.g. OSM-S-175-78?

What about having OSM-S-175-E to indicate a likely but indeterminate enantiomeric excess? No compound is ever enantiomerically pure, so we should always have some suffix describing enantiopurity, no, given how important this is for activity? Counter to this argument, the biological activity difference between rac and enantiopure is rarely that significant unless one is unlucky (vs activity of two enantiomers) so we need to limit the effort here.

Batches are easily captured using sample IDs used in the lab notebooks, i.e. with a second number that involves the chemist's initials. We ought to include such things in the biological screening data sheets as a separate column, but I think there's no need to make the compound ID more cumbersome with that.

alintheopen commented 10 years ago

I think in terms of enantioenriched materials, we should only describe things as a single enantiomer if they are, say, >95% or >90% ee, and otherwise we should view them as racemic, as we can't comment on them with any degree of accuracy. I'm not sure that adding another letter (e.g. 'OSM-S-175-E') is a good idea. I think it might confuse rather than clarify.
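A minimal sketch of the threshold rule suggested here, assuming ee is obtained from the two enantiomer peak areas of a chiral HPLC trace. The function names and the default threshold of 90% are illustrative assumptions, not an agreed project convention.

```python
from typing import Optional


def enantiomeric_excess(major_area: float, minor_area: float) -> float:
    """Return ee in percent from the peak areas of the two enantiomers."""
    total = major_area + minor_area
    if total <= 0:
        raise ValueError("peak areas must sum to a positive value")
    return 100.0 * abs(major_area - minor_area) / total


def classify_sample(ee_percent: Optional[float], threshold: float = 90.0) -> str:
    """Describe a sample as a single enantiomer only above the threshold.

    Samples with an unmeasured ee (None) or an ee at or below the
    threshold are treated as racemic. The threshold is an assumption.
    """
    if ee_percent is not None and ee_percent > threshold:
        return "single enantiomer"
    return "racemic"
```

With these definitions, `classify_sample(enantiomeric_excess(97.5, 2.5))` gives `"single enantiomer"`, while `classify_sample(70.0)` and `classify_sample(None)` both give `"racemic"`.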

mattodd commented 10 years ago

Another point - am I right in thinking that SMILES for enantiomers are different, while InChI is the same? We want to avoid the situation where someone searches one of the project-related spreadsheets for a molecule with a defined stereocentre and misses the racemate.

murrayfold commented 10 years ago

I think it depends on how they are generated. Both have the ability to assign stereocenters, but it's maybe not always the case.

madgpap commented 10 years ago

Both representations distinguish between enantiomers or non-chiral/racemate versions. See for example https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL175 vs. https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL521 Most modern chemistry databases (including ChEMBL and UniChem) implement 'flexible' (= stereo invariant) searches.
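To illustrate the point, here is a rough sketch of how a stereo-invariant comparison could be done on InChI strings alone, by dropping the stereo layers (`/b`, `/t`, `/m`, `/s`) before comparing. Alanine is used as a stand-in rather than a project compound, and the helper names are hypothetical; a real database would use its own search machinery rather than string surgery like this.

```python
# Stereo layers of an InChI: /b (double bond), /t (tetrahedral),
# /m (inversion marker) and /s (stereo type).
STEREO_LAYER_PREFIXES = ("b", "t", "m", "s")

# Illustrative strings for alanine (a stand-in, not an OSM compound).
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
FLAT_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"


def strip_stereo_layers(inchi: str) -> str:
    """Return the InChI with its stereo layers removed (hypothetical helper)."""
    version, *layers = inchi.split("/")
    # The formula layer starts with an uppercase element symbol, so it
    # never matches the lowercase stereo prefixes.
    kept = [layer for layer in layers if layer[:1] not in STEREO_LAYER_PREFIXES]
    return "/".join([version, *kept])


def same_connectivity(inchi_a: str, inchi_b: str) -> bool:
    """True if two InChIs differ at most in their stereo layers."""
    return strip_stereo_layers(inchi_a) == strip_stereo_layers(inchi_b)
```

Here the two enantiomers have different full strings, yet `same_connectivity(L_ALANINE, D_ALANINE)` and `same_connectivity(L_ALANINE, FLAT_ALANINE)` are both `True`, which is the behaviour a "flexible" search relies on.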

mattodd commented 10 years ago

OK, so we need not worry about strings if the databases are happy with searches where the stereocentre configuration is variable (i.e. a search for a structure with an undefined stereocentre returns search results that include racemates and enantiopure compounds). We still have a decision to make on compounds with:

1) A measured ee value that is intermediate, e.g. 70%.
2) A suspected enantiomeric excess that has not been measured (e.g. a compound prepared with the intention of giving an ee, the outcome of which has not been measured).

The scientifically appropriate thing to do with such compounds is to classify them as racemic (as Alice suggests), since highly enantioenriched compounds are special. But when one is searching for examples of enantioenriched compounds one presumably wants to see all the examples where people have attempted to generate an ee as well (including examples where people obtain modest enantioenrichment), meaning it would be more useful to classify a "hopeful" ee along with the "known" enantioenriched samples by giving them the same codes.

mattodd commented 10 years ago

Referenced in the agenda of the July 2014 meeting (http://malaria.ourexperiment.org/osddmalaria_meeting_/10142/OSM_Online_Project_Meeting_8_3rd_July_2014.html) since this arose again: Alice has made OSM-S-208, a racemate. Clearly anyone interested in this compound will also be interested in the same structure, MMV669844, but as it stands these two ought to have different codes, since MMV669844 was synthesised with an asymmetric reaction, though its ee was not measured, so we don't even know whether it's racemic. Unless these two compounds are somehow linked, it will be easy to lose connections between data.

To my mind it still seems neatest to have a suffix for chiral compounds:

OSM-S-208 for racemic (synthesised as racemic and therefore racemic)
OSM-S-208E for enantioenriched, where the level of enantioenrichment, or its sense, is unknown
OSM-S-208S and OSM-S-208R for compounds with established enantiomeric excesses (and the sense of that excess)
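The proposed suffix scheme could be handled along these lines. This is only a sketch: the regular expression, the `SUFFIX_MEANING` table and the function names are assumptions made for illustration, and no such code exists in the project.

```python
import re
from collections import defaultdict

# Parent code plus an optional single-letter stereo suffix (assumed format).
CODE_RE = re.compile(r"^(OSM-S-\d+)([ERS]?)$")

SUFFIX_MEANING = {
    "": "racemic (synthesised as racemic)",
    "E": "enantioenriched, level or sense of ee unknown",
    "R": "established ee, (R) in excess",
    "S": "established ee, (S) in excess",
}


def split_code(code: str) -> tuple:
    """Split a code such as 'OSM-S-208E' into ('OSM-S-208', 'E')."""
    match = CODE_RE.match(code)
    if match is None:
        raise ValueError(f"not a recognised OSM code: {code!r}")
    return match.group(1), match.group(2)


def group_by_parent(codes) -> dict:
    """Group suffixed codes under their shared numerical parent code."""
    groups = defaultdict(list)
    for code in codes:
        parent, _suffix = split_code(code)
        groups[parent].append(code)
    return dict(groups)
```

For example, `group_by_parent(["OSM-S-208", "OSM-S-208E", "OSM-S-175"])` returns `{"OSM-S-208": ["OSM-S-208", "OSM-S-208E"], "OSM-S-175": ["OSM-S-175"]}`, so every stereo variant ends up under the same parent code.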

This preserves differences of code for rac, scalemic, enantiopure but maintains an obvious connection between the samples. Thus we'd be able to have all these compounds grouped in the same compound page for OSM-S-208. Isn't that what you'd like to see when you go here:

http://malaria.ourexperiment.org/osm_procedures/9907/OSMS208.html

Compounds synthesised non-racemically could be assigned the E code until they are measured at which point they could be assigned the other codes. We could, if people wanted, assign a threshold of e.g. 80% ee that distinguishes racemic from enantiopure. That's less important in my view.
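As a sketch of this assignment rule: a compound made non-racemically carries the E code until its ee and sense are known, then moves to R or S. The function name, its parameters and the handling of the optional threshold (below it, fall back to the plain racemic code) are assumptions for illustration only.

```python
from typing import Optional


def assign_code(
    parent: str,
    made_racemic: bool,
    ee_percent: Optional[float] = None,
    configuration: Optional[str] = None,
    threshold: Optional[float] = None,
) -> str:
    """Return the suffixed code for a sample of the given parent compound.

    parent        -- numerical code such as 'OSM-S-208'
    made_racemic  -- True if the synthesis was not stereoselective
    ee_percent    -- measured ee, or None if not measured
    configuration -- 'R' or 'S' (sense of the excess), or None if unknown
    threshold     -- optional ee cut-off (e.g. 80.0); an assumption
    """
    if made_racemic:
        return parent
    if ee_percent is None or configuration is None:
        # Probably enantioenriched, but level or sense not established.
        return parent + "E"
    if configuration not in ("R", "S"):
        raise ValueError("configuration must be 'R' or 'S'")
    if threshold is not None and ee_percent < threshold:
        # Below the optional cut-off the sample is treated as racemic.
        return parent
    return parent + configuration
```

So `assign_code("OSM-S-208", made_racemic=False)` gives `"OSM-S-208E"`, and once a measurement is in hand, `assign_code("OSM-S-208", False, 92.0, "S")` gives `"OSM-S-208S"`.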

Would we be committing informatics-cide by having some codes longer than others @cdsouthan ?

cdsouthan commented 10 years ago

Hmm, JFTR, I don't accept the mantle of "database policeman" (it's futile anyway), but I can merely state what empirically makes findability and searchability in support of this project easy or difficult from where I sit. I thus suggest code-splitting/forking via suffixes or any other extension in your public identifiers is not a good idea (yet), since there is no precedent for e-codes. It is simpler just to stick to "flat" or R/S and E/Z resolved, as best represents and fits the analytical data. They would also get InChIs for what you have actually made (and someone else could make), and all the isomers are x-mapped via PubChem "same connectivity" anyway. Internally (but still "open" in the Google sense) you can obviously do whatever has utility in your internal registration system, e.g. adding synthesis batch nos and ee numbers as code extensions. However, I suggest they only become valuable externally if they robustly split the bioassay data (e.g. your ee batches give significantly different IC50s). If this proves to be so, it does then raise the interesting precedent of splitting the external ID via suffix codes in the same way, but also complications for public assay result mapping with the "same" structures (though you could add the code in the SID records).

murrayfold commented 10 years ago

The only problem with the "E" suffix is what if the compound has a double bond of unknown/mixture of configurations?

I accept this is not likely, but is possible

mattodd commented 10 years ago

Perfectly reasonable point, yes, but the issue is whether suffixes are palatable. So we could pick another letter. Again, for me, it's the fact that there is probably an ee, but we don't know what it is. This, to me, is a different class of compound from one where we know the compound is rac (since we made it that way) vs one where we have measured the ee and can specify what it is. It's uncertain, like Schrödinger's cat before you look at it. The value of a tag is to highlight the uncertainty. We don't have other forms of uncertainty - for example of purity, since the assumption is made that a compound is purified before we're interested in it. But ee can range from 0 to 100, so the uncertainty range is large.

Again, there will rarely be times when this is crucial, given the likely differences in biological activity between rac and enantiopure. It's more about capturing something to do with the synthesis, or the intent of the synthesis, that might be useful. If the cost of introducing a suffix is large, then we don't need to do it, but I'm struggling to see what's hard about using a suffix for stereochem. The benefit is that we retain the same numerical code.

MATTHEW TODD | Associate Professor School of Chemistry | Faculty of Science

THE UNIVERSITY OF SYDNEY Rm 519, F11 | The University of Sydney | NSW | 2006 T +61 2 9351 2180 | F +61 2 9351 3329 | M +61 415 274104 E matthew.todd@sydney.edu.au | W http://sydney.edu.au/science/chemistry/research/todd.html | W http://opensourcemalaria.org/

CRICOS 00026A This email plus any attachments to it are confidential. Any unauthorised use is strictly prohibited. If you receive this email in error, please delete it and any attachments.

cdsouthan commented 10 years ago

It's not really a cost issue; it's more that you just need to think out how your internal registration rules can mesh with the external chemistry rules (i.e. in PubChem). As said, you can internally suffix and fork off your "core" codes as much as you like, with as many layers as you like. It's just that when you go external I'd be circumspect about splitting database submissions with suffixes, since they will merge to the same CIDs by default (n.b. you'll have to fit ChEMBL rules first and then PubChem rules). JFTR here is an example of antimalarial assay data being split by SID https://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=52949030&loc=ec_rcs

mattodd commented 10 years ago

OK, so wait - in ChEMBL rac and enantiopure (above a threshold of ee) have different codes, correct? How is a compound numbered that has a non-determined ee? Same as rac?

In OSM most things are external. We have codes for batches, from the lab book, and these codes are different to the OSM codes. The batch codes are XX-Y-Z, where XX is the chemist's initials, Y is the reaction code for a given S/M going to a given product regardless of reagents, and Z is the attempt number. The Y number could be given a suffix if a reaction is being done asymmetrically.
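A possible way to parse such batch codes, sketched under assumptions: initials are taken to be two or more capital letters, and the optional asymmetric-reaction marker on Y is taken to be a single letter. The example values (`MT-12-3`, `MT-12a-1`) are made up for illustration and are not real lab book entries.

```python
import re
from typing import NamedTuple


class BatchCode(NamedTuple):
    initials: str          # chemist's initials (XX)
    reaction: int          # reaction code (Y)
    reaction_suffix: str   # optional marker on Y, e.g. for an asymmetric run
    attempt: int           # attempt number (Z)


# Assumed format: initials, reaction number with optional letter, attempt.
BATCH_RE = re.compile(r"^([A-Z]{2,})-(\d+)([A-Za-z]?)-(\d+)$")


def parse_batch_code(code: str) -> BatchCode:
    """Parse a lab book batch code of the form XX-Y-Z."""
    match = BATCH_RE.match(code)
    if match is None:
        raise ValueError(f"not a recognised batch code: {code!r}")
    initials, reaction, suffix, attempt = match.groups()
    return BatchCode(initials, int(reaction), suffix, int(attempt))
```

With this, `parse_batch_code("MT-12a-1")` yields `BatchCode(initials="MT", reaction=12, reaction_suffix="a", attempt=1)`. Keeping the batch code as a separate field, rather than folding it into the compound ID, matches the earlier suggestion of a separate column in the screening data sheets.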
