What Codes Should MycetOS Compounds Be Given?

mattodd commented 6 years ago

There's been an extensive discussion on Twitter between me and people who know a lot more about this than I do, about the syntax of codes we should give to compounds in this project. The upshot is that there are many good opinions, but no "standard".

Since MycetOS is essentially beginning from scratch today (here's the Master List with the first column conspicuously blank) we have a clean slate and can do whatever we want.

So ... what do we want?

In Open Source Malaria we commit the sin of including hyphens. So OSM-S-5 is the 5th compound to be made in Sydney, for example. I like that we have, in the code, something of the compound's provenance. But it's going to be more important to capture, somewhere, data on i) batches (i.e. separate samples of the same molecule) and ii) whether we have a salt form or not. There's a long-standing question over at OSM (OpenSourceMalaria/OSM_To_Do_List#172) also about whether compounds of different (or even unknown) enantiomeric excess have different codes.

There will be lots of solutions used in closed organisations/projects, but we have the challenge here that MycetOS's molecules are open. We should therefore ask what opportunities there are for direct integration in public databases such as PubChem.

(It feels to me that we need a "colloquial" name on the one hand - something short like MyOS5, that allows us to include such numbers in discussions - as well as longer, formal names for samples and the Master List.)

But let's keep it simple. What format should we use? This Issue is intended to allow for more detailed open discussion than is possible on Twitter.

drc007 commented 6 years ago

Something like acronym_saltid_batchnumber.

MyOS123456_0_5

Which would be molecule 123456 from the project, batch 5 of the free base.

MyOS123456_1_2

Which would be batch 2 of the first made salt.

Having a way of identifying whether a sample is a salt is very helpful when comparing in vitro, in vivo and DMPK data. When anomalies in the biological data arise it is often useful to know if the same batch was tested.
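As a rough illustration of how a code in this shape could be taken apart by software, here is a minimal Python sketch. The names `SampleCode`, `parse_code` and `same_form`, and the regular expression itself, are assumptions made for this example only; nothing here is an agreed project format.

```python
import re
from typing import NamedTuple


class SampleCode(NamedTuple):
    acronym: str   # project prefix, e.g. "MyOS"
    molecule: int  # molecule number within the project
    salt_id: int   # 0 = free base, 1 = first salt made, ...
    batch: int     # batch number of that form


# Hypothetical pattern: letters, molecule number, then _saltid_batchnumber
_PATTERN = re.compile(r"([A-Za-z]+)(\d+)_(\d+)_(\d+)")


def parse_code(code: str) -> SampleCode:
    """Split a code such as 'MyOS123456_0_5' into its four parts."""
    match = _PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid sample code: {code!r}")
    acronym, molecule, salt_id, batch = match.groups()
    return SampleCode(acronym, int(molecule), int(salt_id), int(batch))


def same_form(a: str, b: str) -> bool:
    """True if both codes name the same molecule in the same salt form."""
    return parse_code(a)[:3] == parse_code(b)[:3]
```

With something like this, two assay results could be checked for whether they came from the same salt form before being compared, and the batch field shows whether the same batch was tested.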

drc007 commented 6 years ago

Stereochemistry and identifiers are a thorny issue.

Racemate is given a code number, resolved enantiomers are given different individual codes.

You then have to devise a set of business rules to determine what level of enantiomeric excess is needed.

Similarly with diastereoisomers: isolated individual diastereoisomers can be given unique codes, and what level of de is needed is a project business rule.

cdsouthan commented 6 years ago

As a prelude, note that some of the Twitter points are nested in split reply threads not obvious from Mat's link.

The comments indicate a general division between the parsers and the chemists (but not vs of course...). The former want the simplicity of, say, ABCD123456 (a la CHEMBL, MMV, DNDI etc.), crucially without any punctuation at all and sans prefixes or suffixes. NB they need these to be unequivocally mapped to PubChem SIDs/CIDs and InChIKeys sooner rather than later. Obviously there is the option of 26 sub-codes at "D" and/or another 10 at the first digit (if the collection is not expected to be too humungous).

On the other hand @drc007 represents the chemist camp (with impeccable credentials). Since it is chemists who drive the early stages of these open projects, it makes sense to defer to such important practical considerations in their operational favour, and leave the parsers to wrangle as best they can down the line. However, it looks difficult to instigate these suggested chemistry/business rules in the absence of a robust substance registration system (which is an old chestnut topic in these parts).

The more radical option of using PubChem as a registration system from start to finish, including BioAssay submissions (as outlined in my Twitter comments) would seem a step too far just now :(

For sure CDD https://www.collaborativedrug.com/ comes into the frame as not only looking after the registration but also providing a built-in ELN as well as the normalisation of assay results for intra-CDD data mining (DOI, I'm on the SAB). However, it may not be open enough for parties here (e.g. only sporadic PubChem uploads and without BioAssay entries).

bendndi commented 6 years ago

I'm a full supporter of the nomenclature proposed by @drc007 for salts and batches. The ideal is if the salt code is predetermined for all compounds (e.g. salt code 0 = free base, salt code 1 = HCl, salt code 2 = HBr etc.) but this requires someone to go through and compile that dictionary of salt codes beforehand, and is much more relevant when dealing with a very large and diverse library such as those found in Pharma.
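A salt-code dictionary of this kind could be as small as a lookup table. The sketch below is only illustrative: the three entries are the examples quoted above, while the names `SALT_CODES`, `salt_name` and `register_salt` are hypothetical, as is the idea of simply taking the next free integer for a new salt.

```python
# Salt codes quoted in the comment above; any other code is unassigned.
SALT_CODES = {0: "free base", 1: "HCl", 2: "HBr"}


def salt_name(code: int) -> str:
    """Look up the salt form for a code, failing loudly on unknown codes."""
    if code not in SALT_CODES:
        raise KeyError(f"salt code {code} is not in the dictionary yet")
    return SALT_CODES[code]


def register_salt(name: str) -> int:
    """Return the code for a salt form, assigning the next free code if new."""
    for code, existing in SALT_CODES.items():
        if existing == name:
            return code  # already registered, so reuse its code
    new_code = max(SALT_CODES) + 1  # assumption: codes are handed out in order
    SALT_CODES[new_code] = name
    return new_code
```

Compiling the table up front fixes the meaning of each code across all compounds; growing it on demand is less work at the start but makes the codes depend on the order in which salts happen to be made.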

bendndi commented 6 years ago

Re enantiomers / diastereomers: I'd argue it is essential that each requires its own parent code (i.e. one parent code for the rac, one for each of the enantiomers). I'd argue that the cut-off between a racemate and an enantiomer would be around 80% ee (i.e. 90% of one form dominating).
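The arithmetic behind the 80% ee figure can be checked with a short sketch. The function names and the default cutoff argument are assumptions for illustration; the cutoff itself is a project business rule, as noted earlier in the thread.

```python
def enantiomeric_excess(major_pct: float, minor_pct: float) -> float:
    """Return % ee from the percentages of the two enantiomers."""
    total = major_pct + minor_pct
    if total <= 0:
        raise ValueError("percentages must sum to a positive number")
    return 100.0 * abs(major_pct - minor_pct) / total


def is_single_enantiomer(major_pct: float, minor_pct: float,
                         cutoff_ee: float = 80.0) -> bool:
    """Apply a cutoff: at or above it, treat the sample as one enantiomer."""
    return enantiomeric_excess(major_pct, minor_pct) >= cutoff_ee


# A 90:10 mixture sits exactly on the suggested 80% ee boundary.
ee_90_10 = enantiomeric_excess(90, 10)  # 80.0
```

Whether a sample exactly on the boundary counts as an enantiomer or a racemate is a choice; the sketch uses "at or above", which is an assumption.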

mattodd commented 6 years ago

OK, so at the moment it looks like the consensus is MyOS123456_0_5.

Happy with underscores rather than hyphens. Happy for second digit (0 above) to be salt code, and we can evolve the codes as we go. Happy for batch number to be the last digit. Happy for new codes to be needed for stereoisomers. Likely won't be too common, I'd have thought.

Colloquially we can just ignore the salt and batch codes, in discussions.

Questions:

1) Can we start with MyOS1, or does it need to be MyOS000001? Former will obviously be easier.

2) Everyone happy with our assigning codes retrospectively to all 54 compounds currently in the Master List?

3) Everyone happy with assigning codes to synthetic intermediates, i.e. compounds that have been made but that will not necessarily be biologically evaluated?

4) Can people countenance codes being assigned to target structures, i.e. yet-to-be synthesised? Perhaps only the high-value structures? This can help in the discussion of such targets, and they could be given a "batch" of zero. There are other ways to refer to molecules colloquially, but it can I think still be useful to capture discussion/planning around a compound by using a consistent code.

cdsouthan commented 6 years ago

1) Suggest MYOS000001 (mixed case will cause confusion since searches may or may not ignore it)

3) Since numbers should not be that high, the precautionary principle (you just never know...) indicates that all purified and stable intermediates should routinely be run through the assays (could be embarrassing if it turned out you could have had some FBDD data without knowing..)

4) Sure, give the virtual designs a code. As said before there are arguments for putting them into PubChem because their relationships (including 3D) would be pre-computed

5) Agreed the need to resolve stereoisomers may turn out to be low in practice, since you'd need a fairly stonking activity (and low variance in the assay) with the rac mixture to make it worth doing in the first place

6) If you end up doing a lot of coding and collating just in Excel sheets without a registration system, trouble may brew later (not that a reg system precludes this completely)

mattodd commented 6 years ago

MYOS rather than MyOS, fine, but can we start at MYOS1, or does it have to be MYOS(insert desired number of zeros)1

MFernflower commented 6 years ago

Searching for myos1 on Google shows an Amazon link as the top result: https://www.amazon.com/MyOS1-Find-Out-Good-Really/dp/1465203974 @mattodd @cdsouthan Might have to use something like mycet1, madur1, etc

david1597 commented 6 years ago

Just from the point of view of working with the OSM master list, and also the associated NMR/mass spec data, it would have made it slightly easier and neater to have used the zeros. MYOS0001 etc. Four digits would likely be sufficient?

markussitzmann commented 6 years ago

My honest advice (and I have suffered these kinds of discussions too often): these kinds of nomenclatures, in particular if they also encode characteristics of the compound (like salt code or stereoisomers), are never gonna work. Give it generic numbers and keep the rest as part of the Excel spreadsheet column, database column etc. (whatever you use).

cdsouthan commented 6 years ago

(Hi Markus, small world!) Since I have the good fortune to be a co-author (https://www.ncbi.nlm.nih.gov/pubmed/24533037) with @markussitzmann, I can attest to his expertise (as his publication record does anyway). JFTR the paper has some relevance to this discussion. While I was being rather more circumspect in my quotes "leave the parsers to wrangle as best they can down the line" and "trouble may brew later", the above comment is a more direct warning.

bendndi commented 6 years ago

We can do without the salt codes as part of the main name if people insist (some of the big pharma I work with use this approach, others don't). What I see as essential is that if 2 (or more) samples of the same structure exist, they are both coded with the same parent code, and are then differentiated by a batch-specific identifier. Whether this batch aspect is part of the parent code or is a separate column in the Excel sheet, that's fine. In my experience the former is more common in big pharma, but at DNDi we use the latter.

bendndi commented 6 years ago

To answer some of @mattodd's questions, I'd always start from the beginning with all the zeroes in place so that all compound IDs have the same number of digits... Otherwise certain analysis software can get confused (particularly if, for example, plotting a project progression of a value versus time, with Compound ID as the "time" axis).

One other suggestion I have coming from previous experience in this area would be to start with MYOS100001 (note the 1 at the start of the digits) as the first compound, to avoid issues with losing the redundant zeroes in certain manipulations.

If you want to assign codes to "designs" and to not-tested synthetic intermediates this is in theory fine, and I like the suggestion of giving a design compound batch "0"; but another option I'd suggest (depending on if we go with discrete compound ID and batch ID) is assigning virtual compounds / intermediates a batch ID only and then only using compound IDs for anything which "exists and is scheduled for testing" (again, this is personal experience, based on how we deal with this distinction for some projects at DNDi).
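The point about keeping the same number of digits can be seen with plain string sorting, which is what many tools fall back on when IDs are text. This is a minimal sketch; `make_id` and its six-digit default are assumptions for illustration.

```python
def make_id(number: int, width: int = 6, prefix: str = "MYOS") -> str:
    """Build a zero-padded compound ID, e.g. make_id(1) -> 'MYOS000001'."""
    return f"{prefix}{number:0{width}d}"


numbers = [1, 2, 10, 100]
unpadded = [f"MYOS{n}" for n in numbers]
padded = [make_id(n) for n in numbers]

# Text sorting puts MYOS10 and MYOS100 before MYOS2 ...
sorted_unpadded = sorted(unpadded)
# -> ['MYOS1', 'MYOS10', 'MYOS100', 'MYOS2']

# ... whereas fixed-width IDs sort in the order they were assigned.
sorted_padded = sorted(padded)
# -> ['MYOS000001', 'MYOS000002', 'MYOS000010', 'MYOS000100']
```

If compound ID stands in for time on a plot axis, the first ordering would scramble the project progression, while the second keeps it intact.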

cdsouthan commented 6 years ago

OK so MYOS123456 becomes primary ID but the registration system codes batches and salts (but don't forget PubChem will split any salt you submit) and enantiomers get parent-child new IDs (which you might also want to do for metabolites down the line)

cdsouthan commented 6 years ago

But @bendndi DNDi went with CDD?

bendndi commented 6 years ago

@cdsouthan DNDi don’t use CDD. We use ScienceCloud (Biovia), which is better suited to our requirements. Unfortunately we cannot use ScienceCloud for the MycetOS project.

cdsouthan commented 6 years ago

OK @bendndi. I got confused because the Pollastri lab do use it and they work with DNDi

(who are you, if I may enquire?)

rajarshi commented 6 years ago

A few comments on various aspects raised in this thread

- I second Chris and Markus - encoding meaning in the identifier is handy, especially for chemists at the point of synthesis. But this is not a good long term idea. Batch number is the only meta-data that I could see being included as part of the identifier.
- If batch numbers are used in the identifier, consider using a 2 digit batch number (00, 01, etc).
- "Leave the parsers to wrangle as best they can down the line" is doable and informatics will likely be able to deal with it (or not!). But when this discussion is happening up front, why push ambiguities down the road, rather than resolving them upfront?
- I'd go for MYOS000001 (with however many 0's as decided on) rather than MYOS1.

cdsouthan commented 6 years ago

Agreed, batch IDs are essential for many reasons (including needing to be locked to their own assay results). However, there's an argument for separating the mapping of these from the primary MYOS000001. This is because these can then go straight in as PubChem/BioAssay names/synonyms from SID > CID. Batches cannot, being the same structure, but (as mentioned) they can have their own SID so long as the batch-mapping (presumably to the ELN entry) is locked down internally.

drc007 commented 6 years ago

An interesting discussion. Salt form is absolutely critical once you start to run DMPK or in vivo assays. Using a form like MYOS123456_01_5 where the first six digits refer to parent structure allows for easy discussion, but all experiments and lab notebooks need to use the full notation. Otherwise you can get into the situation where you are trying to compare plasma levels where a free base has been administered with a situation where a salt was used. Batch identification is useful to allow for variations such as different polymorphs, which may have different properties such as melting point, solubility and thus in vivo activities.

mattodd commented 6 years ago

OK, great.

i) Let's say we'll find a cure for mycetoma before we hit 100,000 compounds... So MYOS00001 will be the first compound, giving us up to MYOS99999. Compromise between something easy to read and short (MYOS0001) and something bigger but more difficult to read and likely to introduce human error (MYOS000001). 5 numbers it is. @bendndi I wasn't sure why you wanted a "1" at the beginning there - maybe my compromise of 5 digits reduces the chance of errors? By the time we get to molecule 100 (pretty soon) I think the numbers will be easy to parse.

ii) These numbers can be used colloquially to discuss compounds without reference to batch or salt.

iii) Then the next two numbers are salt code, with 00 meaning free base, so MYOS00001_00.

iv) Then batch comes last. Each batch will, in the master list, be connected with an internal lab book ID also. So MYOS00001_00_01.

v) We can assign MYOS codes to any molecule made or inherited.

vi) Should we need to refer to key molecules that have not yet been made (but clearly need to be) we use a batch code of zero, e.g. "I am aiming to make MYOS05675_00_00". I like this idea @bendndi just because it's been useful when discussing synthesis planning in OSM.

Yay/nay?
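For anyone wanting to generate or check codes in this shape, the scheme above could be sketched in Python roughly as follows. The helper names (`format_code`, `colloquial`, `is_planned`) are hypothetical, and the two-digit batch width is inferred from the MYOS00001_00_01 example rather than stated as a rule.

```python
import re

# MYOS + 5-digit compound number + _2-digit salt code + _2-digit batch
FULL_CODE = re.compile(r"MYOS(\d{5})_(\d{2})_(\d{2})")


def format_code(compound: int, salt: int = 0, batch: int = 1) -> str:
    """Build a full sample code such as 'MYOS00001_00_01'."""
    if not 1 <= compound <= 99999:
        raise ValueError("compound number must be between 1 and 99999")
    if not 0 <= salt <= 99 or not 0 <= batch <= 99:
        raise ValueError("salt and batch codes must be between 0 and 99")
    return f"MYOS{compound:05d}_{salt:02d}_{batch:02d}"


def _match(full_code: str) -> re.Match:
    match = FULL_CODE.fullmatch(full_code)
    if match is None:
        raise ValueError(f"not a full MYOS code: {full_code!r}")
    return match


def colloquial(full_code: str) -> str:
    """Drop salt and batch, leaving the short code used in discussion."""
    return f"MYOS{_match(full_code).group(1)}"


def is_planned(full_code: str) -> bool:
    """Batch 00 marks a molecule that has not been made yet."""
    return _match(full_code).group(3) == "00"
```

For example, `format_code(5675, 0, 0)` gives `MYOS05675_00_00`, which `is_planned` reports as not yet made, and `colloquial("MYOS00001_00_01")` gives `MYOS00001`.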

bendndi commented 6 years ago

Fine for me.

(@mattodd, it's not a deal breaker by any stretch of the imagination (!) but the reason I suggest starting with MYOS10001 is so that in the event of reformatting / annotating IDs to pure numerical form (i.e. MYOS123456 becomes 123456), which someone somewhere may need to do for a specific piece of software, they still all have 5 digits. It seems a bit of an arcane suggestion but I remember running into this exact problem about ten years ago now, when I had to do some annotation acrobatics to get around the problem of purely numerical IDs with varying numbers of digits.... I think it was when my organization finally reached the dreaded XXX99999 situation of having to add an extra digit to the compound ID to accommodate new compounds.)
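The effect described here is easy to reproduce. In this small sketch, `to_numeric` is a hypothetical stand-in for whatever a piece of software might do when it insists on purely numerical IDs; the example IDs are arbitrary.

```python
def to_numeric(compound_id: str, prefix: str = "MYOS") -> int:
    """Strip the prefix and convert the rest to an integer."""
    if not compound_id.startswith(prefix):
        raise ValueError(f"unexpected ID: {compound_id!r}")
    return int(compound_id[len(prefix):])


# IDs that start counting at 00001 lose their leading zeros ...
zero_start = [to_numeric(c) for c in ("MYOS00001", "MYOS00042", "MYOS01234")]
# -> [1, 42, 1234]

# ... while IDs that start at 10001 keep all five digits.
one_start = [to_numeric(c) for c in ("MYOS10001", "MYOS10042", "MYOS11234")]
# -> [10001, 10042, 11234]

digit_counts_zero = {len(str(n)) for n in zero_start}  # {1, 2, 4}
digit_counts_one = {len(str(n)) for n in one_start}    # {5}
```

The trade-off is that a leading 1 uses up one digit of the range, so a five-digit scheme starting at 10001 has room for fewer compounds before an extra digit is needed.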

cdsouthan commented 6 years ago

Almost fine...

1) We need to foresee the consequences of what we eventually push through to PubChem and BioAssay. If we decide to submit salts/mixtures per se these will split three ways to parent, mixture and counterion CIDs. So if the salt code is an addendum that is dropped from the core ID, then that string will synonym-map to 2 different CIDs (i.e. be duplicated). The magnitude of the possible problem depends on how many mixtures we want to push in. I'm guessing very few will actually get in vitro and/or DMPK results that are significantly different between the use of free base or salt in the actual experiment (and the results of which you want to surface in BioAssay). These can be dealt with on a case by case basis, even using the option of a new core code. Note BioAssay suffers from confounding incidences of assay mappings to multiple salts and parents where it is difficult to discriminate between the authentic or spurious reported differences in activity (not to mention zillions of irrelevant mixture CIDs anyway, mainly from patent extractions).

2) As has been discussed in the malaria work another good reason for batch codes is to check reproducibility, i.e. some of the potent front runners need to be re-synthesised just to make sure the re-assayed activity is very close. This crucial control is surprisingly (or perhaps not, depending on your viewpoint...) infrequent in the open med chem literature.

agaulton commented 6 years ago

Sorry, a bit late to the discussion, but from the ChEMBL point of view I'd agree with much of the above:

1) Different codes for racemates and individual stereoisomers (leave to you to decide the cutoff).

2) Distinct identifiers for different salt forms/parent - either a salt code suffix (my preference but not essential) or a different ID. It would cause a lot of mess if the same ID (i.e., just the parent code) was used to refer to data for different salt forms, not least because of the issue @drc007 mentioned with comparing doses for in vivo data (or screening data with weight-per-volume units).

3) Batch is less important as part of the identifier from the ChEMBL point of view, but fine if it's there.

mattodd commented 6 years ago

Thanks @agaulton , very useful.

@cdsouthan - I'd say that the full name (including salt, batch etc) will be important for formal internal use (when molecules are evaluated) but for pushing to BioAssay presumably the colloquial code would be enough, because if anyone wants the detail, they can get it. MMV numbers, for instance, are entered without the suffixes/details, I think?

So - we're happy with the above, where we are committing to using salt/batch for internal documents to track samples e.g. in spreadsheets, lab notebooks (MYOS00001_XX_XX), but we can use the more colloquial numbers in public project discussions and diagrams etc (MYOS00001) to denote the basic structure?

mattodd commented 6 years ago

Thank you all for this useful discussion. I've installed a link to the discussion in the wiki, to prevent it being orphaned. So I'm now closing.

OpenSourceMycetoma / Series-1-Fenarimols

What Codes Should MycetOS Compounds Be Given? #6