OpenSourceMycetoma / Series-1-Fenarimols

Open Source Mycetoma's First Series of Molecules

Duplicate MYOS codes found in Master Sheet for some compound structures #56

Open fantasy121 opened 3 years ago

fantasy121 commented 3 years ago

Currently there are 11 structures that have multiple MYOS codes assigned to them. There could be a few reasons that this happened, as follows:

A) There is a mistake in the structure (eg: a copy-paste error in the SMILES string that gives the same structure to two different compounds).

B) The two entries are the same compound structure but were synthesised separately (a resynthesis). In that case the MYOS codes should be identical except for the last two digits, as seen in the red boxes (compounds 2 and 3: MYOS_00001_00_01 and ...02; compounds 5 and 6: MYOS_00003_00_01 and ...02).

C) The two entries are the same compound, tested twice. Our current code system does not have a way to account for this: only the ascending numbers will be different (ie. the results of the one compound are on different lines). See the blue box, where a compound spans two lines (ascending numbers 19 and 20) with the exact same MYOS code but different biological activities.
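As a rough sketch, the three cases above could be told apart automatically along these lines. It assumes the Master Sheet has been exported to rows of (ascending number, MYOS code, SMILES) and that identical structures have character-identical SMILES strings; SMILES written differently for the same structure would need canonicalising first, which this sketch does not do. The function name and the example rows are hypothetical, not taken from the sheet.

```python
from collections import defaultdict


def classify_duplicates(rows):
    """Group rows by SMILES and label each duplicated structure A, B or C.

    rows: iterable of (ascending_number, myos_code, smiles) tuples.
    Returns {smiles: (case_label, [ascending_numbers])}.
    """
    by_smiles = defaultdict(list)
    for entry, code, smiles in rows:
        by_smiles[smiles].append((entry, code))

    report = {}
    for smiles, hits in by_smiles.items():
        if len(hits) < 2:
            continue  # structure appears once, nothing to resolve
        codes = [code for _, code in hits]
        # Everything before the final underscore is the XXXXX_YY prefix.
        prefixes = {code.rsplit("_", 1)[0] for code in codes}
        if len(prefixes) > 1:
            label = "A"  # different prefixes: wrong SMILES, or codes to merge
        elif len(set(codes)) == len(codes):
            label = "B"  # same prefix, distinct batch suffixes: resynthesis
        else:
            label = "C"  # identical code on several rows: re-test
        report[smiles] = (label, [entry for entry, _ in hits])
    return report


# Made-up rows with placeholder SMILES, for illustration only.
example_rows = [
    (2, "MYOS_00001_00_01", "CCO"),
    (3, "MYOS_00001_00_02", "CCO"),
    (19, "MYOS_00015_00_01", "CCN"),
    (20, "MYOS_00015_00_01", "CCN"),
    (13, "MYOS_00010_00_01", "CCC"),
    (143, "MYOS_00138_00_01", "CCC"),
]
example_report = classify_duplicates(example_rows)
```

A structure labelled "A" still needs a person to decide whether the SMILES is wrong or the two codes should be merged into one prefix with different batch numbers.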

Since many of these duplicates have not been resolved, can you have a look at this issue @dmitrij176? Particularly if it's an incorrect SMILES string (case A), which was my main concern. Otherwise, can you give them the appropriate batch numbers (case B) or just confirm that case C has occurred?

For instance, we know 143 is a resynthesis, so the pair [13 and 143] can be resolved by changing the codes to either [MYOS_00010_00_01 and MYOS_00010_00_02] or [MYOS_00138_00_01 and MYOS_00138_00_02].

Tagging @OpenSourceMycetoma/corecontrib to keep everyone up to date.

[Image: Duplicate MYOS codes]

mattodd commented 3 years ago

Thanks for laying all this out @fantasy121 - important we clarify these and amend the codes. I guess we need to ensure that if anyone is adding compounds to the master list that they search on the SMILES first to check if the compound is already present, and we ought to add that to a "tech-ops" page, if we don't already have it.

For compounds that are tested twice on the same batch, I don't mind how this is dealt with. Either two rows or another column. Happy to hear other views.

dmitrij176 commented 3 years ago

Hi @fantasy121. Thanks for these suggestions; I have made all the necessary corrections. In some cases, duplicates arose because of a lack of information (eg. enantiomeric forms, racemates etc.) that was only added after the list was created. I know this because I have backup versions of the Excel files. I will go through each of the 11 problems individually and explain what has happened:

  1. MYOS_00001_00_01/02 (entries 2, 3) and entry 151 (DM4-3B): indeed they refer to the same compound, but DM4-3B was registered as a separate entry in the Fenarimol List with a unique code P4_A_DM43B, and that's why it initially appeared separately in the Excel sheet as well. Now it's the third batch (MYOS_00001_00_03).

  2. Entries 19, 20: as you correctly said, these refer to the same batch, which was tested twice. We knew this before, but at the moment we haven't resolved how to mark that correctly under the existing coding system. @mattodd any thoughts?

  3. Entries 5,6: correct

  4. Entries 42, 80: duplicates; corrected

  5. Entries 48, 49: two separate molecules, one with an incorrect SMILES code. BS0400: wrong SMILES; BS0407: correct structure. Both entries resolved.

  6. Entries 55, 102: duplicates; corrected.

  7. Entries 36, 100: duplicates; corrected.

  8. Entries 71, 126: duplicates; corrected.

  9. Entries 7, 122: S-enantiomers, corrected.

  10. Entries 13, 143: Racemic mixtures. Same situation with the Fenarimol list as in (1). My entry appeared under the code P4_F_DM16. Now resolved.

  11. Entries 131, 132: I checked these again and there are 2 important things to note: in the Epichem library they are recorded as two separate entries (EPL-BS1482, EPL-BS1483), but at the same time they have identical SMILES codes. No additional information is available, and these 2 are the only unresolved entries in the table. I assume there must be a reason why they are separate, unless there is a mistake. But again, I double-checked the Epichem library and can confirm that they are separate entries. Shall we leave them as they are (separate) or merge the MYOS codes?

fantasy121 commented 3 years ago

Thanks @dmitrij176 for updating the Master List codes.

I vote for re-tested compounds receiving the same MYOS_XXXXX_YY_ZZ code as the first time they were tested, but being put on a separate line (thus getting new ascending numbers). @mattodd

I put together a brief guide to help enter new MYOS codes and avoid duplicates in the future, which reads:

MYOS code entry guide: Before submitting a new MYOS code for your compound, please search this List to see if your compound already exists, and name it according to the convention. You can search for potential duplicates by looking up SMILES strings.

Reminder on naming convention for MYOS codes:

Format: MYOS_XXXXX_YY_ZZ

(i) The XXXXX 5-digit code is assigned to unique chemical structures. For this naming purpose, stereoisomers of a compound are considered unique structures. (eg: racemic mix and enantiomerically pure versions of a compound will receive unique XXXXX codes.)

(ii) YY is the salt form

(iii) ZZ is the batch number suffix and starts from 01. A compound that is re-synthesised will receive a unique incremental batch number suffix ZZ but the same XXXXX_YY prefix as the compound from the first batch, if its structure and salt form are the same as the first batch compound. Otherwise, assign a new XXXXX and/or YY, whichever differs from the first batch; the ZZ then resets to "01".

(iv) If the same batch of the same compound is re-tested for biological activity, the result will be put on a separate line. The re-tested compound will receive the same MYOS code as the first time it was tested, but will acquire a new ascending number (as it is on a new line).
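To make rules (i) to (iii) concrete, here is a minimal sketch of how the next code for a new entry could be worked out. Several things in it are assumptions rather than part of the convention: the salt form YY is treated as a two-digit number, a brand-new structure simply takes the highest existing XXXXX plus one, and two structures count as the same only if their SMILES strings (including stereochemistry) match exactly. The helper names `parse_myos`, `format_myos` and `next_code` are hypothetical, and the example list uses placeholder SMILES.

```python
import re

# MYOS_XXXXX_YY_ZZ: 5-digit structure, 2-digit salt form, 2-digit batch.
MYOS_RE = re.compile(r"MYOS_(\d{5})_(\d{2})_(\d{2})")


def parse_myos(code):
    """Split a MYOS code into (structure, salt, batch) integers."""
    match = MYOS_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a MYOS code: {code!r}")
    structure, salt, batch = (int(group) for group in match.groups())
    return structure, salt, batch


def format_myos(structure, salt, batch):
    """Build a MYOS code string from its three numeric parts."""
    return f"MYOS_{structure:05d}_{salt:02d}_{batch:02d}"


def next_code(smiles, salt, existing):
    """Suggest a code for a newly synthesised batch.

    existing: list of (myos_code, smiles) pairs already in the list.
    """
    parsed = [(parse_myos(code), s) for code, s in existing]
    same_structure = [parts for parts, s in parsed if s == smiles]
    if not same_structure:
        # Rule (i): unseen structure gets a new XXXXX (assumed: max + 1).
        new_structure = max((parts[0] for parts, _ in parsed), default=0) + 1
        return format_myos(new_structure, salt, 1)
    structure = same_structure[0][0]
    # Rule (iii): same structure and salt form increments ZZ;
    # a new salt form has no batches yet, so ZZ starts again at 01.
    batches = [parts[2] for parts in same_structure if parts[1] == salt]
    return format_myos(structure, salt, max(batches, default=0) + 1)


# Placeholder list for illustration only.
existing_codes = [
    ("MYOS_00001_00_01", "CCO"),
    ("MYOS_00001_00_02", "CCO"),
    ("MYOS_00003_00_01", "CCN"),
]
resynthesis_code = next_code("CCO", 0, existing_codes)
```

Rule (iv) needs no new code at all: a re-test reuses the existing code on a new line, so `next_code` would only be called for a newly synthesised batch.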

I put this as a highlighted comment on the Master List