changes to ease «probing the software»

nbehrnd commented 4 years ago

The changes touch the «title page» and the test data only. The aim was to provide just enough information that a potentially interested reader knows

- about the simultaneous presence of categorical no-go criteria and the scaled demerits,
- that there are multiple ways to get the program to work, and
- that the repository includes test data (e.g., with data of table S3 of the SI, offering a tutorial-like replication of the scrutiny at smaller scale in addition to the substantially larger PubChem set).

nbehrnd commented 4 years ago

Prior to this pull request, the original test set, example_molecules.smi, contained some entries more than once. This is visible, e.g., by loading the file in the Emacs editor, requesting an alphabetic sort of the lines (which considers the entire line, i.e. SMILES string + PubChem identifier) with the chords C-x h to mark all entries followed by M-x sort-lines, and then removing the duplicates (chords C-x h followed by M-x delete-duplicate-lines).
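For readers without Emacs at hand, the same «sort, then drop identical lines» step can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure used for this pull request; the function name `dedupe_lines` and the sample identifiers are made up for the example.

```python
def dedupe_lines(lines):
    """Return the sorted, unique, non-empty lines.

    Two entries count as duplicates only if the complete line
    (SMILES string + spacing + PubChem identifier) is identical.
    """
    unique = {line.rstrip("\n") for line in lines if line.strip()}
    return sorted(unique)


# Hypothetical sample entries, not taken from example_molecules.smi.
sample = [
    "CCO  PBCHM1\n",
    "CCO  PBCHM1\n",   # exact repeat, dropped
    "CCO  PBCHM2\n",   # same SMILES, different identifier, kept
    "CCN  PBCHM3\n",
]

result = dedupe_lines(sample)
for line in result:
    print(line)
```

Because whole lines are compared, neither the SMILES syntax nor the spacing between SMILES and label is altered.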

Similar to CAS registry numbers, a structure may be assigned more than one PubChem identifier.[1] Thus, I opted for a filter which is structure sensitive and used the --unique cansmi[2] option of openbabel (version 3.0.0, released by Apr 6 2020 in the repositories of Debian 10 (bullseye) / branch testing), which compares structures by their canonical SMILES strings, including the stereochemical information.
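A call of this kind might look as follows when driven from Python. This is a sketch under assumptions: it presumes the `obabel` executable is on the PATH, the output file name `unique.smi` is hypothetical, and the helper names are invented for the example. The exact command line used for this pull request is not recorded in this thread.

```python
import shutil
import subprocess


def obabel_unique_command(infile, outfile, mode="cansmi"):
    """Build an obabel call that drops duplicates according to `mode`.

    `mode` is the keyword passed to --unique, e.g. "cansmi" or "title".
    """
    return ["obabel", infile, "-O", outfile, "--unique", mode]


def run_if_available(command):
    """Run the command only if the executable is installed; else return None."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, check=True, capture_output=True, text=True)


# "unique.smi" is a hypothetical output name.
command = obabel_unique_command("example_molecules.smi", "unique.smi")
print(" ".join(command))
```

Note that openbabel rewrites the SMILES strings on output, so the result is not a line-for-line subset of the input file.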

The number of entries (still) to be checked in example_molecules.smi is lowered quite a bit. Starting with 30810 entries, there are now only 22981 (-7829 entries, or about 25% fewer). For comparison, the purely string-based simplification with Emacs shortens the list by 5824 entries (ca. 19%) to 24986.
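The percentages follow directly from the entry counts quoted above; a short check (the helper name `reduction` is made up for the example):

```python
def reduction(before, after):
    """Return (entries removed, percentage removed relative to `before`)."""
    removed = before - after
    return removed, 100.0 * removed / before


total = 30810

# Structure-based filter (openbabel, --unique cansmi)
removed_cansmi, percent_cansmi = reduction(total, 22981)

# Purely string-based filter (Emacs, complete lines)
removed_lines, percent_lines = reduction(total, 24986)

print(removed_cansmi, round(percent_cansmi, 1))
print(removed_lines, round(percent_lines, 1))
```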

It is not so much about accelerating the work of the program here (albeit the readout of time, prepended to the instruction relayed to Ruby (2.7.1p83) in Linux Debian 10, shows some change between the old .smi and the revised .smi). It is more about a later analysis, e.g., the ratio of structures good enough to proceed further over the total number of structures submitted to the scrutiny. Because SMILES don't store the atoms' positional information the way .cif does, I speculate that occasionally losing a PubChem identifier does not hurt much here. (It could be prevented, e.g., by providing the different keyword --unique title to openbabel, which then retains 24975 entries.)

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702940/
[2] https://openbabel.org/wiki/--unique

IanAWatson commented 4 years ago

This looks good except for one important thing.

Agree that getting the duplication out of the examples is a good thing to do, we should do that. But, when you use OpenBabel to do uniqueness, what gets written contains aromatic forms of various rings - Open Babel's unique smiles of course. But this software will not necessarily read all the aromatic forms that OpenBabel produces. Aromaticity is one of those things that is not well defined, so different software packages can make their own decisions about what can, and cannot be aromatic. Mostly if a certain package decides that a ring, or ring system, is aromatic then it will be able to read that. But there is no guarantee that a different piece of chemistry software will be able to read that aromatic smiles. I am sure each package has both good and bad decisions made with respect to aromaticity.

So, instead of the aromatic smiles that are currently in example_molecules.smi, could you fetch back the non-aromatic smiles that were in the original example file? Keeping the unique set of identifiers that you have identified is definitely a good idea (curiously there is 1 duplicate id, PBCHM73875), but please let's not use aromatic smiles. When you are done, there should be no lowercase c, n, o etc. in the smiles file.
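Both checks (no lowercase aromatic atoms, no repeated identifiers) can be scripted. The following is a rough sketch, not a SMILES parser: it assumes that outside square brackets the only two-letter element symbols are Cl and Br, and that each line reads «SMILES, whitespace, identifier». The function names and the sample identifiers are invented for the example.

```python
import re
from collections import Counter

BRACKET_ATOM = re.compile(r"\[[^\]]*\]")
BRACKET_SYMBOL = re.compile(r"\[\d*([A-Za-z][a-z]?)")


def has_aromatic_atoms(smiles):
    """Heuristic: True if the SMILES contains lowercase (aromatic) atom symbols."""
    # Bracket atoms such as [nH] or [se]: aromatic if the symbol starts lowercase.
    for atom in BRACKET_ATOM.findall(smiles):
        match = BRACKET_SYMBOL.match(atom)
        if match and match.group(1)[0].islower():
            return True
    # Outside brackets, drop Cl and Br so their second letter is not miscounted.
    rest = BRACKET_ATOM.sub("", smiles).replace("Cl", "").replace("Br", "")
    return any(char in "bcnops" for char in rest)


def duplicate_ids(lines):
    """Return the identifiers (second column) that occur on more than one line."""
    ids = [parts[1] for parts in (line.split() for line in lines) if len(parts) >= 2]
    return sorted(name for name, count in Counter(ids).items() if count > 1)


# Hypothetical entries for illustration only.
entries = ["c1ccccc1  PBCHM1", "C1=CC=CC=C1  PBCHM1", "ClCCBr  PBCHM2"]
aromatic = [line for line in entries if has_aromatic_atoms(line.split()[0])]
repeated = duplicate_ids(entries)
```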

Or if you prefer, I can do this with what I have downloaded... Either way...

I also like the cleanup on the known drugs that you have done too.

Thanks!

Ian

nbehrnd commented 4 years ago

While the documentation has a line #51, "Note that the software is set up to ignore smiles it cannot interpret.", I previously did not understand this as "aromaticity as understood e.g. by openbabel (the lowercase characters) needn't be the same as understood by this program". Thus, I should inspect my deduplication attempts with the list example_molecules.smi as input, and their outcome with the program. One element could be just a wc -l *.smi to count how many of the submitted entries eventually are binned in either ok.smi or one of bad?.smi.
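A Python counterpart of that line count, as a minimal sketch: it counts non-empty lines per .smi file (so it may differ from wc -l for files with blank lines or without a trailing newline) and derives the fraction that ended up in ok.smi. The function names are made up; only the file names ok.smi and bad?.smi come from the discussion above.

```python
from pathlib import Path


def count_entries(directory, pattern="*.smi"):
    """Map each matching file name to its number of non-empty lines."""
    counts = {}
    for path in sorted(Path(directory).glob(pattern)):
        with path.open() as handle:
            counts[path.name] = sum(1 for line in handle if line.strip())
    return counts


def ok_fraction(counts, ok_name="ok.smi"):
    """Fraction of binned entries (ok.smi plus bad?.smi) that landed in ok.smi."""
    binned = {
        name: number
        for name, number in counts.items()
        if name == ok_name or name.startswith("bad")
    }
    total = sum(binned.values())
    return binned.get(ok_name, 0) / total if total else 0.0
```

Usage would be along the lines of `counts = count_entries(".")` followed by `ok_fraction(counts)`, run in the directory holding the program's output.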

Local probing yesterday, round-tripping the structures (.smi -> .sdf -> .smi) with openbabel alone or with (openbabel .AND. DataWarrior), indeed revealed to me some unexpected limitations, e.g. either the change of

c12-c3c([C+](c2cccc1)C)cccc3  PBCHM10039208
c12-c3c([C+](c2cccc1)C)cccc3  PBCHM10039209
n12c(-c3ccccc3NNC2)cnc1  PBCHM122592431

into

> c12-c3c([C+](c1cccc2)C)cccc3  PBCHM10039208
> c12-c3c([C+](c1cccc2)C)cccc3  PBCHM10039209
> n12c(-c3ccccc3NNC1)cnc2  PBCHM122592431

which chemically speaking shouldn't matter much, or, in contrast, a sudden introduction of stereochemical descriptors (depending on an intermediate .sdf in version 2 or version 3). With these experiences in hand, I see a deduplication which accounts for the complete line (SMILES string .AND. PubChem identifier) as a plausible compromise. It retains both the SMILES syntax and the explicit spacing between SMILES and label, and is still efficient (about one in five lines affected), as seen on the original list with lines just alphabetically sorted vs. after deletion of duplicate / triplicate lines, e.g.:

[Image: vimdiff_sort_sort_Emacs]

This approach will retain entries like PBCHM32961, described once "pure" (O(N=c1c2ccccc2ccc2ccccc12)CCCN(CC)CC) and (at least) once together with HClO4, which mirrors your finding for PBCHM73875. It will also allow the simultaneous presence of PBCHM2240870 and PBCHM2240871, both describing S(C)[C@@]1(NC2CCCCC2)NNC(=C)C(=O)N1.
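To see which identical SMILES strings survive a line-by-line deduplication under more than one identifier, the entries can be grouped by their first column. This is a sketch of such a check, assuming the «SMILES, whitespace, identifier» layout; the function name `ids_by_smiles` is invented, and the comparison is on the literal string, not on the chemical structure.

```python
from collections import defaultdict


def ids_by_smiles(lines):
    """Return {SMILES string: [identifiers]} for strings carrying several ids."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            groups[parts[0]].append(parts[1])
    return {smiles: ids for smiles, ids in groups.items() if len(ids) > 1}


entries = [
    "S(C)[C@@]1(NC2CCCCC2)NNC(=C)C(=O)N1  PBCHM2240870",
    "S(C)[C@@]1(NC2CCCCC2)NNC(=C)C(=O)N1  PBCHM2240871",
    "O(N=c1c2ccccc2ccc2ccccc12)CCCN(CC)CC  PBCHM32961",
]

shared = ids_by_smiles(entries)
for smiles, ids in shared.items():
    print(smiles, ids)
```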

Attachments: sort.smi.txt, sort_Emacs.smi.txt

nbehrnd commented 4 years ago

This pull request is going to be retracted in favour of the deduplication based on «comparing line-by-line».

IanAWatson / Lilly-Medchem-Rules

changes to ease «probing the software» #14