aspuru-guzik-group / selfies

Robust representation of semantically constrained graphs, in particular for molecules in chemistry
Apache License 2.0
656 stars 127 forks source link

Table 2 rules #18

Closed woodg07 closed 4 years ago

woodg07 commented 4 years ago

Hello authors,

I tried following the rules in Table 2 in a naive way i.e. just following the grammar and I seemed to produce unphysical molecules. So for example:

X0 -> OX2 -> OOX1 -> OO=O

Could someone clarify what is happening here? The other question I had is with the code. Do the numbers in the Ring and Branch "letters" mean anything? I.e. do the numbers 1 and 3 have specific meanings: [Branch1_3] versus [Branch3_1] , and for rings it always seems to be [Ring1] in most examples that I've run. Does the "1" in ring1 become superfluous in this case or agains does it have a specific meaning? Thanks!

MarioKrenn6240 commented 4 years ago

Hi woodg07! Thanks for your interest. First about the oxygen example:

Let's say the SELFIES is [O][O][=O] as you describe, then: X0 -> O X2 -> OO X1 -> OOO the last derivation step is shown in the second line of Fig2, 3rd column, where X1 and [=O] cross. So there is no bonding violation.

About question 2, yes, those numbers have a meaning. [BranchK_L], where K gives how many subsequent letters are used for identifying the branch size. Selfies is base20, thus if you want to make a longer branch, you need to identify with a second number. below 20^2=400, you can use [Branch2_L], and below 20^3=8000 you can use [Branch3_L]. (if you have larger molecules, contact me, we can add more).

L stands for the type of bond distribution for the branch. At a branch, you have two outgoing atoms (thinking string based, not chemical), if you have the restriction that you can use 4 bonds for the next symbol, then you could distribute it as follows: 1) one to branch, three continuing the main string. Smiles example that could be generated: CCC(CCC)#C 2) two to branch, two to main string. Smiles example that could be generated: CCC(=CCC)=C 3) three to branch, one to main string. Smiles example that could be generated: CCC(#CCC)C

hope that helps -- otherwise ask again with more details or concrete examples.

Btw, we are about to publish a bit re-do of the code and a detailed documentary in 2-3weeks, there we will explain all of these techniques.

best regards, Mario