Translating SMILES to SELFIES

oriondollar commented 3 years ago

Hi all,

Very nice paper, I'm looking forward to learning more about the SELFIES representation and using them in my own generative models. I'm wondering if you could provide an example of translating a SMILES (or a molecular graph) to a SELFIES string? The example of going from SELFIES to SMILES was helpful but I am having trouble developing an intuitive sense for how certain SELFIES tokens are chosen. For instance, if we take this aromatic molecule from the ZINC dataset with an input SMILES of Clc1ccccc1-c1nc(-c2ccncc2)no1

the resulting SELFIES is [Cl][C][=C][C][=C][C][=C][Ring1][Branch1_2][C][=N][C][Branch1_1][Branch2_2][C][=C][C][=N][C][=C][Ring1][Branch1_2][=N][O][Ring1][O].

My assumption was that [Ring] tokens were placed before the atoms in that ring but that does not seem to be the case. I am also unclear on the meaning of having two consecutive [Branch] tokens and why there are two [O] tokens when there is only one oxygen atom in the molecule. Any tips for developing my intuition for SELFIES would be greatly appreciated!

MarioKrenn6240 commented 3 years ago

Hi oriondollar, thanks for your question.

A main idea of SELFIES is that rings and branches are each defined at a single location, using the [RingN] or [BranchN_M] symbols, respectively. The ring or branch starts at that location. We then need to define how long the ring or branch is. For that, we take the subsequent N symbols in the SELFIES string and interpret them as numbers according to the following index table: [image: table assigning an index to each SELFIES symbol]

If N=1 (for instance Ring1), then we use the next symbol, which gives us the size of the ring. If we have Ring2, then we use the next two symbols and interpret them as a base-16 number.
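A minimal sketch of that lookup, in case it helps. The INDEX dictionary holds only the three index values quoted in this thread, not the full table, and `read_size` is a hypothetical helper name, not part of the selfies package. Reading the symbols most-significant-first and applying the +1 to multi-symbol numbers as well are assumptions here.

```python
# Index values quoted in this thread only; the complete table is not reproduced.
INDEX = {"[Branch1_2]": 4, "[Branch2_2]": 7, "[O]": 9}

def read_size(index_symbols, base=16):
    """Turn the N symbols after [RingN] / [BranchN_M] into a size.

    Assumption: symbols are base-16 digits, most significant first,
    and the size is the resulting number plus one.
    """
    value = 0
    for symbol in index_symbols:
        value = value * base + INDEX[symbol]
    return value + 1

print(read_size(["[Branch1_2]"]))          # one symbol: 4 + 1
print(read_size(["[O]"]))                  # one symbol: 9 + 1
print(read_size(["[Branch1_2]", "[O]"]))   # two symbols: 4*16 + 9 + 1
```

The last call only illustrates the two-symbol case; that combination does not appear in the example molecule.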

The main advantage is that rings and branches are defined locally, so no mistake can happen if, for instance, the model "forgets" to close a branch. Furthermore, we don't introduce new symbols to the SELFIES alphabet (such as numbers), so there are no additional sources of invalidity.

In your example, [Cl][C][=C][C][=C][C][=C][Ring1][Branch1_2]... means that a ring is started at this location, and the symbol after Ring1 ([Branch1_2]) is interpreted as a number. If you look at the table, you see that [Branch1_2] has index=4, so the ring-size=index+1=5. That means we make a ring with the 5th previous atom (which is the first carbon in the string).

Similarly, ...[C][Branch1_1][Branch2_2][C][=C]..., here we start a branch, and interpret the next symbol as a number. The next symbol is [Branch2_2], looking into the table gives index([Branch2_2])=7, the size of the branch is index+1=8. So the next 8 symbols are part of the branch.
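To make the branch step concrete, here is a small sketch that splits the example string into symbols and slices out the 8 symbols belonging to the branch. It uses a plain regular expression rather than the selfies package, and the value 7 for [Branch2_2] is the one quoted above.

```python
import re

selfies = ("[Cl][C][=C][C][=C][C][=C][Ring1][Branch1_2][C][=N][C]"
           "[Branch1_1][Branch2_2][C][=C][C][=N][C][=C][Ring1][Branch1_2]"
           "[=N][O][Ring1][O]")

# Split the string into its bracketed symbols.
tokens = re.findall(r"\[[^\]]*\]", selfies)

start = tokens.index("[Branch1_1]")   # the branch starts here
size = 7 + 1                          # index([Branch2_2]) + 1, as quoted above

# Skip the branch symbol and its one index symbol, then take `size` symbols.
branch = tokens[start + 2 : start + 2 + size]
rest = tokens[start + 2 + size :]

print("branch:", "".join(branch))
print("after branch:", "".join(rest))
```

The branch slice is the pyridine ring, including its own [Ring1][Branch1_2] closure; the symbols after it continue the main chain.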

Similarly, in ...[O][Ring1][O], the ring-size is defined by the symbol after Ring1, which is [O]. Looking at the table, index([O])=9, thus the ring connects to the 10th previous atom. So the second [O] is not an oxygen atom; it only encodes the ring size.
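Putting the three cases together, the following rough sketch walks the whole example string and reports which atoms each ring symbol connects. It is not the selfies decoder: it ignores bond orders and valence rules, it knows only the three index values quoted in this thread, and the function name `ring_bonds` is made up for illustration. It assumes that "previous atom" means counting atoms in the order they appear, branch atoms included, which matches the counts given above.

```python
import re

# Only the index values quoted in this thread (not the full table).
INDEX = {"[Branch1_2]": 4, "[Branch2_2]": 7, "[O]": 9}

def ring_bonds(selfies):
    """Return (atoms, bonds); bonds are (earlier_atom, current_atom) positions."""
    tokens = re.findall(r"\[[^\]]*\]", selfies)
    atoms, bonds = [], []
    i = 0
    while i < len(tokens):
        match = re.match(r"\[(Ring|Branch)(\d)", tokens[i])
        if match is None:
            atoms.append(tokens[i])      # an ordinary atom symbol
            i += 1
            continue
        n = int(match.group(2))          # how many index symbols follow
        size = 0
        for symbol in tokens[i + 1 : i + 1 + n]:
            size = size * 16 + INDEX[symbol]
        size += 1
        if match.group(1) == "Ring":
            current = len(atoms) - 1
            bonds.append((current - size, current))
        i += 1 + n                       # index symbols are consumed, not atoms
    return atoms, bonds

selfies = ("[Cl][C][=C][C][=C][C][=C][Ring1][Branch1_2][C][=N][C]"
           "[Branch1_1][Branch2_2][C][=C][C][=N][C][=C][Ring1][Branch1_2]"
           "[=N][O][Ring1][O]")
atoms, bonds = ring_bonds(selfies)
print(len(atoms), "atoms,", atoms.count("[O]"), "oxygen")
for a, b in bonds:
    print(f"ring bond: atom {a} {atoms[a]} <-> atom {b} {atoms[b]}")
```

Because the symbol after each [Ring1] or [Branch1_1] is consumed as a number, the final [O] never reaches the atom list, which is why the molecule still has a single oxygen.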

More information about this can be found in this part of the Documentation. If you have more questions, feel free to ask!

oriondollar commented 3 years ago

Ah I see, this is very helpful. I somehow missed that readthedocs page as well. Thank you for the explanation! I'll let you know if anything else comes up.

aspuru-guzik-group / selfies

Translating SMILES to SELFIES #43