aspuru-guzik-group / selfies

Robust representation of semantically constrained graphs, in particular for molecules in chemistry
Apache License 2.0

Index symbol customization questions/enhancement suggestions #69

Open csnbritt opened 2 years ago

csnbritt commented 2 years ago

Hi all - I have some questions and enhancement suggestions regarding the index symbols. Specifically, I'm wondering whether reusing the same tokens both to represent atoms/rings/branches and to calculate the state Q makes the syntax more difficult for a neural network to learn, and whether it would be possible to customize which symbols represent which indices.

Motivation: Per the readthedocs the current SELFIES index symbol list is the following:

| Index | Symbol | Index | Symbol |
| -- | -- | -- | -- |
| 0 | [C] | 8 | [Branch2_3] |
| 1 | [Ring1] | 9 | [O] |
| 2 | [Ring2] | 10 | [N] |
| 3 | [Branch1_1] | 11 | [=N] |
| 4 | [Branch1_2] | 12 | [=C] |
| 5 | [Branch1_3] | 13 | [#C] |
| 6 | [Branch2_1] | 14 | [S] |
| 7 | [Branch2_2] | 15 | [P] |

Because the same tokens are reused both for determining the state Q and for representing atoms, I wonder whether the syntax becomes harder for a neural network to learn. For example, an embedding layer might try to embed [C] and [=C] close together because both represent carbon atoms, while at the same time trying to embed them far apart because these tokens also represent indices that are far apart (0 vs. 12). These tokens represent different things depending on context, and eliminating this context dependence could lead to better performance on translation/representation tasks. Additionally, the optimal set of symbols for representing indices may differ depending on the frequency of tokens within a given dataset. For example, a generative model trained on a dataset without phosphorus or sulfur atoms may incidentally generate molecules that contain these atoms, because decoding errors can place these tokens in positions where they are not being used to determine the state Q.
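To make the context dependence concrete, here is a minimal sketch in plain Python (it does not call the `selfies` package). It uses the index table quoted above and assumes a simplified reading rule: a `[Ring1]` or `[Branch1_*]` token is followed by one index token, and a `[Ring2]` or `[Branch2_*]` token by two. The helper names `split_tokens`, `n_index_tokens`, and `annotate_roles` are hypothetical and only for illustration.

```python
import re

# Index table copied from the post above (assumed to match the readthedocs list).
INDEX_TABLE = {
    "[C]": 0, "[Ring1]": 1, "[Ring2]": 2,
    "[Branch1_1]": 3, "[Branch1_2]": 4, "[Branch1_3]": 5,
    "[Branch2_1]": 6, "[Branch2_2]": 7, "[Branch2_3]": 8,
    "[O]": 9, "[N]": 10, "[=N]": 11, "[=C]": 12, "[#C]": 13,
    "[S]": 14, "[P]": 15,
}


def split_tokens(selfies_string):
    """Split a SELFIES string into its bracketed tokens."""
    return re.findall(r"\[[^\]]*\]", selfies_string)


def n_index_tokens(token):
    """Number of following tokens read as indices (simplifying assumption)."""
    match = re.fullmatch(r"\[(?:Ring|Branch)(\d)(?:_\d)?\]", token)
    return int(match.group(1)) if match else 0


def annotate_roles(selfies_string):
    """Tag each token as an index (with its value) or as an atom/structure token."""
    roles = []
    pending = 0  # how many upcoming tokens are still to be read as indices
    for tok in split_tokens(selfies_string):
        if pending > 0:
            # Same token vocabulary, but here only its index value matters.
            roles.append((tok, "index", INDEX_TABLE.get(tok)))
            pending -= 1
        else:
            roles.append((tok, "atom_or_structure", None))
            pending = n_index_tokens(tok)
    return roles


for entry in annotate_roles("[C][=C][C][=C][C][=C][Ring1][Branch1_2]"):
    print(entry)
```

In the benzene string the final `[Branch1_2]` is tagged as an index with value 4, whereas the same token elsewhere would open a branch. That dual role is what the embedding concern is about.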

I understand that reusing the tokens to calculate the state Q allows for the 100% validity of SELFIES, but I'm wondering if it's possible to maintain this 100% validity while making the syntax more easily learned, and possibly customizable based on user needs. My questions are therefore:

  1. How were the symbols that represent each index decided? Frequency based within a large dataset?
  2. Would it be possible to allow users to customize which tokens represent which indices?
  3. Would it be possible to use characters that specifically represent indices (i.e. index tokens) when encoding a SMILES string into a SELFIES, but internally convert these index tokens back into tokens that don't break the 100% validity of SELFIES when decoding back to SMILES?

==== EXAMPLE ===

Below is an example to illustrate what I mean by index tokens. A new index symbol table might look like this:

| Index | Internal Symbol | External Symbol | Index | Internal Symbol | External Symbol |
| -- | -- | -- | -- | -- | -- |
| 0 | [C] | [Index0] | 8 | [Branch2_3] | [Index8] |
| 1 | [Ring1] | [Index1] | 9 | [O] | [Index9] |
| 2 | [Ring2] | [Index2] | 10 | [N] | [Index10] |
| 3 | [Branch1_1] | [Index3] | 11 | [=N] | [Index11] |
| 4 | [Branch1_2] | [Index4] | 12 | [=C] | [Index12] |
| 5 | [Branch1_3] | [Index5] | 13 | [#C] | [Index13] |
| 6 | [Branch2_1] | [Index6] | 14 | [S] | [Index14] |
| 7 | [Branch2_2] | [Index7] | 15 | [P] | [Index15] |

benzene SMILES: c1ccccc1

current benzene SELFIES: [C][=C][C][=C][C][=C][Ring1][Branch1_2]

With the new index table, rather than representing the state Q with token [Branch1_2] in the SELFIES string, we could represent it with [Index4], so the new benzene SELFIES that is encoded with index tokens would be: [C][=C][C][=C][C][=C][Ring1][Index4]

When decoding any of these SELFIES back into SMILES, all "external index symbols" could first be replaced with the corresponding "internal index symbols" so that the 100% validity is maintained. Networks may, however, learn a syntax that uses dedicated index symbols more easily, because each token corresponds to only one action rather than representing either a state calculation or an atom depending on the context.
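The translation step described here could be sketched roughly as follows. This is plain Python string handling, not part of the `selfies` API; the function names `to_external` and `to_internal` are hypothetical, and the sketch assumes that `[Ring1]`/`[Branch1_*]` are followed by one index token and `[Ring2]`/`[Branch2_*]` by two.

```python
import re

# Internal symbols in index order, taken from the table in this post.
INTERNAL_SYMBOLS = [
    "[C]", "[Ring1]", "[Ring2]", "[Branch1_1]", "[Branch1_2]", "[Branch1_3]",
    "[Branch2_1]", "[Branch2_2]", "[Branch2_3]", "[O]", "[N]", "[=N]",
    "[=C]", "[#C]", "[S]", "[P]",
]
INTERNAL_TO_EXTERNAL = {sym: f"[Index{i}]" for i, sym in enumerate(INTERNAL_SYMBOLS)}
EXTERNAL_TO_INTERNAL = {ext: sym for sym, ext in INTERNAL_TO_EXTERNAL.items()}

# Tokens that announce index tokens; the digit is how many follow (assumption).
OPENER = re.compile(r"\[(?:Ring|Branch)(\d)(?:_\d)?\]")


def split_tokens(selfies_string):
    """Split a SELFIES string into its bracketed tokens."""
    return re.findall(r"\[[^\]]*\]", selfies_string)


def to_external(selfies_string):
    """Rewrite tokens in index positions as dedicated [IndexN] tokens."""
    out = []
    pending = 0  # number of upcoming tokens that sit in index positions
    for tok in split_tokens(selfies_string):
        if pending > 0:
            # Tokens outside the table are left unchanged.
            out.append(INTERNAL_TO_EXTERNAL.get(tok, tok))
            pending -= 1
        else:
            out.append(tok)
            match = OPENER.fullmatch(tok)
            pending = int(match.group(1)) if match else 0
    return "".join(out)


def to_internal(selfies_string):
    """Map every [IndexN] token back to its internal symbol before decoding."""
    return "".join(EXTERNAL_TO_INTERNAL.get(tok, tok) for tok in split_tokens(selfies_string))


benzene = "[C][=C][C][=C][C][=C][Ring1][Branch1_2]"
print(to_external(benzene))               # [C][=C][C][=C][C][=C][Ring1][Index4]
print(to_internal(to_external(benzene)))  # back to the original string
```

Because `to_internal` replaces external tokens wherever they occur, a model that emits `[Index4]` in a non-index position would still yield a decodable string, just with `[Branch1_2]` acting as a branch there.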

This external/internal symbol idea could be made customizable by allowing users to define which internal index symbols are mapped to external index symbols.

Final thoughts: Apologies if any of these suggestions/questions have been proposed or answered before, I took a look through the original SELFIES paper and the open and closed issues on github and didn't immediately see anything related.

I don't see how the internal vs. external index symbols could break the syntax, but perhaps there is a reason not to do this that I missed. Additionally, adding the extra step of converting external index symbols into internal index symbols before decoding into SMILES adds complexity to the process for what may be no additional gain in performance. I don't know whether a new syntax that uses tokens specifically for state Q calculation is actually more easily understood by neural networks/humans, but I thought it may be worth testing.

Regarding the customization of the index symbols, one problem that I foresee is that index tables would need to be shared to properly decode SELFIES that use custom tables. Again, additional complexity for an unknown amount of performance gain (if any).
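One way the sharing problem might be handled is to ship the custom table alongside the data as a small JSON document. The sketch below is only an illustration: the `index_symbols` key, the function names, and the fixed table size of 16 (taken from the table in this post) are all assumptions, not an existing SELFIES format.

```python
import json

N_INDEX_SYMBOLS = 16  # size of the table shown in this post (assumption)


def validate_index_table(symbols):
    """Check that a custom table has 16 distinct bracketed symbols, in index order."""
    if len(symbols) != N_INDEX_SYMBOLS:
        raise ValueError(f"expected {N_INDEX_SYMBOLS} symbols, got {len(symbols)}")
    if len(set(symbols)) != len(symbols):
        raise ValueError("each index must map to a distinct symbol")
    for sym in symbols:
        if not (sym.startswith("[") and sym.endswith("]")):
            raise ValueError(f"not a bracketed symbol: {sym!r}")
    return list(symbols)


def dump_index_table(symbols):
    """Serialize a validated table so it can be shared with the encoded strings."""
    return json.dumps({"index_symbols": validate_index_table(symbols)})


def load_index_table(text):
    """Load and validate a table that was written by dump_index_table."""
    return validate_index_table(json.loads(text)["index_symbols"])


default_table = [
    "[C]", "[Ring1]", "[Ring2]", "[Branch1_1]", "[Branch1_2]", "[Branch1_3]",
    "[Branch2_1]", "[Branch2_2]", "[Branch2_3]", "[O]", "[N]", "[=N]",
    "[=C]", "[#C]", "[S]", "[P]",
]
shared = dump_index_table(default_table)
print(load_index_table(shared) == default_table)
```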

MarioKrenn6240 commented 2 years ago

Dear @csnbritt - Thanks a lot for the comment/suggestion, I like it a lot!

I agree with your concern about models that might produce sulfur/phosphorus-containing molecules just because of the indices; that's not ideal. Let me answer your questions:

ad1) Very good question, we actually didn't systematically investigate the order of index symbols. I think there lies a great opportunity for simplifying the language.

ad2) That is a really great idea, and it would easily enable a detailed analysis of your first question. Do you want to give it a try and write this function? It could be developed similarly to the customization of the semantic bond constraints. If you want to help write this subfunction, we could add it to the official 2.0.1 release, and potentially use your insights as a new indexing alphabet. Let me know.

ad3) Very good point, and we thought about this before. The problem with additional letters is the following: What would you do in this case: [C][C][Ring][C][C][C]? It wouldn't have a well-defined meaning. And the purpose of SELFIES is to have a well-defined meaning for every combination of symbols. Thus, overloading the symbols seemed to be the best solution.
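The problem can be illustrated with a small checker for a hypothetical grammar that only accepts dedicated `[IndexN]` tokens in index positions. This is a sketch, not SELFIES behavior: the `[IndexN]` tokens come from the proposal above, the function name is made up, and the example uses `[Ring1]` (assumed to expect one index token) in place of the `[Ring]` in the example string.

```python
import re

INDEX_TOKEN = re.compile(r"\[Index\d+\]")
# Tokens that announce index tokens; the digit is how many follow (assumption).
OPENER = re.compile(r"\[(?:Ring|Branch)(\d)(?:_\d)?\]")


def undefined_positions(selfies_string):
    """List positions where a dedicated-index grammar has no defined meaning."""
    tokens = re.findall(r"\[[^\]]*\]", selfies_string)
    problems = []
    pending = 0  # index tokens still expected after a ring/branch token
    for pos, tok in enumerate(tokens):
        is_index = INDEX_TOKEN.fullmatch(tok) is not None
        if pending > 0:
            if not is_index:
                problems.append((pos, tok, "expected index token"))
            pending -= 1
            continue
        if is_index:
            problems.append((pos, tok, "unexpected index token"))
            continue
        match = OPENER.fullmatch(tok)
        pending = int(match.group(1)) if match else 0
    if pending > 0:
        problems.append((len(tokens), None, "missing index token"))
    return problems


print(undefined_positions("[C][C][Ring1][C][C][C]"))
print(undefined_positions("[C][=C][C][=C][C][=C][Ring1][Index4]"))
```

With overloaded symbols, the first string is still decodable because the `[C]` after the ring token is simply read as index 0. With dedicated index tokens, that string would need an extra rule or be rejected.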

All the best, Mario

PS: @csnbritt would you mind sending me an email? (None is connected to your GitHub account as far as I can see.)

csnbritt commented 2 years ago

Great, thanks for these answers! I'd be happy to take a crack at writing the index symbol customization function.

I see what you mean regarding the ill-defined meaning of some strings if additional letters were used. With the ability to customize index symbols, perhaps the context dependence issue could still be investigated by setting the index symbols to tokens so rare that they would realistically never be used except to determine the state Q. Tokens for crazy isotopes of carbon like [101CHexpl] seem like potentially good, albeit hacky, candidates for this, for three reasons: a network would only ever encounter them as index symbols, errors by networks that generate these tokens in incorrect positions could easily be identified, and carbon isotope tokens in incorrect positions can easily be turned into standard carbon atoms with some postprocessing. I also suspect that for many tasks/datasets, mistakenly generating a carbon token in an incorrect position would be less troublesome than mistakenly generating a ring/branch/sulfur/phosphorus token in an incorrect position, but this is just a hunch.
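The postprocessing step could look roughly like this. It is a sketch in plain Python under a few assumptions: the rare tokens are supplied by the user, `[Ring1]`/`[Branch1_*]` are followed by one index token and `[Ring2]`/`[Branch2_*]` by two, and the function name `replace_stray_index_tokens` is hypothetical.

```python
import re

# Tokens that announce index tokens; the digit is how many follow (assumption).
OPENER = re.compile(r"\[(?:Ring|Branch)(\d)(?:_\d)?\]")


def replace_stray_index_tokens(selfies_string, rare_tokens, replacement="[C]"):
    """Replace rare index-only tokens found outside index positions with a plain atom."""
    out = []
    pending = 0  # number of upcoming tokens that sit in index positions
    for tok in re.findall(r"\[[^\]]*\]", selfies_string):
        if pending > 0:
            # Legitimate index position: keep the token untouched.
            out.append(tok)
            pending -= 1
        elif tok in rare_tokens:
            # Stray rare token where an atom is expected: normalize it.
            out.append(replacement)
        else:
            out.append(tok)
            match = OPENER.fullmatch(tok)
            pending = int(match.group(1)) if match else 0
    return "".join(out)


generated = "[C][101CHexpl][C][Ring1][101CHexpl]"
print(replace_stray_index_tokens(generated, {"[101CHexpl]"}))
```

Counting how many replacements were made per generated string would also give a simple measure of how often a network misplaces index tokens.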

RE email - I've sent you an email and updated my profile so that my address is public.