aspuru-guzik-group / group-selfies

Request for Adding Support for **Monoatomic Groups** in `GroupGrammar` #8

Open Syzseisus opened 1 month ago

Hello, thanks for your wonderful work!

I would like to request adding support for monoatomic groups in the GroupGrammar class.

Currently, the system handles multi-atom groups well by converting them to GroupSELFIES, but single atoms (e.g., [C], [F], [N]) are emitted as individual tokens outside the GroupGrammar vocabulary. As a result, it is hard to extract the full molecular connectivity between groups and individual atoms.

Required Feature:

Introduce monoatomic groups (e.g., fragC, fragF, fragN) into GroupGrammar.vocab so that atoms such as ["B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I", "Li", "Na", "K", "Rb", "Cs", "Fr", "Be", "Mg", "Ca", "Sr", "Ba", "Ra"] can also be processed as groups. Allow these monoatomic groups to be added dynamically, or include them in the essential grammar set, in the same way that groups such as frag65 and frag66 are treated.
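To make the request concrete, here is a minimal sketch of what such monoatomic definitions might look like. It only builds name/pattern strings in the `*1` attachment notation used by the frag65, frag66, and frag68 definitions in the example; it does not call group-selfies. The `fragX` naming, the valence table, and the helper names are my assumptions, and whether GroupGrammar would accept single-atom patterns like these is exactly the open question of this request.

```python
# Hypothetical sketch: build one-atom group patterns as plain strings.
# Nothing here is part of the group-selfies API.

# Elements that SMILES allows without brackets (the organic subset).
ORGANIC_SUBSET = {"B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I"}

# ASSUMPTION: one typical valence per element. P and S are set to their
# lowest common valence; adjust the table if higher valences are needed.
ASSUMED_VALENCE = {
    "B": 3, "C": 4, "N": 3, "O": 2, "P": 3, "S": 2,
    "F": 1, "Cl": 1, "Br": 1, "I": 1,
    "Li": 1, "Na": 1, "K": 1, "Rb": 1, "Cs": 1, "Fr": 1,
    "Be": 2, "Mg": 2, "Ca": 2, "Sr": 2, "Ba": 2, "Ra": 2,
}


def monoatomic_pattern(symbol: str) -> str:
    """Return a one-atom pattern with one '*1' attachment per valence."""
    valence = ASSUMED_VALENCE[symbol]
    atom = symbol if symbol in ORGANIC_SUBSET else f"[{symbol}]"
    # All attachments except the last are written as branches.
    return atom + "(*1)" * (valence - 1) + "*1"


def monoatomic_group_specs(symbols):
    """Map hypothetical group names (fragC, fragF, ...) to their patterns."""
    return {f"frag{s}": monoatomic_pattern(s) for s in symbols}


specs = monoatomic_group_specs(ASSUMED_VALENCE)
for name in ("fragC", "fragF", "fragN"):
    print(name, specs[name])
```

If the library can support this, each name/pattern pair would become one extra vocabulary entry next to the multi-atom groups.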

Motivation:

The main issue arises when trying to extract the molecular connectivity between the subgraphs represented by GroupSELFIES. GroupSELFIES, in essence, represents the original molecular graph by grouping atoms into subgraphs (i.e., groups). The connectivity between group tokens is clearly defined, but for monoatomic tokens like [C], the connectivity remains unclear. This inconsistency makes it difficult to extract subgraph-to-subgraph connectivity in a unified way.

Adding support for monoatomic groups would allow all atoms, even single atoms like [C] and [F], to be treated as subgraphs, ensuring that the connections between subgraphs can be easily traced and understood.

Example:

Here is an example where single atoms are treated separately from the defined groups. Ideally, atoms like [C] and [F] should be included as monoatomic groups within GroupGrammar.vocab so that their connectivity is explicit:

smiles: Cc1ccc(NC(=O)c2ccc(COc3ccc(F)cc3)o2)c(C)c1
GroupSELFIES: [C][:2frag65][=Branch][:0frag68][Ring1][:5frag66][#Branch][F][pop][pop][pop][#Branch][C][pop]
ATOMS
0 C 1/4 bonds filled group_tag=(3, 0)  # 1
1 C 4/4 bonds filled group_tag=(0, 8)
2 C 3/4 bonds filled group_tag=(0, 6)
3 C 3/4 bonds filled group_tag=(0, 4)
4 C 4/4 bonds filled group_tag=(0, 3)
5 N 2/3 bonds filled group_tag=(0, 2)
6 C 4/4 bonds filled group_tag=(0, 1)
7 O 2/2 bonds filled group_tag=(0, 0)
8 C 4/4 bonds filled group_tag=(2, 1)
9 C 3/4 bonds filled group_tag=(2, 0)
10 C 3/4 bonds filled group_tag=(2, 6)
11 C 4/4 bonds filled group_tag=(2, 4)
12 C 2/4 bonds filled group_tag=(1, 12)
13 O 2/2 bonds filled group_tag=(1, 0)
14 C 4/4 bonds filled group_tag=(1, 1)
15 C 3/4 bonds filled group_tag=(1, 2)
16 C 3/4 bonds filled group_tag=(1, 4)
17 C 4/4 bonds filled group_tag=(1, 6)
18 F 1/1 bonds filled group_tag=(4, 0)  # 2
19 C 3/4 bonds filled group_tag=(1, 8)
20 C 3/4 bonds filled group_tag=(1, 10)
21 O 2/2 bonds filled group_tag=(2, 3)
22 C 4/4 bonds filled group_tag=(0, 12)
23 C 1/4 bonds filled group_tag=(5, 0)  # 3
24 C 3/4 bonds filled group_tag=(0, 10)

BONDS
0 -> 1 order=1 group_idxs [0, 3]  # 4
1 -> 2 order=2 group_idxs []
2 -> 3 order=1 group_idxs []
3 -> 4 order=2 group_idxs []
4 -> 5 order=1 group_idxs []
4 -> 22 order=1 group_idxs []
5 -> 6 order=1 group_idxs []
6 -> 7 order=2 group_idxs []
6 -> 8 order=1 group_idxs [0, 2]
8 -> 9 order=2 group_idxs []
9 -> 10 order=1 group_idxs []
10 -> 11 order=2 group_idxs []
11 -> 12 order=1 group_idxs [1, 2]
11 -> 21 order=1 group_idxs []
12 -> 13 order=1 group_idxs []
13 -> 14 order=1 group_idxs []
14 -> 15 order=2 group_idxs []
15 -> 16 order=1 group_idxs []
16 -> 17 order=2 group_idxs []
17 -> 18 order=1 group_idxs [1, 4]  # 5
17 -> 19 order=1 group_idxs []
19 -> 20 order=2 group_idxs []
20 -> 14 order=1 group_idxs []
21 -> 8 order=1 group_idxs []
22 -> 23 order=1 group_idxs [0, 5]  # 6
22 -> 24 order=2 group_idxs []
24 -> 1 order=1 group_idxs []

GROUPS
<Group frag65 O=C(N(C1=C(*1)C(*1)=C(*1)C(*1)=C1*1)*1)*1>
<Group frag66 O(C1=C(*1)C(*1)=C(*1)C(*1)=C1*1)C(*1)(*1)*1>
<Group frag68 C1=C(*1)OC(*1)=C1*1>
<Group C ??>  # 7
<Group F ??>  # 8
<Group C ??>  # 9

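In this example, the lines marked # 1 to # 3 are the single atoms (atoms 0, 18, and 23) that lie outside frag65, frag66, and frag68; the lines marked # 4 to # 6 are the bonds that attach those atoms to the groups; and the lines marked # 7 to # 9 are the group entries they would need, whose definitions are currently unknown (??).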

Conclusion:

By adding support for monoatomic groups, the molecular connectivity between all subgraphs (whether complex groups or individual atoms) can be traced uniformly, greatly simplifying tasks such as graph extraction, reconstruction, and representation.
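As a rough illustration of why this helps, the sketch below reuses the atom and bond rows from the example above. It assumes every atom, including the single C and F atoms, carries a group index (the first element of `group_tag`), and that group indices follow the order of the GROUPS listing. Under those assumptions, subgraph-to-subgraph connectivity is a single pass over the bonds. The names `ATOM_GROUP`, `BONDS`, `GROUP_NAMES`, and `group_edges` are mine and are not part of group-selfies.

```python
# Group index of each atom: the first element of group_tag in the ATOMS dump.
ATOM_GROUP = {
    0: 3, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0,
    8: 2, 9: 2, 10: 2, 11: 2, 12: 1, 13: 1, 14: 1, 15: 1,
    16: 1, 17: 1, 18: 4, 19: 1, 20: 1, 21: 2, 22: 0, 23: 5, 24: 0,
}

# (start atom, end atom) pairs copied from the BONDS dump.
BONDS = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 22), (5, 6), (6, 7),
    (6, 8), (8, 9), (9, 10), (10, 11), (11, 12), (11, 21), (12, 13),
    (13, 14), (14, 15), (15, 16), (16, 17), (17, 18), (17, 19),
    (19, 20), (20, 14), (21, 8), (22, 23), (22, 24), (24, 1),
]

# ASSUMPTION: group indices follow the order of the GROUPS listing.
GROUP_NAMES = ["frag65", "frag66", "frag68", "C", "F", "C"]


def group_edges(atom_group, bonds):
    """Return sorted (group_a, group_b) pairs for bonds that cross groups."""
    edges = set()
    for a, b in bonds:
        ga, gb = atom_group[a], atom_group[b]
        if ga != gb:  # bonds inside one group are not subgraph edges
            edges.add((min(ga, gb), max(ga, gb)))
    return sorted(edges)


edges = group_edges(ATOM_GROUP, BONDS)
for ga, gb in edges:
    print(f"{GROUP_NAMES[ga]} ({ga}) -- {GROUP_NAMES[gb]} ({gb})")
```

The five edges it finds are the same pairs that appear as non-empty `group_idxs` in the BONDS listing, so no special case is needed for atoms that are not part of a multi-atom group.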

Thank you for considering this request! Looking forward to your feedback.