The `clean_confs` function in `scripts/compare_confs.py` is used to filter out conformers whose SMILES are inconsistent with the given SMILES (`corrected_smi` in this script). In my reproduction, most of the inconsistent cases are molecules with a Z/E double bond. These cases are not filtered out when `isomericSmiles=False`, which confuses me, and I'm not sure whether this is a mistake.
For example, reference conformers with SMILES `Cc1cc(C(=O)c2cnc(/N=C/N(C)C)s2)c(F)cc1Cl` and `Cc1cc(C(=O)c2cnc(/N=C\N(C)C)s2)c(F)cc1Cl` are currently all kept for comparison, although GeoMol was only used to generate conformers for `Cc1cc(C(=O)c2cnc(/N=C\N(C)C)s2)c(F)cc1Cl`.
In contrast, the code in `model/featurization.py` filters out conformers whose SMILES are inconsistent with the SMILES in the dataset.
So if I use `compare_confs.py` to calculate the performance with `isomericSmiles=False`, conformers with different isomeric SMILES are not filtered out, and the performance is the same as or even worse than before (since GeoMol generates only the one stereoisomer specified by the given SMILES).
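The effect of the flag on the two example SMILES can be checked directly with RDKit. This is only a minimal sketch of the comparison, not the repository's filtering code; the helper name `canonical` is hypothetical and introduced here for illustration.

```python
from rdkit import Chem

# The two reference SMILES from the example; they differ only in the
# configuration of the N=C double bond. Raw strings keep the backslash intact.
SMI_A = r"Cc1cc(C(=O)c2cnc(/N=C/N(C)C)s2)c(F)cc1Cl"
SMI_B = r"Cc1cc(C(=O)c2cnc(/N=C\N(C)C)s2)c(F)cc1Cl"


def canonical(smi, isomeric):
    """Hypothetical helper: canonical SMILES with or without stereo information."""
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(mol, isomericSmiles=isomeric)


# Without stereo information both strings collapse to the same SMILES,
# so a filter based on this comparison keeps both stereoisomers.
same_without_stereo = canonical(SMI_A, False) == canonical(SMI_B, False)

# With stereo information the two strings stay distinct,
# so the filter can drop the stereoisomer that was not requested.
same_with_stereo = canonical(SMI_A, True) == canonical(SMI_B, True)

print(same_without_stereo, same_with_stereo)
```

If the two booleans differ, a SMILES-equality filter can only separate the two stereoisomers when `isomericSmiles=True`.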
Performance comparison between GeoMol predictions and the reference data (before using `clean_confs`; with `clean_confs`; with `isomericSmiles=True`):
**Before**
Recall Coverage: Mean = 74.78, Median = 85.00
Recall AMR: Mean = 0.9471, Median = 0.9176
Precision Coverage: Mean = 71.84, Median = 87.50
Precision AMR: Mean = 1.0035, Median = 0.9649
**After** (with `clean_confs`; more conformers are included than before)
Recall Coverage: Mean = 74.30, Median = 90.00
Recall AMR: Mean = 0.9489, Median = 0.8797
Precision Coverage: Mean = 65.50, Median = 81.80
Precision AMR: Mean = 1.1044, Median = 1.0041
**`isomericSmiles=True`**
Recall Coverage: Mean = 83.38, Median = 100.00
Recall AMR: Mean = 0.8233, Median = 0.8079
Precision Coverage: Mean = 72.73, Median = 87.50
Precision AMR: Mean = 0.9833, Median = 0.8895
As you can see, with `isomericSmiles=True` the results reported in the GeoMol paper can be reproduced.
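To illustrate why keeping reference conformers of the other stereoisomer can lower the recall numbers, here is a minimal sketch of the metrics. It assumes coverage is the percentage of conformers whose closest match lies below an RMSD threshold and AMR is the mean of those closest-match RMSDs. The function name `coverage_and_amr`, the RMSD values, and the threshold are all made up for illustration and are not taken from `compare_confs.py`.

```python
from statistics import mean


def coverage_and_amr(rmsd, threshold):
    """Recall- and precision-style coverage (%) and AMR from an RMSD matrix.

    rmsd[i][j] is the RMSD between reference conformer i and generated conformer j.
    """
    # Recall: for each reference conformer, the closest generated conformer.
    ref_min = [min(row) for row in rmsd]
    # Precision: for each generated conformer, the closest reference conformer.
    gen_min = [min(col) for col in zip(*rmsd)]

    def summarize(mins):
        cov = 100.0 * sum(m < threshold for m in mins) / len(mins)
        return cov, mean(mins)

    return {"recall": summarize(ref_min), "precision": summarize(gen_min)}


# Made-up RMSD values (in angstroms): two reference and two generated conformers.
matching_only = [
    [0.4, 1.1],
    [0.9, 0.5],
]
# Same matrix plus one reference conformer of the other stereoisomer,
# which no generated conformer matches closely.
with_other_isomer = matching_only + [[2.0, 2.2]]

THRESHOLD = 1.25  # assumed cutoff, for illustration only

print(coverage_and_amr(matching_only, THRESHOLD))
print(coverage_and_amr(with_other_isomer, THRESHOLD))
```

In this toy case the extra reference row only affects the recall entries, because each generated conformer still has the same closest reference conformer.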
***
While digging further into this issue, I found another odd behavior: GeoMol generates conformers that are close in 3D geometry even when the input SMILES specify different stereoisomers. In other words, conformers with nearly the same 3D geometry correspond to different stereoisomers according to their SMILES. This does not happen with RDKit ETKDG, and I am not sure whether it affects GeoMol's performance on these molecules. Here are two examples:
|SMILES| GeoMol (trans) | GeoMol (cis) | ETKDG (trans) | ETKDG (cis) |
|--| -- | -- | -- | -- |
| `O=S(=O)(_N=C(_c1ccccc1)N1CCOCC1)c1ccc(Br)cc1` |![image](https://user-images.githubusercontent.com/56123242/148317945-d967bc36-8813-4eb1-8597-a9fadb000e29.png)|![image](https://user-images.githubusercontent.com/56123242/148318189-ec89a81d-414e-48b9-8568-6ad7b62e483a.png) | ![image](https://user-images.githubusercontent.com/56123242/148321905-1fe23be1-da2c-4dc0-b1e3-65d50f50f498.png) | ![image](https://user-images.githubusercontent.com/56123242/148321930-68f49cfe-08bc-4607-a877-47947e5aec33.png) |
| `Cc1cc(C(=O)c2cnc(_N=C_N(C)C)s2)c(F)cc1Cl` |![image](https://user-images.githubusercontent.com/56123242/148318227-6948f3de-876d-482d-a05f-c3ae8ac38100.png)|![image](https://user-images.githubusercontent.com/56123242/148318342-db154f0a-01be-4ae4-a0f5-811954750ad2.png) | ![image](https://user-images.githubusercontent.com/56123242/148321946-1cbb5f8f-e8ed-4429-af9f-c458137e3dbb.png) | ![image](https://user-images.githubusercontent.com/56123242/148321954-f94c5ddc-5c9a-433e-8db4-c4d8dcbb2237.png) |

Code referenced in this issue:

- `scripts/compare_confs.py`, lines 49-56 (the filter using `isomericSmiles=False`): https://github.com/PattanaikL/GeoMol/blob/5d0e85014a9546209d5b43861638caabb362ec25/scripts/compare_confs.py#L49-L56
- `model/featurization.py`, lines 125-126 (the filter against the dataset SMILES): https://github.com/PattanaikL/GeoMol/blob/5d0e85014a9546209d5b43861638caabb362ec25/model/featurization.py#L125-L126