Issue with Meaningfulness and Interpretability of Fragments Generated by SMILES Pair Encoding (SPE)

Hello,

I recently came across the study "SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning," and I found it quite interesting. However, after implementing the SMILES Pair Encoding (SPE) approach and generating fragments of molecules using the learned vocabulary, I observed that a significant portion of the fragments do not map to valid molecules when processed with RDKit's Chem.MolFromSmiles().

Issue:

The learned SPE vocabulary generates fragments that lack meaningful chemical interpretation.
A large number of fragments obtained using the learned vocabulary map to "None" when processed with RDKit's Chem.MolFromSmiles().
This raises concerns about the interpretability and accuracy of the tokenization process.
A majority of the learned substructures contained in "SPE_ChEMBL.txt" map to "None" as well: precisely, 2232/3002 learned substrings (about 74%) are not RDKit-sanitizable.
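For anyone who wants to reproduce the count, here is a minimal sketch of the kind of check described above. The helper names (`split_by_parseability`, `rdkit_parse`) and the example tokens are hypothetical and only for illustration; they are not part of SmilesPE. How the substrings are read out of "SPE_ChEMBL.txt" is left out, since that depends on the file layout.

```python
from typing import Callable, Iterable, List, Tuple


def split_by_parseability(
    tokens: Iterable[str], parse: Callable[[str], object]
) -> Tuple[List[str], List[str]]:
    """Split tokens into (parseable, unparseable).

    A token counts as unparseable when `parse` returns None, which is
    what RDKit's Chem.MolFromSmiles does for strings it cannot turn
    into a sanitized molecule.
    """
    valid: List[str] = []
    invalid: List[str] = []
    for token in tokens:
        if parse(token) is None:
            invalid.append(token)
        else:
            valid.append(token)
    return valid, invalid


def rdkit_parse(smiles: str):
    """Parse one string with RDKit; returns None on failure."""
    # Imported lazily so the helper above has no hard RDKit dependency.
    from rdkit import Chem, RDLogger

    RDLogger.DisableLog("rdApp.*")  # silence per-string parse errors
    return Chem.MolFromSmiles(smiles)


if __name__ == "__main__":
    # Hypothetical tokens for illustration, not taken from SPE_ChEMBL.txt.
    example_tokens = ["CC", "c1ccccc1", "c1ccc", "(C", "N"]
    try:
        valid, invalid = split_by_parseability(example_tokens, rdkit_parse)
        total = len(valid) + len(invalid)
        print(f"{len(invalid)}/{total} tokens are not parseable by RDKit")
    except ImportError:
        print("RDKit is not installed; skipping the demo.")
```

Passing the parser in as an argument keeps the counting logic separate from RDKit, so the same function can be reused with a different validity check (for example, parsing with sanitization turned off).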

Expected Behavior: Ideally, the fragments generated by SPE should correspond to chemically meaningful and interpretable substructures, and they should be convertible into valid molecules when processed with RDKit's Chem.MolFromSmiles().

Screenshot for reference: [image: WhatsApp Image 2023-07-29 at 10 27 55 AM]

Impact: In my opinion, the lack of meaningful and interpretable fragments generated by SPE may limit its utility in various cheminformatics applications, including molecular generation and predictive modeling. It could also hinder the interpretability of deep learning models relying on SPE tokenization.

Additional Information:

Python version: 3.7.16

Repository: XinhaoLi74/SmilesPE (issue #22)