Robustness to SMILES Permutations

JanoschMenke commented 1 year ago

Hi Guys,

I was wondering if you guys looked into the effect of using random vs. canonical SMILES. I would assume that it might be difficult to identify identical molecules written as different SMILES using gzip, or more specifically LZ77. Even if you only use canonical SMILES, it might be tough in cases where two similar molecules get very different SMILES, for example because of where branches or rings are opened.
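To make the concern concrete, here is a minimal sketch of a gzip-based normalized compression distance (NCD) applied to two SMILES strings of the same molecule. The function names `compressed_len` and `ncd`, the space used as a separator, and the choice of aspirin as the example are assumptions made for illustration; this is not taken from the molzip code.

```python
import gzip


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings.

    The space used to join the two strings is an illustrative choice.
    """
    ca, cb = compressed_len(a), compressed_len(b)
    cab = compressed_len(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


# Two valid SMILES strings for aspirin, written from different start atoms.
smiles_a = "CC(=O)Oc1ccccc1C(=O)O"
smiles_b = "O=C(O)c1ccccc1OC(C)=O"

print("same string:     ", ncd(smiles_a, smiles_a))
print("same molecule:   ", ncd(smiles_a, smiles_b))
```

If the second value is clearly larger than the first, the compressor is treating two spellings of one molecule as different objects, which is the effect described above.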

This is where "trained models" might have the edge, as they can learn the SMILES syntax more thoroughly and encode similar molecules similarly even though their SMILES are very different.

One could investigate this by identifying molecules that have a high Tanimoto similarity but whose SMILES have a large Levenshtein distance, and checking whether the performance is worse for these molecules.
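A rough sketch of how such pairs could be selected is shown below. It assumes fingerprints are already available as sets of on-bit indices (in practice they would come from a cheminformatics toolkit); the helper `hard_pairs`, its thresholds, and the toy fingerprint values are hypothetical and only illustrate the filtering logic.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = curr
    return prev[-1]


def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def hard_pairs(records, min_tanimoto=0.8, min_edit_ratio=0.5):
    """Pairs that are structurally similar but textually far apart.

    records: list of (smiles, fingerprint_set) tuples.
    The edit distance is normalized by the longer SMILES length.
    Both thresholds are arbitrary illustrative defaults.
    """
    found = []
    for (s1, f1), (s2, f2) in combinations(records, 2):
        sim = tanimoto(f1, f2)
        dist = levenshtein(s1, s2)
        if sim >= min_tanimoto and dist / max(len(s1), len(s2)) >= min_edit_ratio:
            found.append((s1, s2, sim, dist))
    return found


# Toy data: the fingerprint bits are made-up placeholders, not real fingerprints.
records = [
    ("OCC", {1, 2, 3}),       # ethanol
    ("CCO", {1, 2, 3}),       # ethanol, written from the other end
    ("c1ccccc1", {7, 8}),     # benzene
]
pairs = hard_pairs(records)
print(pairs)
```

Model performance could then be compared between the molecules in `pairs` and the rest of the test set.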

This could obviously be wrong, and usually I would not comment like this on a GitHub repository, but as it is a working paper I thought it could be useful.

Greetings, Janosch

JanoschMenke commented 1 year ago

I just saw that you have SMILES augmentation in your code, so I guess you have already performed some experiments with it.

daenuprobst commented 1 year ago

Hey Janosch,

Thanks for your thoughts. I did indeed play around with a lot of different ideas regarding SMILES permutation (from augmentation + concatenation and standard augmentation to using SECFP to bundle substructures), and everything I tried seemed to reduce the performance.

I will be getting back to the project during the weekend, so I will probably try a few additional things...

Cheers, Daniel

daenuprobst / molzip

Robustness to SMILES Permutations #15