Open jonjoncardoso opened 3 years ago
No idea, sorry. Do the SMILES resolve to different structures, or are they representing the same molecules? @edwintse any idea? Very good you're tweaking and improving. Still very keen on improving these compounds' potency.
@jonjoncardoso I've had a check and it looks like the difference between the old and new SMILES is just the way of representing the aromatic rings (i.e. using a circle rather than the Kekulé form). SMILES are always very dependent on how you draw the structure out, so the InChI and InChIKey are more consistent.
The only exception is OSM-S-351 which was changed because the old strings were incorrect (i.e. should be 2,4-Cl instead of 2,3-Cl).
Thanks @mattodd and @edwintse, we will test with InChI/InChIKeys to make sure our modelling is consistent.
Indeed some of the SMILES do resolve to slightly different structures.
Here are 2D visualizations of these structures (OSM-S-82, OSM-S-88, OSM-S-89, OSM-S-351, OSM-S-546, OSM-S-631). The molecule constructed from the old SMILES is displayed on the left, the new one on the right. (OSM-S-351 is displayed correctly on the right, as pointed out by Edwin.)
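For anyone wanting to repeat this kind of check, here is a minimal sketch of how two SMILES strings could be compared with RDKit, via both canonical SMILES and InChIKeys. The helper names (`canonical_smiles`, `inchi_key`, `same_structure`) are made up for illustration, and the example molecules are simple toy structures, not the actual OSM compounds. It assumes an RDKit build with InChI support.

```python
from rdkit import Chem


def _parse(smiles):
    """Parse a SMILES string, raising instead of silently returning None."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return mol


def canonical_smiles(smiles):
    """RDKit canonical SMILES; independent of how the input was drawn."""
    return Chem.MolToSmiles(_parse(smiles))


def inchi_key(smiles):
    """InChIKey for the structure encoded by the SMILES string."""
    return Chem.MolToInchiKey(_parse(smiles))


def same_structure(smiles_a, smiles_b):
    """True if both strings map to the same InChIKey (hypothetical helper)."""
    return inchi_key(smiles_a) == inchi_key(smiles_b)


# Toy examples (NOT OSM compounds):
kekule = "ClC1=CC=CC=C1"         # chlorobenzene, Kekulé form
aromatic = "Clc1ccccc1"          # chlorobenzene, aromatic form
dichloro_23 = "Cc1cccc(Cl)c1Cl"  # 2,3-dichlorotoluene
dichloro_24 = "Cc1ccc(Cl)cc1Cl"  # 2,4-dichlorotoluene

print(same_structure(kekule, aromatic))          # notation-only difference
print(same_structure(dichloro_23, dichloro_24))  # genuinely different isomers
```

A Kekulé/aromatic pair should come out as the same structure, while a change in substitution pattern (like the 2,3-Cl vs 2,4-Cl correction for OSM-S-351) should not.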
Hi everyone,
Our group at the Department of Informatics at King's College London - under Dr. Sophia Tsoka @sophiatsoka - have been revisiting this modelling challenge and we have some questions about changes in SMILES codes in the Master Chemical List.
Ruby (@yutongLi1997) has downloaded the newest version of the master list and compared it with the previous version I had from when I participated in Round #2 of the Competition.
She noticed that the structures listed below were a bit different this time. My guess is that these compounds had the wrong SMILES and were revised more recently, but I couldn't locate the changes in the spreadsheet. Can anyone confirm this?
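For reference, the comparison between the two versions of the list can be sketched roughly as below. This is not our actual script, just a minimal illustration: it assumes each version has already been loaded into a dict mapping compound ID to SMILES, and the IDs and SMILES in the example are placeholders, not real entries from the Master Chemical List.

```python
def changed_smiles(old, new):
    """Return {compound_id: (old_smiles, new_smiles)} for every ID present
    in both versions whose SMILES string differs (plain string comparison)."""
    shared_ids = sorted(old.keys() & new.keys())
    return {
        cid: (old[cid], new[cid])
        for cid in shared_ids
        if old[cid] != new[cid]
    }


# Placeholder data, NOT taken from the Master Chemical List.
old_list = {
    "CMPD-1": "ClC1=CC=CC=C1",
    "CMPD-2": "CCO",
    "CMPD-3": "CC(=O)O",
}
new_list = {
    "CMPD-1": "Clc1ccccc1",  # same molecule, different notation
    "CMPD-2": "CCO",         # unchanged
    "CMPD-4": "CN",          # only in the new version, so ignored
}

diff = changed_smiles(old_list, new_list)
for cid, (before, after) in diff.items():
    print(f"{cid}: {before} -> {after}")
```

Note that a plain string comparison like this also flags entries where only the notation changed, so it cannot tell a cosmetic edit from a real structural correction.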
PS: What we have been up to
We are working on improving the accuracy of our algorithm (modSAR), assessing its weaknesses and limitations, while modelling OSM data.
The model I trained on Round 2 did not predict the activity of the external test set that well, even though the algorithm had performed well on previous datasets we've worked on. Changing from CDK molecular descriptors to the more widely used RDKit circular fingerprints has already improved the fit and accuracy of the model in general, but we are still working on validating these results.
We are also planning to apply Shapley values to help explain activity and to debug model results.
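To give an idea of what Shapley values provide, here is a small self-contained sketch that computes them exactly for a toy model by averaging each feature's marginal contribution over all feature orderings. This is not modSAR and not what we would run on real data (exact enumeration only works for a handful of features); `toy_model`, the feature values, and the all-zeros baseline are assumptions made up for the example.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to baseline.

    Features are switched from their baseline value to their actual value
    one at a time, in every possible order; each feature's Shapley value is
    its average marginal change in the prediction.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]
            updated = predict(current)
            phi[i] += updated - previous
            previous = updated
    n_orders = factorial(n)
    return [total / n_orders for total in phi]


def toy_model(features):
    """Hypothetical 'activity' model with one interaction term (a * c)."""
    a, b, c = features
    return 2.0 * a + 1.0 * b + 0.5 * a * c


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The values always sum to the difference between the prediction at `x` and at the baseline, and the interaction term is split evenly between the two features involved.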
I uploaded a Jupyter notebook to our repository with an exploration of the earlier version of the dataset. Here is the link if anyone is interested.