StructuralGenomicsConsortium / CNP15-Solubility-Analysis


CNP15 Diary #6

Open qxsml opened 7 months ago

qxsml commented 7 months ago

After the project was conceived and established, @ucnvwqt began by plotting the main dataset in the simplest way, as either predicted solubility or cLogP vs. measured solubility. Software included Chemdraw, Stoplight, Chemaxon and Datawarrior. The correlation was essentially zero. Plotting measured against predicted logS gave a marginally better correlation. Through our analysis, we noticed that all the computational tools used in this study perform much better when predicting the solubility of poorly soluble compounds than of more soluble ones. This observation is consistent across both datasets. Apart from the solubility and logS analysis, we also carried out comparisons with logP predictions. The correlation coefficients varied only slightly between tools, indicating comparable predictive performance, with no single tool significantly outperforming the others.
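The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis actually run for the project: the function names, the logS threshold of -4.0 and the four data points are all assumptions made for the example, and the numbers are placeholders rather than values from the CNP15 dataset. It computes a Pearson correlation between measured and predicted logS, then compares the mean absolute error for poorly soluble and more soluble compounds, which is one possible way to check whether a tool does better at the low-solubility end.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def mae_by_solubility(measured, predicted, threshold):
    """Mean absolute error for compounds at/below and above a measured-logS threshold.

    The threshold is a hypothetical cut-off chosen by the caller.
    A group with no compounds is reported as None.
    """
    groups = {"poorly_soluble": [], "more_soluble": []}
    for m, p in zip(measured, predicted):
        key = "poorly_soluble" if m <= threshold else "more_soluble"
        groups[key].append(abs(p - m))
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}


# Placeholder values for illustration only; NOT taken from the CNP15 dataset.
measured_logS = [-6.0, -5.0, -2.0, -1.0]
predicted_logS = [-5.5, -5.5, -4.0, 1.0]

r = pearson_r(measured_logS, predicted_logS)
errors = mae_by_solubility(measured_logS, predicted_logS, threshold=-4.0)
print(f"r = {r:.2f}", errors)
```

Reporting an error per group alongside the overall correlation is useful here because a single coefficient over the whole dataset can hide a tool that works reasonably in one solubility range and badly in another.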

One possible cause of the poor correlation could be that many of the compounds are salts: perhaps the software handles these compounds poorly, or the calculated values are poor because the salts are misrepresented when drawn. Such an explanation seems unlikely since so few compounds in the dataset are salts, but this possibility is still being looked at. So far, the salts do not appear to contribute to the inaccuracy of the tools, as only 11 salt compounds were identified in the master dataset. Notably, the experimental solubility of these salt compounds was measured with the compounds in their neutral form. We also found that tools such as Stoplight and Chemaxon are particularly good at predicting the solubility of salt compounds when they are in the form of neutral pairs.
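One way to separate possible salts from the rest of a dataset, so that correlations can be compared with and without them, is sketched below. It assumes each compound is stored with a SMILES string, which this diary does not state, and it is not necessarily how the 11 salts were identified. In SMILES, a `.` separates disconnected components, so a string containing one is multi-component; that covers salts but also solvates and mixtures, so flagged entries still need a manual check. The function names, the `smiles` key and the three example compounds are hypothetical.

```python
def is_multicomponent(smiles):
    """True if the SMILES string has more than one disconnected component."""
    return "." in smiles.strip()


def split_by_salt_flag(records, smiles_key="smiles"):
    """Split records into (multi-component, single-component) lists."""
    flagged, rest = [], []
    for rec in records:
        (flagged if is_multicomponent(rec[smiles_key]) else rest).append(rec)
    return flagged, rest


# Generic textbook structures for illustration; not entries from the dataset.
records = [
    {"name": "sodium acetate", "smiles": "CC(=O)[O-].[Na+]"},
    {"name": "ethanol", "smiles": "CCO"},
    {"name": "acetic acid", "smiles": "CC(=O)O"},
]

salt_like, others = split_by_salt_flag(records)
print([r["name"] for r in salt_like])
```

Running the same correlation on `others` alone and on the full list would show whether removing the flagged compounds changes the result.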

In our analysis, we determined that the computational tools employed in this study do not predict the solubility of the tested compounds with reasonable accuracy. However, they are more accurate for less soluble compounds. No clear reasons for the overall inaccuracy of these tools were identified in this study. We found that Chemaxon emerges as the most consistent tool for predicting logS, while Datawarrior is comparatively reliable for predicting logP. Despite these observations, the correlation coefficients for both tools remain relatively low, suggesting room for improvement. There is also an additional dataset of experimental values for aggregation, but it was not available in time for the completion of this project. Given the inaccuracy in solubility prediction, future research could focus on how well the tools predict aggregation, as well as on finding the reasons behind the inaccuracy.