StructuralGenomicsConsortium / CNP15-Solubility-Analysis

Graph Analysis #1

Open ucnvwqt opened 10 months ago

ucnvwqt commented 10 months ago

[Figure: Predicted vs experimental solubility]
[Figure: cLogP vs experimental solubility]

@mattodd Hi Professor Todd, I wonder if these are the graphs that you mentioned during our last meeting. For your information, I have excluded values above 10000 uM to make the solubility graph look better. Do you have any advice or comments on these graphs, and what should I do next for their analysis?

qxsml commented 10 months ago

Hi Vicky, do you have a graph of "predicted solubility against experimental solubility" that excluded points >300 uM on the y-axis, as we discussed last Friday?

mattodd commented 10 months ago

Hey there @ucnvwqt - great that you're posting updates.

Hmmm.

In the lower graph I guess you can see the points pretty well, i.e. the data are clear. There's no real correlation, but you can at least see everything. You say that you excluded values > 10000 uM, but how many points is that? I ask only because we kind of like compounds that are very soluble!
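One way to answer the "how many points" question is to count what the cutoff removes before plotting. This is a minimal sketch: the function name `split_by_cutoff` and the example values are hypothetical and are not taken from the project data.

```python
def split_by_cutoff(solubilities_um, cutoff_um=10000.0):
    """Split solubility values (in uM) into kept and excluded lists."""
    kept = [s for s in solubilities_um if s <= cutoff_um]
    excluded = [s for s in solubilities_um if s > cutoff_um]
    return kept, excluded


# Hypothetical example values in uM, for illustration only.
example = [12.0, 250.0, 9800.0, 15000.0, 42000.0]
kept, excluded = split_by_cutoff(example)
print(f"{len(excluded)} of {len(example)} points exceed the cutoff")
```

Reporting the excluded count alongside the graph makes it clear how many of the very soluble compounds are missing from the picture.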

Can you think about adding a third piece of software onto the graph?

For the upper graph, things are hard to see. Lots of very low values creating a kind of orange baseline. Consider plotting log(predicted solubility)?
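A log transform of this kind could look like the sketch below. It assumes the values are in uM and that the target is log S in the usual sense of log10 of solubility in mol/L; the function name `um_to_log_s` is hypothetical. Values of zero or below have no logarithm, so they are returned as `None` rather than silently dropped.

```python
import math


def um_to_log_s(solubility_um):
    """Convert a solubility in uM to log S (log10 of mol/L).

    Returns None for non-positive values, which cannot be log-transformed.
    """
    if solubility_um <= 0:
        return None
    # 1 uM = 1e-6 mol/L, so log10(M) = log10(uM) - 6.
    return math.log10(solubility_um) - 6.0


# Hypothetical example values in uM, for illustration only.
for value in [0.0, 1.0, 300.0, 10000.0]:
    print(value, um_to_log_s(value))
```

On the log scale, values from 1 uM to 10000 uM span only four units, so the very low values no longer collapse onto a baseline.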

We will have to come back to this issue of salts vs neutral compounds. Maybe you can start a new issue on that. Does it make a difference how we draw the salts - either as two neutral compounds or as a salt (where the proton has transferred)? Then we can start thinking about whether the software has more trouble with calculating solubility for salts - are these the biggest outliers?

ucnvwqt commented 10 months ago

[Figure: Predicted log S vs experimental log S]

@mattodd here is the graph that I've plotted for log(solubility). Three different software packages are included in the graph, as you requested above. Do you have any comments on the graph?

ucnvwqt commented 10 months ago

[Figure: Predicted solubility vs experimental solubility (<300 uM)]

Hi Xin, here it is!

mattodd commented 10 months ago

OK @ucnvwqt in the above scheme it still looks like the points on the graph are quite heavily bunched together. I wonder what would happen if you use log solubility instead? Any better?

On the face of it the correspondence is, let's face it, dreadful. If you'd asked me before you started whether there would be a good correlation, I'd have said there would have been an ok correlation, with outliers. I'd not expected it to be so poor. So I'm keen that we try to make sure we're not missing anything obvious. One obvious thing is to try log values.

Another obvious thing is to see (as part of your lit analysis) whether anyone has done this kind of analysis before. You need to find those studies in any case, for the background to your thesis, so let's see whether the correlation is also bad there.

A quick check: you are calculating "solubility", right? Not logP? I wonder if the experimental solubility is better correlated with some other calculated value, like logP or logD etc.

I'll write a new issue about the salts thing.

ucnvwqt commented 10 months ago

[Figure: Predicted vs experimental solubility 2]

@mattodd The attached graph is plotted using log S, and I have included one more software package in it. Instead of using solubility in uM, I think it is wiser to plot log S, as the impact of outliers on the graph is much lower. You can compare it with the first graph at the top.

[Figure: Predicted log P vs experimental log S]

I have also plotted a graph of predicted log P vs experimental log S. What do you think?

mattodd commented 10 months ago

@ucnvwqt well I think these are certainly better representations, in terms of data scatter. The trends are a little easier to see, but obviously the correlations are still very weak.

I'm assuming that you get the same general scatter for each of our three datasets - the original, the additional, and the separate dataset from Han Wee Ong?

You could manually take a look at some of the biggest outliers - maybe take 10 compounds that are significantly off the line and see if there is anything notable about their structures?
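Picking the compounds that sit furthest from the line can be done by ranking residuals. The sketch below assumes "the line" means an ordinary least-squares trend line through predicted vs experimental log S; if the identity line (predicted = experimental) is meant instead, the residual is simply the difference between the two values. The names `fit_line` and `largest_outliers`, and the example data, are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def largest_outliers(names, xs, ys, top_n=10):
    """Return (name, residual) pairs with the largest absolute residuals."""
    slope, intercept = fit_line(xs, ys)
    residuals = [
        (name, y - (slope * x + intercept))
        for name, x, y in zip(names, xs, ys)
    ]
    residuals.sort(key=lambda item: abs(item[1]), reverse=True)
    return residuals[:top_n]


# Hypothetical compounds and log S values, for illustration only.
names = ["A", "B", "C", "D", "E"]
exp_log_s = [-5.0, -4.0, -3.0, -2.0, -1.0]
pred_log_s = [-5.0, -4.0, 0.0, -2.0, -1.0]
for name, residual in largest_outliers(names, exp_log_s, pred_log_s, top_n=2):
    print(name, round(residual, 2))
```

Note that a single extreme point also pulls the fitted line towards itself, so it is worth looking at the ranked list on the plot before settling on the 10 compounds.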

If there is nothing obvious, then I think we need now to start comparing what we're finding with lit, which will be an important part of your thesis. In an ideal world you would find some published studies of experimental vs calculated solubilities and in that ideal world the paper would contain a dataset similar to the ones we have, that would allow you to check your method vs the published one, just to confirm that there is not something about your method that is unusual. In other words: are your results unexpected?

ucnvwqt commented 10 months ago

[Figure: outliers]

@mattodd the attached image contains 6 compounds that are significantly off the line. I noticed that these compounds lack benzene rings, while most of the other compounds contain at least 2 to 3 benzene rings. I wonder whether benzene rings really have an effect on the solubility of the compounds.

Furthermore, I have calculated the correlation coefficient (r) for each software package in predicting log S and log P. Do you have any idea what other statistical calculations I should include in the analysis?

Apart from that, I plan to discuss the structures of the significant outliers in the discussion section. What else do you think I can add to the discussion part of my thesis?
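As a possible starting point for the statistics question, the sketch below computes a few measures that are commonly reported next to Pearson r: Spearman's rank correlation, the root-mean-square error, and the mean signed error (bias). These are suggestions rather than anything agreed in this thread, and all function names and example values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


def rmse(predicted, experimental):
    """Root-mean-square error between predicted and experimental values."""
    n = len(predicted)
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(predicted, experimental)) / n)


def mean_signed_error(predicted, experimental):
    """Average of (predicted - experimental); positive means over-prediction."""
    return sum(p - e for p, e in zip(predicted, experimental)) / len(predicted)


# Hypothetical log S values: every prediction is exactly one log unit too high.
exp_log_s = [-6.0, -5.0, -4.0, -3.0]
pred_log_s = [-5.0, -4.0, -3.0, -2.0]
print("r =", pearson_r(pred_log_s, exp_log_s))
print("rho =", spearman_rho(pred_log_s, exp_log_s))
print("RMSE =", rmse(pred_log_s, exp_log_s))
print("bias =", mean_signed_error(pred_log_s, exp_log_s))
```

The example is chosen to show why r alone can mislead: the correlation is perfect even though every prediction is off by a full log unit, which only the error measures reveal.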

mattodd commented 9 months ago

Hi @ucnvwqt - well, I'm not sure the outliers have much in common. One does indeed have a benzene ring and others have aromatic heterocyclic rings. The top one is very polar (lots of H-bonding going on). Four of the compounds have reactive functional groups that are likely to make them useful in the discovery of covalent inhibitors. But nothing's leaping out. You'd need to see whether any of these features exist also in compounds close to the trend line to see whether they might be responsible for their being outliers. In short, I don't yet know.
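The comparison described here, checking whether a feature is more common among the outliers than among compounds near the trend line, can be sketched as a simple frequency count. The sketch assumes each compound has already been annotated by hand with a set of feature labels; the labels, compound names, and function names below are all hypothetical.

```python
def feature_fraction(compounds, feature):
    """Fraction of compounds whose feature set contains the given label."""
    if not compounds:
        return 0.0
    hits = sum(1 for compound in compounds if feature in compound["features"])
    return hits / len(compounds)


def compare_feature(outliers, near_trend, feature):
    """Return (fraction among outliers, fraction among near-trend compounds)."""
    return feature_fraction(outliers, feature), feature_fraction(near_trend, feature)


# Hypothetical hand annotations, for illustration only.
outliers = [
    {"name": "O1", "features": {"reactive group", "heteroaromatic ring"}},
    {"name": "O2", "features": {"reactive group"}},
    {"name": "O3", "features": {"benzene ring"}},
]
near_trend = [
    {"name": "N1", "features": {"benzene ring", "reactive group"}},
    {"name": "N2", "features": {"benzene ring"}},
    {"name": "N3", "features": {"benzene ring", "heteroaromatic ring"}},
    {"name": "N4", "features": {"benzene ring"}},
]
for label in ["reactive group", "benzene ring"]:
    print(label, compare_feature(outliers, near_trend, label))
```

With only a handful of outliers, differences in these fractions are suggestive at best, so they are a prompt for a closer look at the structures rather than evidence of a cause.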