IBM / pychemex

Python library for Cheminformatics ML model explainability
Apache License 2.0

H10: What is your Chemical Space? #19

Open Alex-AMC opened 3 years ago

Alex-AMC commented 3 years ago

Problem

It's always annoying as a scientist when you are not sure exactly what kind of chemical space your data covers. Typically people will say to you:

But how do you know you're covering the correct chemical space?

Idea

A set of simple descriptors should be used to describe the possible chemical space and the space your data occupies.

Current suggestions: Molecular Weight, Fragment Types

I'm not sure how best to present this information. As a percentage complete? I.e. of the range of fragments available, your dataset covers x% as a whole, and the least/most frequent fragment counts are as follows....
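A rough sketch of what that "% complete" summary could look like. This is only an illustration: `fragment_coverage`, the input layout (one list of fragment labels per molecule), and the reference list of "available" fragments are all assumptions, not anything that exists in pychemex.

```python
from collections import Counter


def fragment_coverage(dataset_fragments, reference_fragments):
    """Summarise how much of a reference fragment set a dataset covers.

    dataset_fragments: one list of fragment labels per molecule (hypothetical layout).
    reference_fragments: every fragment type considered "available" (hypothetical).
    """
    reference = set(reference_fragments)
    # Count each fragment at most once per molecule, ignoring unknown labels.
    counts = Counter(
        frag
        for mol in dataset_fragments
        for frag in set(mol)
        if frag in reference
    )
    percent = 100.0 * len(counts) / len(reference) if reference else 0.0
    ranked = counts.most_common()  # sorted from most to least frequent
    return {
        "percent_covered": percent,
        "most_common": ranked[0] if ranked else None,
        "least_common": ranked[-1] if ranked else None,
        "missing": sorted(reference - set(counts)),
    }


if __name__ == "__main__":
    data = [["OH", "CH3"], ["CH3"], ["CH3", "NH2"]]
    available = ["OH", "CH3", "NH2", "COOH"]
    print(fragment_coverage(data, available))
```

The `missing` list is probably the most useful part for answering "are you covering the right chemical space?", since it names exactly what the dataset has never seen.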

JJanowiak commented 3 years ago

I feel like asking about the chemical space is always a question to make one look smarter. I never found a concrete definition of what that actually means.

I alluded to this in #17, and I think we need to discuss this and agree on what it means for the project.

My thinking on this is:

"True" chemical space: Defined by "chemistry" and covers all complexities of molecules within a dataset. It's what we're trying to capture mathematically in cheminformatics.

Practical chemical space: Defined by the feature space used and limited by the feature values of the molecules in the dataset (i.e. if I have features x_1 and x_2, the chem-space is a 2D plane bounded by min(x_1), max(x_1), min(x_2), and max(x_2)). How well the chem-space is covered could be approximated by the density distribution of samples in this feature space. This could be compared across datasets to see how well the training set covers the chem-space of some unseen dataset, and maybe give a measure of extrapolatability. The obvious issue here is that we don't know how well the defined feature space captures the underlying true chemical space (e.g. if our feature space is defined solely by molecular weight and our dataset has a bunch of alkanes, it'll give the impression that it covers all possible molecules).

What we should do: Stick to the practical chemical space and let individual users worry about how that relates to the "true" chem-space (with the aid of tools like #17 etc.). Hence, we should focus on developing analytics and visualisation tools that will help users understand how well covered their chem-space is and how datasets compare to each other.
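To make the practical chemical space idea concrete, here is a minimal pure-Python sketch of the bounding box, a crude density-based coverage measure, and a train-versus-unseen comparison. The function names and the equal-width grid are assumptions made for illustration; a real tool would likely use a proper density estimate and would need care in high-dimensional feature spaces, where most grid cells are empty.

```python
def feature_bounds(rows):
    """Return per-feature (mins, maxs): the box that limits the chem-space."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def fraction_inside(train_rows, new_rows):
    """Fraction of unseen samples that fall inside the training bounding box."""
    lo, hi = feature_bounds(train_rows)
    inside = sum(
        all(l <= v <= h for v, l, h in zip(row, lo, hi))
        for row in new_rows
    )
    return inside / len(new_rows)


def occupied_cell_fraction(rows, bins=10):
    """Split the bounding box into an equal-width grid and report the
    fraction of cells holding at least one sample (a crude density proxy)."""
    lo, hi = feature_bounds(rows)
    cells = set()
    for row in rows:
        index = []
        for v, l, h in zip(row, lo, hi):
            width = (h - l) / bins
            i = int((v - l) / width) if width > 0 else 0
            index.append(min(i, bins - 1))  # the max value goes in the last bin
        cells.add(tuple(index))
    return len(cells) / bins ** len(lo)


if __name__ == "__main__":
    train = [[0, 0], [1, 1], [2, 4]]            # features x_1, x_2
    unseen = [[1, 2], [3, 1], [0.5, 0.5], [2, 5]]
    print(fraction_inside(train, unseen))        # share of unseen data in the box
    print(occupied_cell_fraction(train, bins=2)) # share of grid cells occupied
```

A low `fraction_inside` would hint that a model is being asked to extrapolate, while a low `occupied_cell_fraction` would suggest the box is wide but sparsely sampled.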

Alex-AMC commented 3 years ago

@JJanowiak I agree with your definition of practical chemical space. That's really what it is. Typically, when people ask about the chemical space, they are referring to a couple of things.

I think working on feature vectors as you've described is the right way to go about it. But I would be tempted to suggest some of the above as defaults for how to use it.