bp-kelley / descriptastorus

Descriptor computation(chemistry) and (optional) storage for machine learning

Normalized rdkit feature #3

Closed kexul closed 4 years ago

kexul commented 4 years ago

Hi, thanks for your great work. May I ask which normalization method you used for the rdkit features? I noticed that dists.py contains some values, but what do they mean, and how were those numbers calculated? Thanks!

bp-kelley commented 4 years ago

Sorry I missed this issue! It is embarrassing that the distribution generation code didn't make it into the repository. The raw data is there in:

https://github.com/bp-kelley/descriptastorus/tree/master/data/d_descriptors

The distributions are fairly straightforward:

['d_descriptors/d_BalabanJ.gz', ('nct', (4.182658638749994, 2.0965263482828114, 1.1271767054584343, 0.2616489125636474)), 0.0, 7.289359191119452, 1.8039183288289355, 0.47846986656304485]

The first entry is the file holding the underlying data for the descriptor. The next is the name of the distribution used, together with the parameters to pass to that distribution's constructor.

The code used to fit the distributions was essentially taken from this Stack Overflow thread:

https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python
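As a rough illustration of that fitting approach (not the original generation code, which is not in the repository), the sketch below fits a few candidate SciPy distributions to a descriptor's values and keeps the one whose pdf has the lowest sum of squared errors against a histogram of the data. The function name `best_fit`, the candidate list, and the bin count are assumptions made for this example.

```python
import numpy as np
from scipy import stats


def best_fit(values, candidates=("norm", "gamma", "lognorm"), bins=50):
    """Return (name, params, sse) for the best-fitting candidate distribution.

    Hypothetical helper: the candidate list and bin count are illustrative.
    """
    # Histogram of the data, normalized so it is comparable to a pdf.
    density, edges = np.histogram(values, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2.0

    best = None
    for name in candidates:
        dist = getattr(stats, name)
        try:
            # SciPy returns (shape parameters..., loc, scale).
            params = dist.fit(values)
            pdf = dist.pdf(centers, *params)
            sse = float(np.sum((density - pdf) ** 2))
        except Exception:
            # Skip distributions that fail to fit this data.
            continue
        if not np.isfinite(sse):
            continue
        if best is None or sse < best[2]:
            best = (name, params, sse)
    return best


# Synthetic data standing in for one descriptor's values.
rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=0.5, size=2000)
name, params, sse = best_fit(sample)
```

The returned `name` and `params` have the same shape as the `('nct', (...))` pair in the entry shown earlier.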

Then the cdf, rather than the pdf, is used to normalize the values, as shown here:

https://github.com/bp-kelley/descriptastorus/blob/master/descriptastorus/descriptors/rdNormalizedDescriptors.py
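To make the cdf step concrete, here is a minimal sketch that rebuilds a normalizing function from the BalabanJ entry quoted above. It follows SciPy's convention that the last two fitted parameters are `loc` and `scale`. The thread does not say what the four trailing numbers mean; this sketch assumes the first two are the minimum and maximum descriptor values and uses them only to clip the input. That reading, and the name `normalize`, are assumptions; see rdNormalizedDescriptors.py for what the library actually does.

```python
import numpy as np
from scipy import stats

# Entry copied from the comment above.
entry = ['d_descriptors/d_BalabanJ.gz',
         ('nct', (4.182658638749994, 2.0965263482828114,
                  1.1271767054584343, 0.2616489125636474)),
         0.0, 7.289359191119452, 1.8039183288289355, 0.47846986656304485]

# ASSUMPTION: the first two trailing numbers are min/max bounds;
# the last two are ignored here.
_, (dist_name, params), min_v, max_v, _, _ = entry

# SciPy convention: shape parameters first, then loc and scale.
*shape, loc, scale = params
dist = getattr(stats, dist_name)


def normalize(value):
    """Map a raw descriptor value to [0, 1] via the fitted cdf."""
    clipped = np.clip(value, min_v, max_v)
    result = dist.cdf(clipped, *shape, loc=loc, scale=scale)
    return float(np.clip(result, 0.0, 1.0))


normalized = normalize(1.8)
```

Because a cdf is monotonic, the normalized values keep the ordering of the raw descriptor values while landing in the unit interval.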

I hope this helps, and I will certainly add a comment on how the distributions were fit in the next update. The current distributions are fairly poor for descriptors that report only, say, three distinct values. I think these cdfs could be done better using a histogram.
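One way the histogram idea could look, as a sketch only (nothing like this is confirmed to exist in the repository): build a cumulative histogram of the descriptor values and interpolate between bin edges. The name `make_histogram_cdf` and the default bin count are assumptions.

```python
import numpy as np


def make_histogram_cdf(values, bins=100):
    """Return a function mapping a raw value to its histogram-based cdf."""
    counts, edges = np.histogram(values, bins=bins)
    # Cumulative fraction of the data at each bin edge, starting from 0.
    cumulative = np.concatenate([[0.0], np.cumsum(counts) / counts.sum()])

    def cdf(value):
        # np.interp clamps to 0 below the first edge and 1 above the last.
        return float(np.interp(value, edges, cumulative))

    return cdf


# A descriptor that only takes three distinct values.
cdf = make_histogram_cdf([0, 0, 1, 1, 1, 2])
low, mid, high = cdf(-1.0), cdf(1.0), cdf(3.0)
```

This needs no distributional assumption, so it avoids forcing a smooth parametric curve onto a descriptor with only a handful of values.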

kexul commented 4 years ago

Thanks for your reply! It looks straightforward and easy to understand.