Closed by kexul 4 years ago
Sorry I missed this issue! It is embarrassing that the distribution generation code didn't make it into the repository, but the raw data is there in:
https://github.com/bp-kelley/descriptastorus/tree/master/data/d_descriptors
The distributions are fairly straightforward:
['d_descriptors/d_BalabanJ.gz', ('nct', (4.182658638749994, 2.0965263482828114, 1.1271767054584343, 0.2616489125636474)), 0.0, 7.289359191119452, 1.8039183288289355, 0.47846986656304485]
The first entry is the path to the underlying data for the descriptor. The next is the name of the distribution used ('nct' here) together with the parameters to pass to that distribution's constructor.
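To make the layout of such an entry concrete, here is a minimal sketch of how it could be unpacked and turned into a normalizing function. It assumes the distribution name refers to a `scipy.stats` distribution and that the parameters follow the ordering scipy's `fit()` returns (shape parameters first, then `loc` and `scale`). Neither is stated outright in the post, and `normalize` is a hypothetical helper, not a function from the package. The four trailing numbers of the entry are not described above, so the sketch leaves them unused.

```python
from scipy import stats

# Entry copied verbatim from the post above.
entry = ['d_descriptors/d_BalabanJ.gz',
         ('nct', (4.182658638749994, 2.0965263482828114,
                  1.1271767054584343, 0.2616489125636474)),
         0.0, 7.289359191119452, 1.8039183288289355, 0.47846986656304485]

# Path to the raw data, (distribution name, parameters), and the remaining
# four numbers, which this sketch does not interpret.
data_path, (dist_name, params), *trailing = entry

# Assumption: parameters are ordered as (shape..., loc, scale),
# which is the order scipy's fit() returns them in.
*shape, loc, scale = params

# Look the distribution up by name and freeze it with its parameters.
dist = getattr(stats, dist_name)(*shape, loc=loc, scale=scale)


def normalize(value):
    """Map a raw descriptor value into [0, 1] using the fitted CDF."""
    return float(dist.cdf(value))


print(normalize(1.8))
```

Because a CDF is monotonic, larger raw values always map to larger normalized values, and everything lands between 0 and 1 regardless of the descriptor's original scale.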
The code used to fit the distributions was essentially taken from this Stack Overflow comment:
Then, instead of the PDF, the CDF is used, as shown here:
I hope this helps, and I will certainly add a comment on how the distributions were fit in the next update. The current distributions are fairly poor for descriptors that only take, say, three distinct values. I think these CDFs could be done better using a histogram.
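As a rough illustration of the two ideas in this reply, the sketch below first fits a distribution and normalizes with its CDF, then builds a data-driven CDF for a descriptor that takes only a few distinct values. It is not the code used for the package. The data is synthetic, `stats.gamma` is an arbitrary choice of distribution, and `empirical_cdf` is a hypothetical helper. Note also that it uses an empirical CDF built from the sorted sample rather than a binned histogram, which is one simple way to realize the histogram suggestion.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in for a continuous descriptor; the real values
# live in the d_descriptors data files.
continuous = rng.gamma(shape=2.0, scale=1.5, size=2000)

# Parametric route: fit a distribution, then normalize with its CDF.
params = stats.gamma.fit(continuous)   # returns (a, loc, scale)
fitted = stats.gamma(*params)          # frozen distribution
parametric = fitted.cdf(continuous)    # values in [0, 1]


def empirical_cdf(sample):
    """Return a function giving the fraction of sample values <= x."""
    ordered = np.sort(np.asarray(sample, dtype=float))

    def cdf(x):
        # Count how many sample values are <= x, then normalize.
        return np.searchsorted(ordered, x, side="right") / ordered.size

    return cdf


# Hypothetical descriptor with only three distinct values, the case
# where a smooth fitted distribution tends to describe the data badly.
few_values = rng.choice([0.0, 1.0, 2.0], size=2000, p=[0.6, 0.3, 0.1])
ecdf = empirical_cdf(few_values)

print(ecdf(np.array([0.0, 1.0, 2.0])))
```

The empirical version needs no distributional assumption, so it follows the steps in discrete data exactly, at the cost of storing the sorted sample (or a binned summary of it) instead of a handful of parameters.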
Thanks for your reply! It looks straightforward and easy to understand.
Hi, thanks for your great work. May I ask which normalization method you used for the RDKit features? I noticed that dists.py shows some values, but what do they mean? And how were those numbers calculated? Thanks!