Closed StefanIsSmart closed 1 week ago
Could anyone help me?
These are the distributions fit to the raw data in the gz files here: https://github.com/bp-kelley/descriptastorus/tree/e5a2e98973a4cfd62402d53faae49f838d160656/data/d_descriptors
The raw data are a random sample from ChEMBL as of six years ago.
Thank you for pointing us to that folder of raw data @bp-kelley.
Two follow-up questions:
I see the raw data is an array of numbers. Was the list of molecules randomly sampled also saved in addition to their descriptor values?
Do you know why certain distributions were chosen for each descriptor? For example, mielke for ExactMolWt, with clipping cutoffs 7.01545597009 and 7902.703267132.
The distributions are fitted using this script:
https://github.com/bp-kelley/descriptastorus/blob/master/data/d_descriptors/make_histdists.py
There is another normalization method, RDKit2DHistogramNormalized, which only uses the raw values to make the CDF and doesn't fit distributions. This is quite a bit faster but hasn't been tested as heavily.
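For readers following along, here is a minimal sketch of what "using the raw values to make the CDF" can look like. It is not the RDKit2DHistogramNormalized implementation: the function name `make_empirical_cdf` is hypothetical, and the reference values are synthetic stand-ins rather than the ChEMBL sample.

```python
import numpy as np


def make_empirical_cdf(raw_values):
    """Return a function mapping descriptor values to [0, 1] via the empirical CDF.

    Hypothetical helper for illustration; not part of descriptastorus.
    """
    sorted_vals = np.sort(np.asarray(raw_values, dtype=float))
    n = sorted_vals.size

    def cdf(x):
        # Fraction of reference values that are less than or equal to x.
        return np.searchsorted(sorted_vals, x, side="right") / n

    return cdf


# Synthetic stand-in for one descriptor's raw values (assumption, not real data).
rng = np.random.default_rng(0)
reference = rng.lognormal(mean=5.8, sigma=0.4, size=10_000)

normalize = make_empirical_cdf(reference)
probe = np.array([reference.min(), np.median(reference), reference.max()])
scaled = normalize(probe)
```

No distribution is fitted here, so building the mapping costs only a sort, and each lookup is a binary search. The trade-off is that values outside the reference range all saturate at 0 or 1.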
I regret that I didn't save the ChEMBL molecules used to make the descriptors. I'll see if I have them floating around anywhere.
The molecules were randomly selected from ChEMBL, by the way.
It looks like I didn't fully answer the question. SciPy distributions can be fit to empirical PDFs: https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python
This was the technique used.
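As a rough illustration of that technique, the sketch below fits a few SciPy distributions to one descriptor's values, keeps the one whose PDF best matches the histogram, and then normalizes through the fitted CDF. It is not a copy of make_histdists.py. The helper `fit_best_distribution`, the candidate list, the sum-of-squared-errors ranking, and the use of the observed minimum and maximum as clipping cutoffs are all assumptions, and the data are synthetic.

```python
import numpy as np
from scipy import stats


def fit_best_distribution(values, candidates, bins=50):
    """Fit each candidate distribution and rank it against the histogram.

    Hypothetical helper: returns (distribution, fitted params, SSE) for the
    candidate with the smallest sum of squared errors versus the density.
    """
    density, edges = np.histogram(values, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2.0
    best = None
    for dist in candidates:
        params = dist.fit(values)  # maximum-likelihood fit
        sse = float(np.sum((density - dist.pdf(centers, *params)) ** 2))
        if best is None or sse < best[2]:
            best = (dist, params, sse)
    return best


# Synthetic stand-in for one descriptor's raw values (assumption, not real data).
rng = np.random.default_rng(0)
values = rng.gamma(shape=4.0, scale=90.0, size=5_000)

candidates = [stats.norm, stats.gamma, stats.lognorm]
dist, params, sse = fit_best_distribution(values, candidates)

# Assumed clipping cutoffs: the smallest and largest values seen in the sample.
low, high = float(values.min()), float(values.max())


def normalize(x):
    # Clip to the observed range, then map through the fitted CDF to [0, 1].
    return dist.cdf(np.clip(x, low, high), *params)
```

Under these assumptions, the stored "values" for a descriptor would be the chosen distribution's name, its fitted parameters, and the two cutoffs; applying `normalize` then only needs those numbers, not the raw sample.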
For the next major version, I'll have the compounds used for the distribution and the code used to make them in the Data directory.
Great work! But I have a simple question: why do we use these values for normalization, and how were they obtained?