ehoogeboom / e3_diffusion_for_molecules

MIT License

About dataset_info in datasets_config.py #13

Closed lsnty5190 closed 10 months ago

lsnty5190 commented 2 years ago

Hi! Thanks for your impressive work on the molecular diffusion model! I'm wondering how to use EDM on a custom dataset. The keys n_nodes and distances in datasets_config.py confuse me. How can I obtain these values from my custom dataset? I would really appreciate your help. Thanks!

ehoogeboom commented 1 year ago

Hi, n_nodes is simply how many atoms a molecule contains (or how many points a point cloud contains). It is used to build a histogram (which is a categorical distribution) so we can sample the number of atoms. So {1: 10000, 2: 13000} means that there are 10000 molecules with 1 atom and 13000 with 2.
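For a custom dataset, a histogram of this shape could be built roughly as follows. This is a minimal sketch, not code from the repository: the helper names build_n_nodes and sample_n_nodes and the toy data are hypothetical, and it assumes each molecule is available as a list of atoms.

```python
import random
from collections import Counter


def build_n_nodes(molecules):
    """Map each atom count to the number of molecules with that many atoms."""
    return dict(Counter(len(mol) for mol in molecules))


def sample_n_nodes(n_nodes, rng=random):
    """Draw one atom count from the categorical distribution implied by the histogram."""
    sizes = list(n_nodes.keys())
    counts = list(n_nodes.values())
    # random.choices normalizes the weights, so raw counts can be used directly.
    return rng.choices(sizes, weights=counts, k=1)[0]


# Toy dataset (hypothetical): each molecule is a list of atom symbols.
toy_molecules = [
    ["H", "H"],
    ["O", "H", "H"],
    ["C", "H", "H", "H", "H"],
    ["H", "H"],
]

n_nodes = build_n_nodes(toy_molecules)
sampled_size = sample_n_nodes(n_nodes)
```

The resulting dictionary has the same shape as the {1: 10000, 2: 13000} example: keys are atom counts and values are how many molecules have that count.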

About "distances", I am not 100% sure, but I think it is used for analysis of samples after training. I don't think it is necessary for training or sampling from a model. @vgsatorras might know this better.

vgsatorras commented 1 year ago

distances is calculated in this function: https://github.com/ehoogeboom/e3_diffusion_for_molecules/blob/fce07d701a2d2340f3522df588832c2c0f7e044a/qm9/analyze.py#L173

It is just the histogram of relative distances between atoms. But this is not really necessary to train the model. It is just for some analysis when comparing the distribution of relative distances between generated and sampled molecules.
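To illustrate what such a histogram of relative distances looks like, here is a minimal sketch. It is not the linked function from qm9/analyze.py: the name distance_histogram, the number of bins, and the maximum distance are all assumptions chosen for illustration, and each molecule is assumed to be given as a list of 3D coordinates.

```python
import math
from itertools import combinations


def distance_histogram(molecules, n_bins=10, max_dist=5.0):
    """Count pairwise atom distances per bin, pooled over all molecules.

    n_bins and max_dist are illustrative assumptions, not the repository's values.
    Distances at or above max_dist are placed in the last bin.
    """
    counts = [0] * n_bins
    bin_width = max_dist / n_bins
    for positions in molecules:
        # Each unordered pair of atoms contributes one distance.
        for a, b in combinations(positions, 2):
            d = math.dist(a, b)
            idx = min(int(d / bin_width), n_bins - 1)
            counts[idx] += 1
    return counts


# Toy molecule (hypothetical): three atoms given as (x, y, z) coordinates.
toy_positions = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]]
hist = distance_histogram(toy_positions)
```

A histogram like this would be computed once for the dataset and once for generated molecules, and the two compared after training.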

And n_nodes is the histogram of the number of nodes.

Best, Victor