Rose-STL-Lab / LIMO

generative model for drug discovery

About the metric #21

Closed tszslovewanpu closed 3 months ago

tszslovewanpu commented 3 months ago

Hello, I'm back again~

Take logP targeting as an example:

I generated N molecules in total. N1 is the number of those molecules that meet the condition logP in (-2.5, -2). M is the number of molecules in N that duplicate the training set, and N2 is the number of molecules among the remaining (N-M) that meet the condition.

When calculating the proportion of molecules falling within the desired range for the logP targeting task, which of the following ratios should be used?

When calculating the similarity between generated molecules, which set should be used for this calculation?

Thank you!

tszslovewanpu commented 3 months ago

example

PeterEckmann1 commented 3 months ago

Hi, N1/N would be the correct ratio. We didn't take into account the training set at all when calculating the proportion of molecules in the desired range. And for the similarity calculation, we used the N1 set, because what we're interested in is how diverse the molecules are that meet the condition. Let me know if you have any more questions!
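As a rough sketch of the two answers above (not LIMO's actual evaluation code): the proportion is N1/N over all generated molecules, and the similarity is computed only among the N1 molecules that fall in range. The function names, the open-interval check, and the use of Tanimoto similarity on sets of fingerprint bits are assumptions made for illustration; the logP values and fingerprints are assumed to be precomputed elsewhere.

```python
from itertools import combinations


def in_range(logp, low=-2.5, high=-2.0):
    # Open interval, following the "(-2.5, -2)" notation in the question (assumption).
    return low < logp < high


def success_rate(logps, low=-2.5, high=-2.0):
    """N1 / N: fraction of all generated molecules whose logP is in range.

    The training set is not consulted at all.
    """
    n = len(logps)
    n1 = sum(in_range(x, low, high) for x in logps)
    return n1 / n if n else 0.0


def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint bits (assumed metric)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def mean_similarity_of_hits(fingerprints, logps, low=-2.5, high=-2.0):
    """Mean pairwise similarity, restricted to the N1 molecules in range."""
    hits = [fp for fp, x in zip(fingerprints, logps) if in_range(x, low, high)]
    pairs = list(combinations(hits, 2))
    if not pairs:
        return None  # undefined with fewer than two hits
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy data: four generated molecules, two of them in range.
logps = [-2.3, -2.1, 0.5, -3.0]
fingerprints = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {9}]

rate = success_rate(logps)
similarity = mean_similarity_of_hits(fingerprints, logps)
```

Restricting the similarity to the hits keeps out-of-range molecules from inflating the apparent diversity.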

tszslovewanpu commented 3 months ago

Thank you! Happy Children's Day~

tszslovewanpu commented 3 months ago

Dear Peter: By the way, have you encountered a situation where N contains invalid molecules? Should I include them when calculating the proportion? (I generated some invalid molecules and don't know whether to include them or not. TAT) Thank you~ Best wishes ^v^

PeterEckmann1 commented 3 months ago

Hmm, not sure if I've seen that, since I thought SELFIES always decodes to valid SMILES. Can you send me an example? But in general, I would probably exclude them, since you can't calculate logP on them.

tszslovewanpu commented 3 months ago

Hi, dear Peter: I used a different representation that can be converted to SELFIES and then to SMILES. For example, there are elements like [Br1] and [@H1]. I think it would be fairer to include the invalid molecules in the denominator N, since some methods generate 100% valid molecules while mine does not. Thank you!

PeterEckmann1 commented 3 months ago

Ah okay, that makes sense. Then yes, I think you're right that including them in the denominator would make sense.
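A minimal sketch of the two denominator choices discussed above, assuming an invalid molecule is represented by `None` in place of its logP value (since logP cannot be computed for it). The function name and the `count_invalid` flag are hypothetical, introduced only for illustration.

```python
def success_rate_with_invalid(logps, low=-2.5, high=-2.0, count_invalid=True):
    """Fraction of molecules with logP in the open interval (low, high).

    logps: one entry per generated molecule; None marks an invalid molecule.
    count_invalid=True keeps invalid molecules in the denominator N, so a
    method that produces invalid outputs is penalized for them.
    count_invalid=False divides by the number of valid molecules only.
    """
    valid = [x for x in logps if x is not None]
    n1 = sum(low < x < high for x in valid)
    denominator = len(logps) if count_invalid else len(valid)
    return n1 / denominator if denominator else 0.0


# Toy data: five generated molecules, two invalid, two in range.
logps = [-2.3, None, -2.1, 0.5, None]

strict = success_rate_with_invalid(logps)                        # 2 hits out of 5 generated
lenient = success_rate_with_invalid(logps, count_invalid=False)  # 2 hits out of 3 valid
```

Whichever denominator is used, stating it alongside the result makes comparisons with methods that always produce valid molecules easier to interpret.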