PikaQiu520521 opened 1 year ago
Hello, thanks for your questions.
In the test.py script you have a number of post-processing options (see here). You can, for example, relax the generated molecules in a force field and remove disconnected fragments. You will find the final molecules in the processed/ folder. However, we also save the same molecules without any post-processing applied (apart from adding bonds, which does not change the atoms' chemical types or coordinates) in the raw/ folder. These can be used for different kinds of analyses later on or to explore alternative post-processing options.
We also measure the time it takes to generate ligands for each test set pocket. These measurements are stored in the pocket_times/ directory. I hope this answers your questions.
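As an aside, the "remove disconnected fragments" step mentioned above can be sketched in plain Python as keeping the largest connected component of the bond graph. This is an illustration of the idea only, not DiffSBDD's actual implementation; the molecule representation (an atom count plus a list of bond index pairs) is a made-up simplification.

```python
# Illustration only: "removing disconnected fragments" can be done by keeping
# the largest connected component of the bond graph. The representation here
# (atom count + bond index pairs) is a simplification, not DiffSBDD's API.
from collections import defaultdict, deque

def largest_fragment(n_atoms, bonds):
    """Return sorted atom indices of the largest connected component.

    n_atoms: number of atoms; bonds: list of (i, j) index pairs.
    """
    adj = defaultdict(list)
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    seen, best = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        # breadth-first search collects one connected component
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            a = queue.popleft()
            comp.append(a)
            for b in adj[a]:
                if b not in seen:
                    seen.add(b)
                    queue.append(b)
        if len(comp) > len(best):
            best = comp
    return sorted(best)

# A 3-atom chain plus a detached 2-atom fragment: atoms 0-2 are kept.
print(largest_fragment(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 1, 2]
```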
Hi, thank you very much for your reply. The first answer resolved my question very well, but I would like to understand the second point in more detail, if that is convenient. Comparing the files in raw/ with those in processed/, it seems that processed/ only keeps the first record of each result in the raw/ folder. I haven't looked into the processing step yet, so I am confused by this. When visualizing the data, I found that the points are very discrete, and when I post-process them with Open Babel, I cannot generate new molecules. I don't know where the problem is; could you help?
Hello, what result did you get for the 'Diversity' metric when you evaluated? Whether I read molecules from "raw" or "processed", my result is 0, and the SA value is much smaller than the results in the paper.
The following are the results of evaluating the molecules in raw:
QED: 0.495 ± 0.08
SA: 0.234 ± 0.03
LogP: 0.021 ± 1.10
Lipinski: 4.883 ± 0.34
Diversity: 0.000 ± 0.00
Hello @pearl-rabbit, the diversity is usually zero when only a single molecule is generated for each protein pocket because we compute this value per target (and afterwards mean and standard deviation across all targets). How many molecules did you generate? Also, which model did you use? Did you train one yourself or did you use one of the provided checkpoints?
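The per-target computation described above can be sketched as follows. This is a hedged illustration of the idea, not the repository's exact code: fingerprints are modeled as plain Python sets with Jaccard distance, whereas DiffSBDD uses RDKit fingerprints and Tanimoto similarity. The key point it demonstrates is that a pocket with a single molecule has no pairs, so its diversity is 0.

```python
# Sketch of per-pocket diversity (not the repository's exact code): average
# pairwise dissimilarity within each pocket, then mean across pockets.
# Fingerprints are modeled as plain sets; DiffSBDD uses RDKit Tanimoto
# similarity on molecular fingerprints instead.
from itertools import combinations

def pocket_diversity(fingerprints):
    """Average pairwise Jaccard distance within one pocket."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:  # a single molecule has no pairs -> diversity 0
        return 0.0
    dists = [1 - len(a & b) / len(a | b) for a, b in pairs]
    return sum(dists) / len(dists)

def mean_diversity(pockets):
    """Mean of the per-pocket diversity values."""
    return sum(pocket_diversity(p) for p in pockets) / len(pockets)

# One molecule per pocket: every pocket scores 0, so the mean is 0.
one_mol_per_pocket = [[{1, 2, 3}], [{4, 5}]]
print(mean_diversity(one_mol_per_pocket))  # 0.0
```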
Hi @arneschneuing, thank you for your timely response. I set n_samples=100 and batch_size=60 in test.py. If I want to generate multiple molecules for a protein pocket, should I set the n_samples value higher? I use a retrained all-atom model; except for the cutoff value, which I set to 1, all other parameters are unchanged. The dataset is the CrossDocked benchmark.
Hi, n_samples=100 should be fine. It means that 100 valid molecules are generated per pocket. How did you provide those molecules to the evaluate() function? Did you create a nested list?
A cutoff value of 1 [Å] seems very low. It is less than a typical bond length. Maybe you should consider a higher threshold to create sufficient edges.
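To make the bond-length argument concrete, here is a small sketch of a distance-cutoff graph. The coordinates are made up for the example and the function is an illustration, not DiffSBDD's graph-construction code: with atoms spaced at a typical C–C bond length of about 1.5 Å, a 1 Å cutoff produces no edges at all, while a larger cutoff connects the chain.

```python
# Illustration of why a 1 Å cutoff is problematic: edges are created between
# atom pairs closer than the cutoff, and a typical C-C bond is about 1.5 Å.
# Coordinates below are made up for the example.
import math
from itertools import combinations

def radius_graph(coords, cutoff):
    """Return index pairs (i, j) with Euclidean distance < cutoff."""
    edges = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) < cutoff:
            edges.append((i, j))
    return edges

# Three atoms roughly 1.5 Å apart, like a small carbon chain.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(radius_graph(coords, 1.0))  # [] -> no edges at all
print(radius_graph(coords, 3.0))  # [(0, 1), (1, 2)] -> the chain is connected
```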
@PikaQiu520521, I'm sorry, but I don't really understand the question. What do you mean by 'discrete points'? Could you provide an example?
I will try training again. Is a cutoff of 3 appropriate (my server may not be able to handle larger cutoff values)? This is the code for reading the molecules:
from analysis.metrics import MoleculeProperties
from rdkit import Chem
import os

filePath = 'outdir/raw'
filenames = os.listdir(filePath)
pocket_list = []
for filename in filenames:
    # os.path.join avoids a missing '/' between directory and file name
    suppl = Chem.SDMolSupplier(os.path.join(filePath, filename), sanitize=False)
    mols = [mol for mol in suppl if mol is not None]
    pocket_list.append(mols)
mol_metrics = MoleculeProperties()
all_qed, all_sa, all_logp, all_lipinski, per_pocket_diversity = mol_metrics.evaluate(pocket_list)
'filePath' is the path to the test set sampling results, containing 100 sdf files, each with the coordinates and atom types of the molecules. (Due to some issues with the server where the files are stored, I am unable to provide an sdf example here for now.)
A cutoff of 3.0 still seems rather low but it might work because information is also propagated through several layers of message passing. However, I haven't tried this value myself yet and can therefore not say if the scores will be similar to the ones from the paper.
Your code for reading molecules looks fine to me. Could you please check how many items your pocket_list and each list within pocket_list contain before you pass them to evaluate()?
print(len(pocket_list))
print([len(x) for x in pocket_list])
The output is:
100
[120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120]
If there is no problem, it may be due to the value of 'cutoff'.
It's hard to tell. The cutoff value could cause your molecules to be less realistic, but I'm still surprised by the 0.0 diversity value. Have you visually inspected some of the generated molecules? Do they look more or less like molecules (several atoms connected by bonds) or is there some obvious failure mode (e.g. it always outputs disconnected point clouds)?
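One quick way to screen for the "disconnected point cloud" failure mode without visualizing every file is to read the counts line of each SDF record: in the V2000 format, the fourth line of a record gives the atom and bond counts as two 3-character fields, so zero bonds means the record is just a point cloud. This is a hedged sketch independent of the repository's code, and the record below is a minimal made-up example.

```python
# Hedged sketch: spot "no bonds" SDF records by reading the V2000 counts line.
# In a V2000 SDF record, line 4 is the counts line; its first two 3-character
# fields are the numbers of atoms and bonds. Zero bonds -> a bare point cloud.

def sdf_counts(sdf_text):
    """Return (n_atoms, n_bonds) from the first record of a V2000 SDF string."""
    counts_line = sdf_text.splitlines()[3]
    return int(counts_line[0:3]), int(counts_line[3:6])

# Minimal made-up record: 2 atoms, 0 bonds (a disconnected point cloud).
record = "\n".join([
    "mol_0", "", "",  # title line + two comment lines
    "  2  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0",
    "    1.5000    0.0000    0.0000 C   0  0",
    "M  END",
])
n_atoms, n_bonds = sdf_counts(record)
print(n_atoms, n_bonds)  # 2 0
```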
I didn't carefully analyze the reason; I just replaced the original diversity calculation with the code from the commented-out section (https://github.com/arneschneuing/DiffSBDD/blob/main/analysis/metrics.py#L196) and obtained a non-zero value. I checked the sdf files and found that the generated molecules have no edges. I also retrained the model and obtained reasonable results when evaluating with the original code.
By the way, in the paper the ca_only model doesn't have a cutoff set, but that is too big for my hardware, so I set a limit on the cutoff and achieved effective results. Does this value only affect the number of edges when building the graph?
Yes, it determines which atoms (nodes) are connected in the graph that the neural network processes.
Hello, thank you for your previous answers. Unfortunately, I was not able to fully replicate the process, but I used the same calculation method to evaluate the ligands generated by the model. I have two more questions now. 1. The generated ligand files have "raw" and "processed" versions, as well as another folder containing paths and scores. I am not sure what they specifically mean; could you provide some explanation? 2. The generated ligands do not have bonds between atoms, so they are actually discrete points. What do you think about this issue?