cieplinski-tobiasz / smina-docking-benchmark

MIT License

how do you actually run it #4

Open mycode-bit opened 3 years ago

mycode-bit commented 3 years ago

What I can try is those notebooks, but how can I actually run it?

How do you run docking_baselines/scripts/generate_molecules.py? I looked at the parameters, and it is not clear what to put there. Could you write some examples of what should be done? For example, generate_molecules.py -p 5ht1b -m minimize --dataset toy reports a problem with the model argument.

Could you provide details regarding this? Thanks, Tom

cieplinski-tobiasz commented 3 years ago

Hello, I'll describe the script usage below, but please be aware that the recommended way to use the package is the one described in the notebooks, i.e. using the smina wrapper functions with your model and storing the results in the OptimizedMolecules class. It is the most "user-friendly" way.

This script fine-tunes a model on a predefined dataset (selected by the script's arguments) and runs three experiments, storing the results. The first is plain random sampling from the model's latent space using a Gaussian distribution. The second also takes random samples, but they are optimized using gradient descent before decoding. The third, which is really time-consuming (you can turn it off by commenting out lines 68-71), does 50 steps of gradient descent, decoding the latent vectors to SMILES at each step.
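To make the difference between the three experiments concrete, here is a minimal sketch in plain Python. It is not the code from generate_molecules.py: the score function is a toy quadratic standing in for the model's predicted score, there is no decoder, and all function names (toy_score, sample_latent, optimize_latent) are hypothetical.

```python
import random


def toy_score(z):
    # Hypothetical stand-in for a predicted score; lowest at z = (1, 1, ..., 1).
    return sum((zi - 1.0) ** 2 for zi in z)


def toy_score_grad(z):
    # Analytic gradient of toy_score with respect to z.
    return [2.0 * (zi - 1.0) for zi in z]


def sample_latent(n, dim, rng):
    # Experiment 1: draw n latent vectors from a standard Gaussian.
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def optimize_latent(z, steps=50, lr=0.1, minimize=True):
    # Experiments 2 and 3: move a latent vector along the score gradient.
    # The returned trajectory holds the starting point plus one vector per step.
    sign = -1.0 if minimize else 1.0
    z = list(z)
    trajectory = [list(z)]
    for _ in range(steps):
        grad = toy_score_grad(z)
        z = [zi + sign * lr * gi for zi, gi in zip(z, grad)]
        trajectory.append(list(z))
    return z, trajectory


rng = random.Random(0)
samples = sample_latent(n=3, dim=4, rng=rng)       # experiment 1: decode these directly
z_final, path = optimize_latent(samples[0])        # experiment 2: decode only z_final
print(toy_score(samples[0]), toy_score(z_final))   # experiment 3 would decode every entry of path
```

In this sketch the second experiment would decode only the final vector, while the third would decode every vector in the trajectory, which is why it is so much slower once each decoded molecule also has to be docked.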

generate_molecules.py must be called with a positional argument that selects the model. We supply two models: cvae and gvae. There are also keyword arguments, listed below. The bold ones are required.

-p -- protein the docking score will be calculated for. You can choose from 5ht1b, 5ht2b, acm2, cyp2d6. Default is 5ht1b.
-o -- path of the output directory. The serialized OptimizedMolecules classes and the .mol2 files will be stored there.
-n -- number of molecules to be generated as part of the second experiment. Default is 250.
-m -- maximize or minimize. Whether latent space optimization should maximize or minimize the score. Default is minimize.
-r -- number of molecules to be generated as part of the first experiment. Default is 100.
-f -- number of epochs for model fine-tuning. Default is 5.
--dataset -- dataset to be used for fine-tuning. For every protein we supply four datasets: "default", "gauss", "ndhb" and "repulsion". The default dataset is the dataset for the docking score; gauss is one of the components of the docking score. ndhb and repulsion are also components of the docking score function and are the targets used in the paper. Default is "default", which is the docking score.
--n-cpu -- argument passed to smina: the number of cores to be used during docking. Default is 4.
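The interface described above can be mirrored with a small argparse sketch. This is a reconstruction from the description, not the parser in generate_molecules.py. The dest names are hypothetical, and treating -o as required is an assumption, made because it has no stated default and both examples pass it.

```python
import argparse


def build_parser():
    # Reconstruction of the described command line; names after dest= are made up.
    parser = argparse.ArgumentParser(prog="generate_molecules.py")
    parser.add_argument("model", choices=["cvae", "gvae"],
                        help="positional argument selecting the model")
    parser.add_argument("-p", dest="protein", default="5ht1b",
                        choices=["5ht1b", "5ht2b", "acm2", "cyp2d6"])
    # Assumption: -o is required (the original marking of required options is not visible).
    parser.add_argument("-o", dest="output_dir", required=True)
    parser.add_argument("-n", dest="n_optimized", type=int, default=250)
    parser.add_argument("-m", dest="mode", default="minimize",
                        choices=["minimize", "maximize"])
    parser.add_argument("-r", dest="n_random", type=int, default=100)
    parser.add_argument("-f", dest="fine_tune_epochs", type=int, default=5)
    parser.add_argument("--dataset", default="default",
                        choices=["default", "gauss", "ndhb", "repulsion"])
    parser.add_argument("--n-cpu", dest="n_cpu", type=int, default=4)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["gvae", "-o", "./output_dir"])
    print(args)
```

With a parser shaped like this, the command from the question (generate_molecules.py -p 5ht1b -m minimize --dataset toy) is rejected because the positional model argument is missing; "toy" is also not among the four dataset names listed above.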

So the simplest example could be:

generate_molecules.py gvae -o ./output_dir

which will try to generate molecules with the lowest docking score with respect to the 5HT1B protein using the GVAE model. Another example:

generate_molecules.py cvae -f 10 -p 5ht2b --dataset repulsion -o ./output_dir

will generate molecules with the lowest repulsion with respect to the 5HT2B protein using the CVAE model after fine-tuning it for 10 epochs.

I hope this makes it a bit more clear. Feel free to reach out if you have more questions.

Greetings, Tobiasz

mycode-bit commented 3 years ago

Hello, thanks for your reply. It is working now, but there is another issue: I have a 24-core CPU, and when I run the script, only the terminal where I am running generate_molecules.py stays active; all other terminals and the whole Linux system freeze. After the run finishes, everything is back to normal. I am wondering what is going on during the run (it uses the default of 4 CPUs).

cieplinski-tobiasz commented 3 years ago

It sounds strange... does it also happen when you set --n-cpu to 1?

mycode-bit commented 3 years ago

It is the same. I guess it is probably related to the path of the scripts. When I ran docking_baselines/scripts/generate_molecules.py, it showed ModuleNotFoundError: No module named 'docking_benchmark', so I copied generate_molecules.py to the directory where docking_benchmark is located. Basically, generate_molecules.py, docking_benchmark and docking_baselines are in the same folder. It worked, but showed the problem described above.
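For reference, the ModuleNotFoundError is the usual result of running a script from a subdirectory: Python puts the script's own folder on the import path, not the repository root. A common alternative to copying the script is to add the root to sys.path (or to set the PYTHONPATH environment variable to it). The sketch below assumes the layout implied above, with docking_benchmark and docking_baselines side by side in the repository root. The helper name add_repo_root_to_path is hypothetical, and this only addresses the import error, not the freezing.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(script_path, levels_up=2):
    # For <root>/docking_baselines/scripts/generate_molecules.py:
    #   parents[0] is scripts, parents[1] is docking_baselines, parents[2] is <root>.
    root = Path(script_path).resolve().parents[levels_up]
    root_str = str(root)
    if root_str not in sys.path:
        # Put the root first so that packages inside it can be imported.
        sys.path.insert(0, root_str)
    return root


# Inside the script itself this would be called as add_repo_root_to_path(__file__),
# before the line that imports docking_benchmark.
```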

mycode-bit commented 3 years ago

I got an error when I ran the proteins-and-datasets.ipynb notebook. What could be the problem?

components = {
    'gauss(o=0__w=0.5__c=8)': 0.5,
    'hydrophobic(g=0.5__b=1.5__c=8)': 0.7
}
smiles, scores = datasets.with_linear_combination_score('sabina_gauss1', **components)

KeyError                                  Traceback (most recent call last)
in ()
      3     'hydrophobic(g=0.5__b=1.5__c=8)': 0.7
      4 }
----> 5 smiles, scores = datasets.with_linear_combination_score('sabina_gauss1', **components)

/home/ssb/smima_docking/smina-docking-benchmark-master/docking_benchmark/data/proteins.py in with_linear_combination_score(self, dataset_name, **component_weights)
    139         in the dataset.
    140         """
--> 141         dataset = self._datasets[dataset_name]
    142         csv = pd.read_csv(os.path.join(self.protein.directory, dataset['path']))
    143

KeyError: 'sabina_gauss1'

kudkudak commented 3 years ago

@cieplinski-tobiasz Could you take a look?