Open UnixJunkie opened 8 months ago
Agree, ChemAxon's change of the academic license is inappropriate. I am now considering different options. Previously I tested Dimorphite-DL, pkasolver and ChemAxon, and at first glance ChemAxon performed better in complex cases. However, I would prefer a more systematic study. At the very least it would be reasonable to compare Dimorphite-DL and pkasolver. Unfortunately, this will take time, so it will not be fast, but we will try different alternatives. Actually, this could be an opportunity to develop something new.
It is still possible to disable ChemAxon protonation (`--no_protonation`) and use some external tools to protonate structures before docking.
As another hackish alternative, there is obabel's `-p` option (add hydrogens appropriate for pH).
I implemented an easy way to integrate protonation tools. I also implemented dimorphite as an example in the separate branch `dimorphite-example`.
What I dislike with dimorphite is that it cannot be called directly from Python (e.g. via `run`), because it catches the argparse arguments of the main program. This requires fixes to the dimorphite code. I would prefer dimorphite to be converted into a package that can be installed by pip and run using `Pool` on multiple cores. Ideally, the bug where the dimorphite API catches the main argparser would be fixed as well. If someone helps with that, it will be faster to integrate dimorphite.
@Feriolet, would you be able to help with that?
Yes, these have been some of the pain points of using Dimorphite-DL for me as well. I have tried to run 250k compounds, but `init_db` ran slowly (around 7 hours). Dimorphite-DL may be the cause, but I have not tested it.

It is also annoying that the argparse of Dimorphite is problematic, so it cannot be used as a package; my current solution uses `subprocess`.

I can try to fix the bug in Dimorphite-DL.
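As a quick illustration of why the subprocess workaround sidesteps the argparse clash (a generic sketch, not the actual easydock or dimorphite code): a child interpreter gets its own `sys.argv`, so a parser running inside it never sees the parent program's arguments.

```python
import subprocess
import sys

# A child Python process receives only the arguments we pass it, so an
# argparse parser inside the child cannot accidentally pick up the
# parent's command line.
out = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1:])", "--min_ph", "6.4"],
    capture_output=True, text=True,
).stdout.strip()
print(out)  # the child saw only ['--min_ph', '6.4']
```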
I have found the reason for the argparse error in dimorphite_dl. Dimorphite catches the main program's arguments because of the line `args = vars(parser.parse_args())`, where `.parse_args()` collects the arguments from the terminal by default (i.e. the main run_dock arguments). To override this default and use the desired arguments, we have to pass a list `['--argument', value, '--argument', value]` to `.parse_args(list)`.

Since `**kwargs` returns a dict, I have made a temporary function that converts the dict into a flat list:
```python
def convert_kwargs_to_sys_argv(params: dict) -> list:
    sys_argv_list = []
    for key, value in params.items():
        # Some arguments are store_true flags, so we must not append a 'True'
        # value after the flag itself
        if value is True:
            sys_argv_list.append('--' + key)
        elif value is False:
            continue
        else:
            sys_argv_list.append('--' + key)
            sys_argv_list.append(str(value))
    return sys_argv_list
```
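The effect of passing an explicit list to `parse_args` can be checked in isolation (a minimal sketch with made-up argument names, not dimorphite's real parser):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--min_ph', type=float, default=6.4)
parser.add_argument('--silent', action='store_true')

# parse_args() with no argument reads sys.argv, i.e. the main program's
# options; an explicit list bypasses sys.argv entirely.
args = vars(parser.parse_args(['--min_ph', '5.0', '--silent']))
print(args)  # {'min_ph': 5.0, 'silent': True}
```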
Then, the `main()` function of dimorphite_dl is modified to:
```python
def main(params=None):
    """The main definition run when you call the script from the command line.

    :param params: The parameters to use. Entirely optional. If absent,
        defaults to None, in which case arguments will be taken from
        those given at the command line.
    :type params: dict, optional
    :return: Returns a list of the SMILES strings if the return_as_list
        parameter is True. Otherwise, returns None.
    """

    # parse_args(None) falls back to sys.argv, preserving the original
    # command-line behaviour when no params dict is given
    sys_argv_list = convert_kwargs_to_sys_argv(params) if params is not None else None
    parser = ArgParseFuncs.get_args()
    args = vars(parser.parse_args(sys_argv_list))

    if not args["silent"]:
        print_header()

    # Add in any parameters in params.
    if params is not None:
        for k, v in params.items():
            args[k] = v

    # If being run from the command line, print out all parameters.
    if __name__ == "__main__":
        if not args["silent"]:
            print("\nPARAMETERS:\n")
            for k in sorted(args.keys()):
                print(k.rjust(13) + ": " + str(args[k]))
            print("")

    if args["test"]:
        # Run tests.
        TestFuncs.test()
    else:
        # Run protonation
        if "output_file" in args and args["output_file"] is not None:
            # An output file was specified, so write to that.
            with open(args["output_file"], "w") as file:
                for protonated_smi in Protonate(args):
                    file.write(protonated_smi + "\n")
        elif "return_as_list" in args and args["return_as_list"] is True:
            return list(Protonate(args))
        else:
            # No output file specified. Just print to the screen.
            for protonated_smi in Protonate(args):
                print(protonated_smi)
```
Good first step. Thank you!
I checked the Dimorphite code with parallelization in mind and it may not be the simplest part. The code of the generator class looks quite complex and I do not understand whether this complexity is necessary or not.
Yes, it will take time for me to understand the code, as it uses multiple classes in huge chunks as well.

I noticed that the `Protonate()` class is an iterator with `__next__()` and `__iter__()`. Is it even possible to use multiprocessing for this type of class? Calling `Protonate(args)` may immediately loop over the smi list before we can multiprocess it.

EDIT: nvm, it seems that a `for` loop calls the `next()` function of an iterator. I'll see if I can call multiprocessing within/outside of this class type.
In theory it should be possible, but I do not have experience with such classes. I usually implement this kind of thing in a different way. A very simple example:
```python
with Pool(ncpu) as pool:
    with open(smi_filename) as f:
        for mol_id, protonated_smi in pool.imap_unordered(Protonate, f):
            if protonated_smi:
                update_db(mol_id, protonated_smi)
```
where `Protonate` would take a line and return `mol_id` and `protonated_smi`. `f` is actually a generator, which avoids memory overflow.

There are some examples of generators working with multiprocessing - https://codereview.stackexchange.com/questions/259029/compute-the-outputs-of-a-generator-in-parallel - but those implementations are also simpler.
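A self-contained toy version of this pattern, with a stand-in `protonate` worker instead of the real `Protonate` (the names and the fake "protonation" are illustrative only):

```python
from multiprocessing import Pool

def protonate(line):
    """Stand-in worker: parse one 'SMILES mol_id' line and return (mol_id, result)."""
    smi, mol_id = line.split()
    return mol_id, smi + '(protonated)'

lines = iter(['CCO mol1', 'CCN mol2', 'c1ccccc1O mol3'])  # lazy source, like a file object

with Pool(2) as pool:
    # imap_unordered pulls items from the iterator lazily, so the whole
    # input never has to sit in memory at once
    results = dict(pool.imap_unordered(protonate, lines))
print(results)
```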
I have tried to understand how the iterator class works and I think I have a few ideas on what to do. The dimorphite class needs to take the input filename; the iterator is generated by the `LoadSMIFile()` class and stored in `args["smiles_and_data"]` when `ArgParseFuncs.clean_args()` is called. Because the iterator class relies on the `next(iterator)` function, the input must be an iterator type, not merely an iterable.

The above code probably won't work because `Protonate()` will treat `f` as the `args` input, which is not what we want.

My idea is that we can change how `Protonate()` treats the input by separating the `args["smiles_and_data"]` code from the args, which means we have to add another argument: `Protonate.__init__(self, args, smi_iterator)`. But I don't know if that will break how dimorphite intends to run the code. Another ridiculous way is to split the smi file into `n` separate files (e.g., 250k smi would give 250k separate files), which can easily be multiprocessed without modifying the code too much. But again, generating 250k files is kind of ridiculous.
If we plan to change the code, we can probably make `Protonate()` take an input of the form `iter([iter(smi), iter(smi), ...])`, where each smi represents one smi string, using the following function:

```python
def convert_list_to_iter_in_iter(input_list: list):
    iter_list = [iter(str(element)) for element in input_list]
    return iter(iter_list)
```

Then we can use `pool.imap_unordered(Protonate, iter(iter_list))` so each input is treated as an iterator rather than a list/string.
The idea of multiple input files is very nice, because we will store all these data on disk anyway and it does not matter in how many files.

We can create 2x-3x as many files as the number of CPUs submitted by a user on the command line (argument `-c`), propagate this `ncpu` argument to the `add_protonation` function, and use it to run multiple input files using `Pool` and `subprocess`. Not a perfect solution, but fast and working.

If the speed is too slow, we may consider `dask` for this task, but I hope it will be reasonably good.
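A sketch of the splitting step (the chunking helper below is hypothetical; a real implementation would write each chunk to a temporary .smi file and hand it to a subprocess):

```python
import math

def split_into_chunks(lines, n_chunks):
    """Split input lines into up to n_chunks roughly equal parts,
    one future input file per part."""
    size = math.ceil(len(lines) / n_chunks)
    return [lines[i:i + size] for i in range(0, len(lines), size)]

# e.g. 5 molecules, 2 chunk files -> parts of 3 and 2 molecules
chunks = split_into_chunks(['C', 'CC', 'CCC', 'CCCC', 'CCCCC'], 2)
print(chunks)
```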
You may create a package of dimorphite-dl under your account. Uploading it to PyPI is not necessary; we may specify in the README to install dimorphite directly from your repository.
By installing, do you mean through `git clone` of the repo?

Btw, using 1 CPU to protonate 250k molecules requires 1037 s. So I guess using ncpu would be sufficient if we want to scale up the protonation.
I mean `pip install git+...`. So I suggest making it similar to `easydock`. Then we will be able to import it from easydock and run it directly from Python, even without `subprocess` (since we will solve the issue with command line argument parsing).
Is Dimorphite ready to use in the branch?
Oh okay, but how is it different from using the dimorphite_example currently? I am forking that branch to modify the code to implement the current feature. I will try to make a pull request soon to fix the dimorphite implementation.
@Samuel-gwb, yes, but it uses one core and for large sets this will take some time. 250k molecules ~ 20 min.
@Feriolet, the difference will be that we can completely remove the dimorphite directory and files from the easydock package. Dimorphite will be installed as an individual package and imported similarly to the current import (only the name of the package will change). If you need more details, please ask.
But we do still need to change the dimorphite source code, right? I feel that it is inconvenient for the user to install it as a separate package and then alter it themselves. Or is there something I missed about installing it in a different directory from easydock?

Edit: nvm, I didn't read that separating them will solve the arg parse issue. Will do that!
Yes, the source code should be modified to enable multiprocessing and correct treatment of command line arguments. A separate package will not solve the issue with command line arguments, so this should be explicitly fixed.

Installation as a package is preferable and convenient.
I can't install dimorphite through pip:

```
ERROR: git+https://github.com/durrantlab/dimorphite_dl does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.
```
Yes, of course. Therefore, it should be converted into a Python package to enable installation from pip.
This is quite easy. You may check the structure of the `easydock` repository. You have to add `setup.cfg`, `setup.py`, `LICENSE.txt`, `README.md` and `MANIFEST.in` to the root dir of your fork of the repository, move the other files into a subdirectory named, for example, `dimorphite`, and add a file `__init__.py` to that subdirectory (it may be empty).

In `MANIFEST.in` you should list files which are not Python scripts (*.py) but which should be supplied with the package. In dimorphite this will be the file with patterns (at least at first glance).

Example of such a file - https://github.com/DrrDom/pmapper/blob/master/MANIFEST.in
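For orientation, the resulting repository layout would look roughly like this (a sketch following the suggestion above; the module file names are illustrative, not a prescription):

```
dimorphite_dl/              # root of the fork
├── setup.py
├── setup.cfg
├── LICENSE.txt
├── README.md
├── MANIFEST.in             # lists non-.py files to ship, e.g. the SMARTS pattern file
└── dimorphite/
    ├── __init__.py         # may be empty
    ├── dimorphite_dl.py    # illustrative name for the main module
    └── (pattern file)
```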
Huh, I tried creating the `setup.cfg` and `setup.py` for dimorphite, but it does not create a library in the conda environment (unless you consider an `egg-link` to be one). I made the `dimorphite` directory in my `Desktop` directory, then installed it with `pip install -e .`

It works for now though. I have tried re-running it and it works. I forgot what min and max pH I used for my initial run, so I can't test if the code in my pull request actually affects the docking score. As I can't reproduce the same protonated smi and docking score for my code, maybe you can check if the code I sent in the pull request is good enough.
Also, since we are creating the `setup.cfg` and `setup.py`, are we going to include them in our repository as well, or do we ask the user to do it on their own in the `README` file?
Installing with `pip install -e .` does not transfer files to the conda environment but creates links. This is almost like an ordinary installation, so everything is OK.

You have to create your fork on GitHub with the `dimorphite` code and edit it to fix the problems which interfere with `easydock`. Afterwards you can make a PR to the original repository; if the maintainers accept it, that will be excellent. If not, we can mention in the easydock `README` to install dimorphite with `pip install git+https://github.com/Feriolet/dimorphite_dl`. In the latter case you will become the maintainer of this specific version of dimorphite.

So, no dimorphite files should be included in the easydock repo.
It seems that dimorphite_dl installs successfully with `pip install git+https://github.com/Feriolet/dimorphite_dl`. Great work! Will the official code be updated according to the branches?
Yes, tests were passed successfully. I merged all changes to master, but I'll postpone to release a new version, because there is one critical issue which should be somehow addressed.
I noticed that Dimorphite does not behave well with nitrogen centers and molecules having more than a single protonation center. Below are several test examples. In cases 3 and 4 the nitrogen atom was not protonated as expected.

| smi | smi_protonated |
| --- | --- |
| Oc1ccccn1 | [O-]c1ccccn1 |
| O=c1cccc[nH]1 | O=c1cccc[n-]1 |
| NCC(=O)O | NCC(=O)[O-] |
| C[C@@H]1CCCN(C)C1 | C[C@@H]1CCCN(C)C1 |
Dimorphite was designed to enumerate a list of possible forms at a given pH and it cannot predict the most favorable one. The logic behind was to enumerate all reasonable forms and dock all of them. Since we limit the number of protonated forms to 1, Dimorphite takes the first one from the list and returns it.
Therefore, we have two options:
Alternative solution - pkasolver. Half a year ago I adapted pkasolver to predict the most stable protonation state: I took the form with the pKa closest to 7.4. pkasolver also has some drawbacks and errors in the assignment of protonation states. However, to some extent it can treat molecules with multiple centers (though the publicly available model was trained on single-center molecules only). I have to adapt the previous implementation to the new program architecture and can make a PR to enable discussion of all solutions.

Therefore, in the beginning I mentioned that it would be good to study and compare all available solutions (free or open-source) to choose the more appropriate one(s).
@Samuel-gwb sorry for the late reply but yes the code should have been merged. We will also update the README to include the dimorphite_dl fork as well.
While the latter seems ideal, it may not work well with small molecules with multiple protonation states, as you mentioned. I have tried to protonate the molecule CCCC[C@@H](CN[C@@H](CCCC)C(=O)[N-][C@@H](CCC([NH-])=O)C(=O)[N-][C@@H](CCCNC(N)=[NH2+])C([NH-])=O)[N-]C(=O)[C@@H]([N-]C(=O)[C@@H]([N-]C(C)=O)[C@@H](C)O)[C@@H](C)CC in the past, and dimorphite_dl produces over 128 protonated molecules (the default limit, if I remember correctly), which may unnecessarily and significantly increase computational time and resources.
Yes, it would be great if we can implement multiple solutions in the future if possible. Regarding pkasolver, did you mean https://github.com/mayrf/pkasolver? From a brief look, it seems that we would need to use all possible protonated molecules as the input and return a single molecule with the pKa closest to 7.
Yes, this is the pkasolver repo. It takes a single molecule, enumerates multiple protonation forms with predicted pKa values, and I chose the one closest to 7.4. I'll try to adapt it quickly and make a PR to show the idea and test it. With the new architecture it should be relatively easy to do.
You are right, I missed the issue of too many forms for a single molecule. It may really substantially increase docking time and may not be necessary. This is another con.
Thanks for your quick response @DrrDom @Feriolet, and I appreciate your thinking and efforts very much! I also like easydock very much. Really easy to do a screen :-) Hoping for an efficient solution.
I uploaded the branch `pkasolver2`. It is expected to work; however, I could not install the dependencies for `pkasolver`, therefore I did not test it, but previously almost the same code worked. You may play with it. I also remember some strange behavior.

To run it, just specify `run_dock ... --protonation pkasolver`.
Got it! thanks
In case you are testing the code very soon, do change line 11 of `protonation.py` from `from read_input import read_input` to `from easydock.read_input import read_input` to prevent the import error.

I am unable to test pkasolver because of cudatoolkit at the moment; I will update when I have the chance to do so.
Thank you, fixed! If someone tests it, please report whether it works and how.

I cannot create an environment from the yml-file, and manual installation requires specific versions of packages, because not all recent versions can be installed on my old system.
After downloading pkasolver2 and `conda activate easydock_pka`, install the dependencies:

1. `pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116`
2. `pip install torch-geometric==2.0.1 torch-sparse torch-scatter molvs chembl_webresource_client`
3. `conda install -c conda-forge matplotlib pytest-cov codecov svgutils cairosvg ipython`
4. `python setup.py install`
Then run `run_dock ... --protonation pkasolver -c 1 --sdf`. Errors appear:

```
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/multiprocessing/queues.py", line 367, in get
    return _ForkingPickler.loads(res)
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 120, in rebuild_cuda_tensor
    torch.cuda._lazy_init()
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/site-packages/torch/cuda/__init__.py", line 217, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
The code in `__init__.py` raising the RuntimeError is:

```python
def _lazy_init():
    global _initialized, _queued_calls
    if is_initialized() or hasattr(_tls, 'is_initializing'):
        return
    with _initialization_lock:
        # the above test was GIL protected anyway. The inner test
        # is for when a thread blocked on some other thread which was
        # doing the initialization; when they get the lock, they will
        # find there is nothing left to do.
        if is_initialized():
            return
        # It is important to prevent other threads from entering _lazy_init
        # immediately, while we are still guaranteed to have the GIL, because some
        # of the C calls we make below will release the GIL
        if _is_in_bad_fork():
            raise RuntimeError(
                "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
                "multiprocessing, you must use the 'spawn' start method")
```
Is it possible to carry out pkasolver before run_dock? Something like ensemble.smi --> ensemble_pka.smi, and then `run_dock ensemble_pka.smi ...`

Here, ensemble.smi --> ensemble_pka.smi means that, due to pkasolver:

```
example_1 --> example_1
example_2 --> example_2_01
              example_2_02
              ...
```

This is mainly to avoid involving pkasolver in a sub/multi process, which seems to lead to errors.
Can I ask how many GPUs you have on your machine?
It is possible to protonate your smiles first with a separate script and then run `run_dock` with the `--no_protonation` argument.
2 GPUs. Need specification?
Depending on a GPU might also be annoying. It would be nice if the code can run on a regular laptop, which might not have any decent GPU. Some cluster nodes have very powerful CPUs but no GPU, etc.
Yes, that is true. I have not installed pkasolver because of the CUDA compatibility issue right now, but may I know whether pkasolver uses the CPU or GPU to run? If possible, we can force pkasolver to use the CPU to make our code consistent.
Yes. By default, pkasolver uses the GPU, as indicated by nvidia-smi.
@Samuel-gwb, can you provide me with the full torch and CUDA versions? I tried to use the versions mentioned above but I get a `segmentation fault (core dumped)` error.
The `QueryModel()` class seems to accommodate both GPU and CPU, as can be seen in this code:

```python
if torch.cuda.is_available() == False:  # If only CPU is available
    checkpoint = torch.load(
        f"{base_path}/trained_model_without_epik/best_model_{i}.pt",
        map_location=torch.device("cpu"),
    )
else:
    checkpoint = torch.load(
        f"{base_path}/trained_model_without_epik/best_model_{i}.pt"
    )
```
Please check the attachment
It seems that running on the CPU has a problem. I created a new environment with the CPU version of torch. However, with the previous `run_dock ... --protonation pkasolver` the following errors appear:
File "/home/gwb/miniconda3/envs/easydock_pkaCPU/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( RuntimeError: Error(s) in loading state_dict for GINPairV1: Missing key(s) in state_dict: "GIN_p.convs.0.nn.lins.0.weight", "GIN_p.convs.0.nn.lins.0.bias", "GIN_p.convs.0.nn.lins.1.weight", "GIN_p.convs.0.nn.lins.1.bias", "GIN_p.convs.1.nn.lins.0.weight", "GIN_p.convs.1.nn.lins.0.bias", "GIN_p.convs.1.nn.lins.1.weight", "GIN_p.convs.1.nn.lins.1.bias", "GIN_p.convs.2.nn.lins.0.weight", "GIN_p.convs.2.nn.lins.0.bias", "GIN_p.convs.2.nn.lins.1.weight", "GIN_p.convs.2.nn.lins.1.bias", "GIN_p.convs.3.nn.lins.0.weight", "GIN_p.convs.3.nn.lins.0.bias", "GIN_p.convs.3.nn.lins.1.weight", "GIN_p.convs.3.nn.lins.1.bias", "GIN_d.convs.0.nn.lins.0.weight", "GIN_d.convs.0.nn.lins.0.bias", "GIN_d.convs.0.nn.lins.1.weight", "GIN_d.convs.0.nn.lins.1.bias", "GIN_d.convs.1.nn.lins.0.weight", "GIN_d.convs.1.nn.lins.0.bias", "GIN_d.convs.1.nn.lins.1.weight", "GIN_d.convs.1.nn.lins.1.bias", "GIN_d.convs.2.nn.lins.0.weight", "GIN_d.convs.2.nn.lins.0.bias", "GIN_d.convs.2.nn.lins.1.weight", "GIN_d.convs.2.nn.lins.1.bias", "GIN_d.convs.3.nn.lins.0.weight", "GIN_d.convs.3.nn.lins.0.bias", "GIN_d.convs.3.nn.lins.1.weight", "GIN_d.convs.3.nn.lins.1.bias". 
Unexpected key(s) in state_dict: "GIN_p.lin.weight", "GIN_p.lin.bias", "GIN_p.convs.0.nn.0.weight", "GIN_p.convs.0.nn.0.bias", "GIN_p.convs.0.nn.1.weight", "GIN_p.convs.0.nn.1.bias", "GIN_p.convs.0.nn.1.running_mean", "GIN_p.convs.0.nn.1.running_var", "GIN_p.convs.0.nn.1.num_batches_tracked", "GIN_p.convs.0.nn.3.weight", "GIN_p.convs.0.nn.3.bias", "GIN_p.convs.1.nn.0.weight", "GIN_p.convs.1.nn.0.bias", "GIN_p.convs.1.nn.1.weight", "GIN_p.convs.1.nn.1.bias", "GIN_p.convs.1.nn.1.running_mean", "GIN_p.convs.1.nn.1.running_var", "GIN_p.convs.1.nn.1.num_batches_tracked", "GIN_p.convs.1.nn.3.weight", "GIN_p.convs.1.nn.3.bias", "GIN_p.convs.2.nn.0.weight", "GIN_p.convs.2.nn.0.bias", "GIN_p.convs.2.nn.1.weight", "GIN_p.convs.2.nn.1.bias", "GIN_p.convs.2.nn.1.running_mean", "GIN_p.convs.2.nn.1.running_var", "GIN_p.convs.2.nn.1.num_batches_tracked", "GIN_p.convs.2.nn.3.weight", "GIN_p.convs.2.nn.3.bias", "GIN_p.convs.3.nn.0.weight", "GIN_p.convs.3.nn.0.bias", "GIN_p.convs.3.nn.1.weight", "GIN_p.convs.3.nn.1.bias", "GIN_p.convs.3.nn.1.running_mean", "GIN_p.convs.3.nn.1.running_var", "GIN_p.convs.3.nn.1.num_batches_tracked", "GIN_p.convs.3.nn.3.weight", "GIN_p.convs.3.nn.3.bias", "GIN_d.lin.weight", "GIN_d.lin.bias", "GIN_d.convs.0.nn.0.weight", "GIN_d.convs.0.nn.0.bias", "GIN_d.convs.0.nn.1.weight", "GIN_d.convs.0.nn.1.bias", "GIN_d.convs.0.nn.1.running_mean", "GIN_d.convs.0.nn.1.running_var", "GIN_d.convs.0.nn.1.num_batches_tracked", "GIN_d.convs.0.nn.3.weight", "GIN_d.convs.0.nn.3.bias", "GIN_d.convs.1.nn.0.weight", "GIN_d.convs.1.nn.0.bias", "GIN_d.convs.1.nn.1.weight", "GIN_d.convs.1.nn.1.bias", "GIN_d.convs.1.nn.1.running_mean", "GIN_d.convs.1.nn.1.running_var", "GIN_d.convs.1.nn.1.num_batches_tracked", "GIN_d.convs.1.nn.3.weight", "GIN_d.convs.1.nn.3.bias", "GIN_d.convs.2.nn.0.weight", "GIN_d.convs.2.nn.0.bias", "GIN_d.convs.2.nn.1.weight", "GIN_d.convs.2.nn.1.bias", "GIN_d.convs.2.nn.1.running_mean", "GIN_d.convs.2.nn.1.running_var", 
"GIN_d.convs.2.nn.1.num_batches_tracked", "GIN_d.convs.2.nn.3.weight", "GIN_d.convs.2.nn.3.bias", "GIN_d.convs.3.nn.0.weight", "GIN_d.convs.3.nn.0.bias", "GIN_d.convs.3.nn.1.weight", "GIN_d.convs.3.nn.1.bias", "GIN_d.convs.3.nn.1.running_mean", "GIN_d.convs.3.nn.1.running_var", "GIN_d.convs.3.nn.1.num_batches_tracked", "GIN_d.convs.3.nn.3.weight", "GIN_d.convs.3.nn.3.bias".
It seems that this will be quite troublesome.
I have finally solved the seg fault error by using a different version of torch, installed with:

```
conda install pytorch==1.11.0 pytorch-cuda=11.7 pytorch_scatter pytorch_sparse pytorch_cluster pytorch_spline_conv pyg -c pytorch -c nvidia -c pyg
```

The `RuntimeError: Error(s) in loading state_dict` error mentioned by @Samuel-gwb can be solved by passing the `strict=False` option to the `.load_state_dict()` call in the `query.py` file. However, I am not sure if this will affect the result of the protonation. Note that I only allow "cpu" and block the "cuda" device, so that only the CPU is used to protonate the smi.
```python
class QueryModel:
    def __init__(self):
        self.models = []

        for i in range(25):
            model_name, model_class = "GINPair", GINPairV1
            model = model_class(
                num_node_features, num_edge_features, hidden_channels=96
            )
            base_path = path.dirname(__file__)
            # if torch.cuda.is_available() == False:  # If only CPU is available
            checkpoint = torch.load(
                f"{base_path}/trained_model_without_epik/best_model_{i}.pt",
                map_location=torch.device("cpu"),
            )
            # else:
            #     checkpoint = torch.load(
            #         f"{base_path}/trained_model_without_epik/best_model_{i}.pt"
            #     )
            model.load_state_dict(checkpoint["model_state_dict"], strict=False)
            model.eval()
            model.to(device=DEVICE)
            self.models.append(model)
```
I have found that using `Pool` for pkasolver can cause a multiprocessing error with CUDA, since pkasolver selects the GPU or CPU with `torch.device()`. The error is `RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method`. This can be solved by using `torch.multiprocessing.set_start_method()` to select the `spawn` method.
```python
from torch import multiprocessing as mp

def protonate_pkasolver(input_fname: str, output_fname: str, ncpu: int = 1):
    mp.set_start_method('spawn', force=True)
    from pkasolver.query import QueryModel
    model = QueryModel()
    pool = mp.Pool(ncpu)
    with open(output_fname, 'wt') as f:
        for smi, name in pool.imap_unordered(
                __protonate_pkasolver,
                ((mol, mol_name, model) for (mol, mol_name) in read_input(input_fname))):
            f.write(f'{smi}\t{name}\n')
```
However, other errors are raised after calling `torch.multiprocessing.Pool`, which may be due to memory allocation:

```
[W CudaIPCTypes.cpp:92] Producer process tried to deallocate over 1000 memory blocks referred by consumer processes. Deallocation might be significantly slowed down. We assume it will never going to be the case, but if it is, please file but to https://github.com/pytorch/pytorch
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```

I am not sure how to deal with this.
Another funny thing is that the `dimorphite_dl` inside `pkasolver` does not seem to store the output of the protonated smiles. When running `run_dock`, it gives the following output:

```
CCCCC
<rdkit.Chem.rdchem.Mol object at 0x7fab1e8c6a50>

For help, use: python dimorphite_dl.py --help

If you use Dimorphite-DL in your research, please cite:
Ropp PJ, Kaminsky JC, Yablonski S, Durrant JD (2019) Dimorphite-DL: An
open-source program for enumerating the ionization states of drug-like small
molecules. J Cheminform 11:14. doi:10.1186/s13321-019-0336-9.

CCCCCCC
<rdkit.Chem.rdchem.Mol object at 0x7f6d1d82fa50>

For help, use: python dimorphite_dl.py --help

If you use Dimorphite-DL in your research, please cite:
Ropp PJ, Kaminsky JC, Yablonski S, Durrant JD (2019) Dimorphite-DL: An
open-source program for enumerating the ionization states of drug-like small
molecules. J Cheminform 11:14. doi:10.1186/s13321-019-0336-9.
```
We may need to modify and fix these errors if we still want to pursue pkasolver as our alternative
Finally, I found an old environment with pkasolver installed. Since pkasolver uses dimorphite as the backend for enumerating protonation states (and includes it within the package as is), it has the same issue with command line arguments: dimorphite catches all input arguments of the main program. So the `pkasolver` package should be modified, either to use the external custom `dimorphite` package or by fixing this issue inside the `pkasolver` package itself. I'll try the first option and modify the `pkasolver` package. This will result in a custom pkasolver version, which should be installed from my repo.
It is strange that this error did not occur for you. I do not have a GPU on my laptop, so I cannot test how it works with a GPU. Before, I did not have errors due to multiprocessing, but maybe just because I do not have a GPU.
I agree that switching to CPU mode for `pkasolver` is a good decision. We may further customize the `pkasolver` package to give the user control over this, for example by adding an argument to `QueryModel.__init__`.
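The shape of that change could look like the following torch-free sketch (the `device` parameter and its validation are hypothetical, not pkasolver's actual API):

```python
class QueryModel:
    """Sketch: let the caller pick the device instead of auto-detecting CUDA."""

    def __init__(self, device: str = 'cpu'):
        if device not in ('cpu', 'cuda'):
            raise ValueError(f'unsupported device: {device}')
        self.device = device
        # the real __init__ would then pass map_location=torch.device(self.device)
        # to torch.load() and move each model with model.to(self.device)

model = QueryModel()
print(model.device)  # cpu
```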
Yes, I guess it can be done.
I have finally installed the CPU-only pytorch, and it has issues similar to the GPU version. Using torch-2.2.0 (the latest version) is sufficient since there is no CUDA limitation: https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html

Apparently, the memory allocation issue does not occur on the CPU, which is a good thing.
I forgot to mention that there is also this issue, which is odd; maybe the pkasolver code does not catch the dimorphite output:
```
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/miniforge3/envs/easydock/bin/run_dock", line 8, in <module>
    sys.exit(main())
  File "/home/user/miniforge3/envs/easydock/lib/python3.9/site-packages/easydock/run_dock.py", line 207, in main
    add_protonation(args.output, program=args.protonation, tautomerize=not args.no_tautomerization, ncpu=args.ncpu)
  File "/home/user/miniforge3/envs/easydock/lib/python3.9/site-packages/easydock/database.py", line 348, in add_protonation
    protonate_func(input_fname=tmp.name, output_fname=output)
  File "/home/user/miniforge3/envs/easydock/lib/python3.9/site-packages/easydock/protonation.py", line 107, in protonate_pkasolver
    for smi, name in pool.imap_unordered(__protonate_pkasolver,
  File "/home/user/miniforge3/envs/easydock/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
TypeError: '<' not supported between instances of 'Mol' and 'Mol'
```
Btw, I just realised that the only difference between these two sets of keys is the "lins" part. We probably should not immediately use `strict=False`, since it would then ignore most of the keys.
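This kind of missing/unexpected key mismatch usually comes from loading a checkpoint with a different torch_geometric version than it was trained with (the internal layout of GINConv's `nn` module changed between releases), so pinning the library to the training version is likely safer than remapping keys. A small diff of the two key sets makes the pattern visible; the keys below are shortened examples from the traceback:

```python
def diff_keys(model_keys, ckpt_keys):
    # Keys the model expects but the checkpoint lacks, and vice versa
    missing = sorted(set(model_keys) - set(ckpt_keys))
    unexpected = sorted(set(ckpt_keys) - set(model_keys))
    return missing, unexpected

model_keys = ["GIN_p.convs.0.nn.lins.0.weight", "GIN_p.convs.0.nn.lins.1.weight"]
ckpt_keys = ["GIN_p.convs.0.nn.0.weight",
             "GIN_p.convs.0.nn.1.running_mean",
             "GIN_p.convs.0.nn.3.weight"]
missing, unexpected = diff_keys(model_keys, ckpt_keys)
print(missing)      # expected-but-absent "lins" keys
print(unexpected)   # old-style Sequential keys, incl. BatchNorm running stats
```

Note that the old checkpoint also carries BatchNorm running statistics with no counterpart in the new layout, so a pure key rename would silently change model behaviour.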
OK, I got why I had different issues: I previously modified `dimorphite` within `pkasolver` to solve the issue below.
Originally, `pkasolver` runs `dimorphite` via `subprocess`. This implementation is not compatible with `multiprocessing`, because every molecule is processed separately and the result is saved to a file with the same name, `test.pkl`, for all molecules. Therefore, it can be overwritten or corrupted at any time. I tried to use the Python interface of `dimorphite`, but there is an issue with command line arguments, which I did not observe previously for some reason.
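The shared-filename collision described above can be avoided by giving each task its own temporary file. A minimal sketch under that assumption; the worker below is a placeholder, not pkasolver's actual code:

```python
import os
import pickle
import tempfile
from multiprocessing import Pool

def protonate_one(smiles):
    # Placeholder worker: each call writes its result to a unique temp file
    # instead of a shared "test.pkl", so parallel workers cannot clobber
    # each other's output.
    fd, path = tempfile.mkstemp(suffix=".pkl")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump({"input": smiles, "output": smiles}, f)
        with open(path, "rb") as f:
            return pickle.load(f)["output"]
    finally:
        os.unlink(path)

if __name__ == "__main__":
    with Pool(2) as pool:
        results = sorted(pool.imap_unordered(protonate_one, ["CCO", "c1ccccc1"]))
    print(results)  # ['CCO', 'c1ccccc1']
```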
Trying to import the necessary function from the custom `Feriolet/dimorphite` package does not work either. There is an argument `return_as_list` which is not available in the command line args but is used by the authors of dimorphite.
Finally:
- `Feriolet/dimorphite` was patched to fix parsing of command line arguments. This is an ugly solution, but it works for me.
- `DrrDom/pkasolver` branch `noprints`, where I replaced the built-in dimorphite with `Feriolet/dimorphite`.
- `easydock` branch `pkasolver2`, where I disabled interception of stderr/stdout (there was an issue with the previous solution which I could not solve quickly).

Current installation:
```
pip install git+https://github.com/Feriolet/dimorphite_dl.git  # after the PR is accepted
pip install git+https://github.com/DrrDom/pkasolver.git@noprints
pip install git+https://github.com/ci-lab-cz/easydock.git@pkasolver2
```
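The root cause on the dimorphite side, a module-level `args = vars(parser.parse_args())` that reads the importing program's `sys.argv`, can be fixed by accepting an explicit argument list. A minimal sketch of that pattern; the parser options below are illustrative, not dimorphite's actual interface:

```python
import argparse

def build_parser():
    # Hypothetical subset of protonation options, for illustration only
    parser = argparse.ArgumentParser(description="protonation options")
    parser.add_argument("--min_ph", type=float, default=6.4)
    parser.add_argument("--max_ph", type=float, default=8.4)
    return parser

def run(argv=None):
    # Library callers pass an explicit list, so the main program's
    # sys.argv is never touched; an empty list yields the defaults.
    args = vars(build_parser().parse_args([] if argv is None else argv))
    return args

print(run(["--min_ph", "7.4"]))  # {'min_ph': 7.4, 'max_ph': 8.4}
```

With this pattern the module can be imported and called from easydock's own CLI without argparse intercepting easydock's arguments.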
Now it works for me. However, there are still errors in structures.
Example:
input: `O=C(N1CCN(CC1)C(=O)C=2C=CC=C(C#CC3CC3)C2)C=4NN=C5CCCC45`
output: `O=C(c1cccc(C#CC2CC2)c1)N1CC[NH](C(=O)c2[nH]nc3c2CCC3)CC1`
It adds a hydrogen to a nitrogen in the piperazine ring but does not add a charge. RDKit cannot parse this molecule because it is invalid (the nitrogen would be a radical). Moreover, this protonation is definitely wrong, because this nitrogen cannot be protonated at pH 7.4. Such molecules are skipped from docking.
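One way to skip such broken outputs automatically is to rely on RDKit's sanitization: `Chem.MolFromSmiles` returns `None` for the uncharged `[NH]` with four bonds. A minimal sketch, with the filter function being illustrative rather than easydock's actual code:

```python
from rdkit import Chem

def keep_valid(smiles_list):
    # Drop any SMILES that RDKit cannot parse/sanitize (returns None),
    # e.g. a neutral ring nitrogen with four bonds as in the example above.
    return [smi for smi in smiles_list if Chem.MolFromSmiles(smi) is not None]

bad = "O=C(c1cccc(C#CC2CC2)c1)N1CC[NH](C(=O)c2[nH]nc3c2CCC3)CC1"
print(keep_valid([bad, "CCO"]))  # only the valid SMILES survives
```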
Please check whether this implementation is compatible with computers with GPUs, or whether there are still issues to be solved.
maybe switch to Dimorphite-DL: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0336-9
I am in academia and I don't even have a chemaxon license anymore... Software vendors always change their license terms one day or another...