UnixJunkie opened this issue 8 months ago
I have not tested CPU, but the result from GPU seems to almost match the SMILES that you tested. I had to add `strict=False` in the `QueryModel` for it to work (not sure why this does not occur on your side).
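For context, the change amounts to passing `strict=False` to `load_state_dict` in the checkpoint-loading step; a minimal sketch (the `load_checkpoint` helper is just for illustration, not pkasolver code):

```python
import torch

def load_checkpoint(model: torch.nn.Module, checkpoint_path: str) -> torch.nn.Module:
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # strict=False tolerates missing/unexpected keys between the checkpoint and the model definition
    model.load_state_dict(checkpoint["model_state_dict"], strict=False)
    model.eval()
    return model
```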
I also had to modify some of the script to make the subprocessing work:
```python
def protonate_pkasolver(input_fname: str, output_fname: str, ncpu: int = 1):
    from pkasolver.query import QueryModel
    from torch import multiprocessing as mp
    mp.set_start_method('spawn', force=True)
    model = QueryModel()
    pool = mp.Pool(ncpu)
    with open(output_fname, 'wt') as f:
        # pool.close() and pool.join() are used to suppress the
        # "Producer process has been terminated before all shared CUDA tensors released." warning
        pkasolver_output = pool.imap_unordered(partial(__protonate_pkasolver, model=model), read_input(input_fname))
        pool.close()
        pool.join()
        for smi, name in pkasolver_output:
            f.write(f'{smi}\t{name}\n')
```
Note that I use the `mp.Pool(ncpu)` argument to access the GPU memory, which is not a good idea since it can lead to an out-of-memory error if we use a high number of `ncpu`. If we still want to use GPU for this, we probably need to add an `ngpu` argument for this function (and its parent function), as sketched below.
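A rough sketch of how a hypothetical `ngpu` argument could cap GPU usage by assigning pool workers to devices round-robin (the `_init_worker`/`make_gpu_pool` helpers are illustrative, not part of easydock, and they rely on the pool's private worker identity counter):

```python
import torch
from torch import multiprocessing as mp

def _init_worker(ngpu: int):
    # pool workers get a 1-based identity; map them onto the available GPUs round-robin
    worker_id = mp.current_process()._identity[0]
    torch.cuda.set_device((worker_id - 1) % ngpu)

def make_gpu_pool(ncpu: int, ngpu: int):
    mp.set_start_method('spawn', force=True)
    return mp.Pool(ncpu, initializer=_init_worker, initargs=(ngpu,))
```

Each worker would still load its own copy of the model ensemble, so memory use still grows with `ncpu`; this only spreads it across the available GPUs.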
Here is the result of the smi test:
I also tried to protonate 100 SMILES used for my previous work (using `--protonation dimorphite`), and I notice that using `--protonation pkasolver` tends to deprotonate the SMILES. I am not sure which one is correct.
input: COC(=O)c1nc2[nH]ccc2cc1-c1cccc(C(C)C)c1
dimo : COC(=O)c1nc2[nH]ccc2cc1-c1cccc(C(C)C)c1
pkas : COC(=O)c1nc2[n-]ccc2cc1-c1cccc(C(C)C)c1
input: CC(C)c1cccc(-c2ncc(F)c3[nH]ccc23)c1
dimo : CC(C)c1cccc(-c2ncc(F)c3[nH]ccc23)c1
pkas : CC(C)c1cccc(-c2ncc(F)c3[n-]ccc23)c1
For CPU, I can reproduce the same result as the GPU. However, the CPU gives this error:
File "/home/user/miniforge3/envs/easydock/lib/python3.9/multiprocessing/pool.py", line 268, in __del__
File "/home/user/miniforge3/envs/easydock/lib/python3.9/multiprocessing/queues.py", line 371, in put
AttributeError: 'NoneType' object has no attribute 'dumps'
I resolved this issue by using `pool.close()` and `pool.join()` similar to the GPU version (using the default multiprocessing instead of the torch version).
```python
def protonate_pkasolver(input_fname: str, output_fname: str, ncpu: int = 1):
    from pkasolver.query import QueryModel
    model = QueryModel()
    pool = Pool(ncpu)
    with open(output_fname, 'wt') as f:
        pkasolver_output = pool.imap_unordered(partial(__protonate_pkasolver, model=model), read_input(input_fname))
        pool.close()
        pool.join()
        for smi, name in pkasolver_output:
            f.write(f'{smi}\t{name}\n')
```
I also needed to use `strict=False` for this.
By the way, some of the molecules print this while being protonated by the pkasolver. Is this something that we need to silence?
#########################
Could not identify any ionizable group. Aborting.
#########################
It seems that the calculation speed on CPU is sufficient. If so, we may force the use of CPUs in `pkasolver` to avoid implementation issues with the GPU, for example by adding a `force_cpu` argument and calling `model = QueryModel(force_cpu=True)`:
```python
class QueryModel:
    def __init__(self, force_cpu=False):
        self.models = []
        # when CPU is forced, keep everything on CPU regardless of DEVICE
        device = torch.device("cpu") if force_cpu else DEVICE

        for i in range(25):
            model_name, model_class = "GINPair", GINPairV1
            model = model_class(
                num_node_features, num_edge_features, hidden_channels=96
            )
            base_path = path.dirname(__file__)
            if force_cpu or not torch.cuda.is_available():  # load on CPU if forced or if no GPU is available
                checkpoint = torch.load(
                    f"{base_path}/trained_model_without_epik/best_model_{i}.pt",
                    map_location=torch.device("cpu"),
                )
            else:
                checkpoint = torch.load(
                    f"{base_path}/trained_model_without_epik/best_model_{i}.pt"
                )
            model.load_state_dict(checkpoint["model_state_dict"])
            model.eval()
            model.to(device=device)
            self.models.append(model)
```
I do not know why you need `strict=False` for CPU mode. For me, both options work, so we may add it if necessary. However, I do not fully understand what issues it may cause in the future.
Introducing an `ngpu` argument will complicate the interface, because we will need to pass it through all function calls, and it will require a separate implementation. If this does not bring a substantial speed advantage, I would suggest avoiding it.
The errors caused by `multiprocessing` and fixing them with `pool.join()` and `pool.close()` are unexpected. I have never met such issues. Below I attached my env configuration. Maybe you use a more recent Python or module version where some changes were made.
The messages from pkasolver should be suppressed. I did it with a `nostd` context, but now it raises an error and I did not have time to fix it. You may uncomment that line and test it, or you may suggest another solution. As far as I remember, my solution intercepted only a particular stderr/stdout, not globally.
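For what it's worth, a minimal sketch of such local suppression with contextlib (assuming the pkasolver prints go to Python-level stdout/stderr rather than the C level; the `suppress_std` name is just illustrative):

```python
import contextlib
import io

@contextlib.contextmanager
def suppress_std():
    # swallow Python-level stdout/stderr only for the enclosed block
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf), contextlib.redirect_stderr(buf):
        yield
```

This could then wrap only the pkasolver call instead of redirecting output globally.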
I was more surprised by your output of protonated SMILES. Below is my output for the same structures, and I generally agree with it. There were a lot of differences. Did you use the latest easydock version from the `noprints` branch?
I finally can run it without any of those issues. I have reinstalled everything from scratch and the problem goes away. I guess there were conflicting torch versions that I installed which messed up the other packages.
For future reference:
```
conda create -n easydock -c conda-forge python=3.9 numpy=1.20 rdkit scipy dask distributed
conda activate easydock
pip install paramiko meeko vina
pip install git+https://github.com/Feriolet/dimorphite_dl.git
pip install git+https://github.com/DrrDom/pkasolver.git@noprints
pip install git+https://github.com/ci-lab-cz/easydock.git@pkasolver2
pip install torch==1.13.1+cpu --extra-index-url https://download.pytorch.org/whl/cpu
pip install torch-geometric==2.0.1
pip install torch_scatter==2.1.1+pt113cpu -f https://data.pyg.org/whl/torch-1.13.1%2Bcpu.html
pip install torch_sparse==0.6.17+pt113cpu -f https://data.pyg.org/whl/torch-1.13.1%2Bcpu.html
pip install torch_spline_conv==1.2.2+pt113cpu -f https://data.pyg.org/whl/torch-1.13.1%2Bcpu.html
pip install molvs chembl_webresource_client matplotlib pytest-cov codecov svgutils cairosvg ipython
```
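A quick sanity check after installation might be something like this (just a suggestion, not part of easydock):

```python
import torch
from pkasolver.query import QueryModel

# for the CPU-only install above we expect torch 1.13.1+cpu and CUDA unavailable
print(torch.__version__, "| CUDA available:", torch.cuda.is_available())
model = QueryModel()  # should load the 25-model ensemble without errors
```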
Yes, I agree that we should use CPU for convenience and consistency with the other protonation software (dimorphite_dl and chemaxon). To silence the protonation output, I have added the contextlib package to the `add_protonation()` function inside the `database.py` file:
```python
if program == 'chemaxon':
    protonate_func = partial(protonate_chemaxon, tautomerize=tautomerize)
    read_func = read_protonate_chemaxon
elif program == 'dimorphite':
    protonate_func = partial(protonate_dimorphite, ncpu=ncpu)
    read_func = read_smiles
elif program == 'pkasolver':
    protonate_func = partial(protonate_pkasolver, ncpu=ncpu)
    read_func = read_smiles
else:
    protonate_func = empty_func
    read_func = empty_generator

with contextlib.redirect_stdout(None):
    protonate_func(input_fname=tmp.name, output_fname=output)
```
Many thanks for the installation notes. We will include them in the README.
Is the output now identical to mine, with no difference in protonation states? If so, I will merge everything to the master branches and add this solution with `contextlib`.
Finally, I will keep the dimorphite implementation inside the code, but will remove it from the command line interface, because currently it is not useful and will only confuse users.
Yes, the protonated SMILES are identical to the most recent ones you showed.
One more question. Does it work on computers with a GPU? Should we add a `force_cpu` option or not?
I have not tested the GPU from scratch. It should also work given that it has the same result as CPU for the previous protonation (as in yesterday's result). I'll update you once I can test it on the GPU.
Assuming that users will follow the torch installation, there may be no need to use `force_cpu=True`. I guess it can be a good option if you want to make sure that people who accidentally installed torch-cuda get a warning to only use the CPU; see the sketch below.
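If we do go that way, the warning could be as simple as the following (a sketch only; `force_cpu` handling as proposed above, not current easydock behaviour):

```python
import warnings
import torch

def warn_if_cuda_ignored(force_cpu: bool = True):
    # let users know their CUDA-enabled torch build will not be used for pkasolver protonation
    if force_cpu and torch.cuda.is_available():
        warnings.warn("CUDA detected, but pkasolver protonation will run on CPU (force_cpu=True).")
```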
I updated `easydock/pkasolver2` and `pkasolver/main`.
The minor issue which remains is enumeration of stereoisomers after protonation, e.g. a new unspecified chiral center will appear in `C[C@@H]1CCCN(C)C1` after protonation. I'm thinking how to do that with minimal code perturbation and maximum flexibility for future changes. It may be worth redesigning `init_db`, pulling the function `get_isomers` out of it, and applying it only after protonated molecules have been generated.
```python
if not os.path.isfile(args.output):
    create_db(args.output, args)
    init_db(args.output, args.input, args.prefix)
else:
    args_dict, tmpfiles = restore_setup_from_db(args.output)
    # this will ignore stored values of those args which were supplied via command line
    # command line args have precedence over stored ones
    for arg in supplied_args:
        del args_dict[arg]
    args.__dict__.update(args_dict)

dask_client = create_dask_client(args.hostfile)

if args.protonation:
    add_protonation(args.output, program=args.protonation, tautomerize=not args.no_tautomerization, ncpu=args.ncpu)

populate_stereoisomers(args.output, args.max_stereoisomers)
```
However, this will create an issue: we will have records with identical `smi`, different `stereo_id` and different `protonated_smi`, which is very misleading and may result in many issues in the future. A solution may be to introduce an additional field to the DB, `protonated_id`, and allow a single molecule (SMILES) to have several protonation states (not alternative protonation states in the sense of dimorphite, but different stereoisomers appearing after protonation). I'm not confident in this solution, because it will complicate the logic of functions and data manipulation. However, I do not see a better alternative.
Currently I tend to ignore this issue and postpone its solution for the future.
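For reference, enumerating only the stereocenters that are still unassigned after protonation could be done with RDKit roughly like this (a sketch; `enumerate_new_stereo` is not an existing easydock function):

```python
from rdkit import Chem
from rdkit.Chem.EnumerateStereoisomers import EnumerateStereoisomers, StereoEnumerationOptions

def enumerate_new_stereo(protonated_smi: str, max_isomers: int = 4):
    # enumerate only stereocenters that are unassigned in the protonated SMILES,
    # leaving already-specified centers (e.g. the original [C@@H]) untouched
    mol = Chem.MolFromSmiles(protonated_smi)
    opts = StereoEnumerationOptions(onlyUnassigned=True, maxIsomers=max_isomers)
    return [Chem.MolToSmiles(m) for m in EnumerateStereoisomers(mol, options=opts)]

# simple illustration with an unassigned carbon center (alanine): returns both enantiomers
print(enumerate_new_stereo('CC(N)C(=O)O'))
```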
Some error using the newest environment and `run_dock ... --protonation pkasolver ...`:

```
File ".../miniconda3/envs/easydock_pka/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 366, in reduce_storage
    fd, size = storage._share_fd_cpu_()
RuntimeError: unable to open shared memory object in read-write mode: Too many open files (24)
```
But it works fine with `run_dock ... --protonation dimorphite ...`.
Have you tried reinstalling the conda environment from scratch? The error seems to be caused by torch.multiprocessing, but I am not sure if the default multiprocessing can call the torch multiprocessing.
Also, how many CPUs did you use?
Yes, I freshly installed a new conda environment, named easydock_new. My computer has 64 CPUs. Do I need to specify the CPU, something like cpu:0?
I was referring to the `-c` argument that you use to run the code.
I tried to reinstall it from scratch again and I still can't replicate your error. Maybe you can give the full error log from your side and your environment.txt? I'm not sure how to approach this error.
From what I found on the internet, the error is either caused by the Linux limit on how many files you can read or write (unlikely, because your `--protonation dimorphite` run works and I assume both access a similar number of files), or it may be caused by the pkasolver torch code or `QueryModel()`. Maybe you can give us the snippet for the `QueryModel()` class as well?
@DrrDom btw, for your previous question on GPU (if you are still interested):
Both GPU and CPU give identical protonated SMILES.
For 100 SMILES:
- `-c 30` (CPU) protonates in 57.70 s
- 1 GPU protonates in 46.74 s
- 2 GPU pool (shared GPU) protonates in 27.77 s
- 4 GPU pool (shared GPU) protonates in 18.94 s
The command is: `run_dock -i "$smi_file" -o "$output_file" --program vina --config config_vina.yml --protonation pkasolver -c 1 --sdf`
`--protonation dimorphite` uses the same input file.
For `QueryModel()`, I think it was installed by `pip install ...` into miniconda3/envs/easydock_pka/lib/python3.9/site-packages/pkasolver/query.py. I have not modified it; it is as follows:
```python
class QueryModel:
    def __init__(self):
        self.models = []

        for i in range(25):
            model_name, model_class = "GINPair", GINPairV1
            model = model_class(
                num_node_features, num_edge_features, hidden_channels=96
            )
            base_path = path.dirname(__file__)
            if torch.cuda.is_available() == False:  # If only CPU is available
                checkpoint = torch.load(
                    f"{base_path}/trained_model_without_epik/best_model_{i}.pt",
                    map_location=torch.device("cpu"),
                )
            else:
                checkpoint = torch.load(
                    f"{base_path}/trained_model_without_epik/best_model_{i}.pt"
                )

            model.load_state_dict(checkpoint["model_state_dict"])
            model.eval()
            model.to(device=DEVICE)
            self.models.append(model)

    def predict_pka_value(self, loader: DataLoader) -> np.ndarray:
        """
        Parameters
        ----------
        loader
            data to be predicted

        Returns
        -------
        np.array
            list of predicted pKa values
        """
        results = []
        assert len(loader) == 1
        for data in loader:  # Iterate in batches over the training dataset.
            data.to(device=DEVICE)
            consensus_r = []
            for model in self.models:
                y_pred = (
                    model(
                        x_p=data.x_p,
                        x_d=data.x_d,
                        edge_attr_p=data.edge_attr_p,
                        edge_attr_d=data.edge_attr_d,
                        data=data,
                    )
                    .reshape(-1)
                    .detach()
                )
                consensus_r.append(y_pred.tolist())
            results.extend(
                (
                    float(np.average(consensus_r, axis=0)),
                    float(np.std(consensus_r, axis=0)),
                )
            )
        return results
```
The environment is attached: easydock_pka.txt
I am assuming `easydock_pka` is the same as the `easydock_new` environment? I have tried installing `easydock_pka` (torch dependencies, easydock, dimorphite, and pkasolver are installed separately with pip because conda probably won't recognise them) and it still works on my side.
I am now a bit lost. What about sending me the easydock `protonation.py` file then? It should be the most updated one, right?
Also, it would be helpful if you could show the error before this one too.
```
File ".../miniconda3/envs/easydock_pka/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 366, in reduce_storage
    fd, size = storage._share_fd_cpu_()
RuntimeError: unable to open shared memory object </torch_4154596_2592645159_498> in read-write mode: Too many open files (24)
```
Input files are also attached: test.zip
```
(easydock_pka) gwb@node01: Small_Molecule/Y73C_GTP$ ./Ensemble_RunDock.sh
Traceback (most recent call last):
  File "/home/gwb/miniconda3/envs/easydock_pka/bin/run_dock", line 8, in <module>
```
> I am assuming `easydock_pka` is the same as the `easydock_new` environment?
Yes, they are the same; it was a typo.
Yes, it still runs without issue.
Ok, what about changing the protonation function. Maybe it works for your case?
```python
def protonate_pkasolver(input_fname: str, output_fname: str, ncpu: int = 1):
    from pkasolver.query import QueryModel
    model = QueryModel()
    with contextlib.redirect_stdout(None):
        pool = Pool(ncpu)
        with open(output_fname, 'wt') as f:
            pkasolver_output = pool.imap_unordered(partial(__protonate_pkasolver, model=model), read_input(input_fname))
            pool.close()
            pool.join()
            for smi, name in pkasolver_output:
                f.write(f'{smi}\t{name}\n')
```
@Samuel-gwb, I'm a little bit lost. You posted two error messages with "too many open files". One is related to `torch`, the other to standard `multiprocessing`. Do you use the latest version of the easydock `pkasolver2` branch? Do you have a GPU?
Since you use `-c 1`, I cannot imagine how you may exceed the number of opened files.
You may increase the number of file descriptors opened simultaneously with `ulimit -n 4096`, but this does not look like a proper solution.
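If raising the limit is ever needed, it could also be done from within Python at the start of the run rather than via the shell (a sketch, not something easydock currently does):

```python
import resource

# raise the soft limit for open file descriptors up to 4096 (bounded by the hard limit)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))
```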
Yes that is what I thought as well.
From the error, it looks like the default multiprocessing calls the torch version, which calls the default version again. It is very interesting.
I would rather not use the ulimit solution, as it is surprising that using one CPU would cause this issue; it can be kept as a last resort if everything else fails.
If `multiprocessing.pool` calls `multiprocessing.pool` directly, it should result in an error about nested processes or the like, because this is forbidden by design. If this call happens through `torch`, maybe this avoids that error but causes another one.
In that case I see two possible solutions:
1. Add a `force_cpu` argument to `QueryModel` and set it to `True`.
2. Modify the `protonate_pkasolver` function and call protonation without `multiprocessing.pool`.
@Samuel-gwb, could you test the function below?
```python
def protonate_pkasolver(input_fname: str, output_fname: str, ncpu: int = 1):
    import torch
    from pkasolver.query import QueryModel
    model = QueryModel()
    with contextlib.redirect_stdout(None):
        if torch.cuda.is_available() or ncpu == 1:
            with open(output_fname, 'wt') as f:
                for mol, mol_name in read_input(input_fname):
                    smi, name = _protonate_pkasolver(mol, mol_name, model=model)
                    f.write(f'{smi}\t{name}\n')
        else:
            pool = Pool(ncpu)
            with open(output_fname, 'wt') as f:
                for smi, name in pool.imap_unordered(partial(__protonate_pkasolver, model=model), read_input(input_fname)):
                    f.write(f'{smi}\t{name}\n')
```
Yes, I use the same easydock_pka environment for different tests, and the last error message has repeated over the last several runs. I will try your solutions with the modified pkasolver function!
Very confused! Again, I freshly installed an environment, just changing easydock --> easydock_test1:

```
conda create -n easydock_test1 -c conda-forge python=3.9 numpy=1.20 rdkit scipy dask distributed
conda activate easydock_test1
pip install paramiko meeko vina
pip install git+https://github.com/Feriolet/dimorphite_dl.git
pip install git+https://github.com/DrrDom/pkasolver.git@noprints
pip install git+https://github.com/ci-lab-cz/easydock.git@pkasolver2
pip install torch==1.13.1+cpu --extra-index-url https://download.pytorch.org/whl/cpu
pip install torch-geometric==2.0.1
pip install torch_scatter==2.1.1+pt113cpu -f https://data.pyg.org/whl/torch-1.13.1%2Bcpu.html
pip install torch_sparse==0.6.17+pt113cpu -f https://data.pyg.org/whl/torch-1.13.1%2Bcpu.html
pip install torch_spline_conv==1.2.2+pt113cpu -f https://data.pyg.org/whl/torch-1.13.1%2Bcpu.html
pip install molvs chembl_webresource_client matplotlib pytest-cov codecov svgutils cairosvg ipython
```

Then, using the default protonate_pkasolver function, the error is:

```
Traceback (most recent call last):
  File "/home/gwb/miniconda3/envs/easydock_test1/bin/run_dock", line 8, in <module>
```
Using Feriolet's version to modify "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/site-packages/easydock/protonation.py" (replacing the contents of `def protonate_pkasolver`), then:

```
Traceback (most recent call last):
  File "/home/gwb/miniconda3/envs/easydock_test1/bin/run_dock", line 8, in <module>
```
Using Pavel's version, I needed to modify `smi, name = protonate...` --> `smi, name = protonate_...`:

```
Traceback (most recent call last):
  File "/home/gwb/miniconda3/envs/easydock_test1/bin/run_dock", line 8, in <module>
```
Please put brackets around `mol` and `mol_name`. This will allow `__protonate_pkasolver` to treat `mol` and `mol_name` as one variable instead of two. Without them, the function treats `mol_name` as the model, which is why the code is confused:

```python
smi, name = __protonate_pkasolver((mol, mol_name), model=model)
```
Great, it works! Many thanks !
1) One more thing: `-c` has to be 1: `run_dock -i GTP.smi -o GTP_vina.db --program vina --config config_vina.yml --protonation pkasolver -c 1 --sdf`. Any `-c` > 1 will cause the "Too many open files" error.
2) Another thing is that the additional nitrogen adjacent to the imidazole ring of GTP was protonated as NH-. I know that it will be NH2 when someone uses Schrodinger.
GTP SMILES: `OC1C(COP(=O)(OP(=O)(OP(=O)(O)O)O)O)OC(C1O)n1cnc2c1[nH]c(N)nc2=O` protonated by pkasolver --> `[NH-]c1nc(=O)c2ncn([C@H]3OC@@HC@H[C@H]3[O-])c2[n-]1`
`multiprocessing` is used when `-c` > 1, and that is the package giving you the error. We hope that using 1 CPU is sufficient for your use case. I honestly still can't reproduce your error, so I can't really help you much with that. I also tried running it on an Apple M1, and there is no such issue there either. We can still try to tackle the `multiprocessing` issue if you wish to use more than 1 CPU, but it may be challenging as some of the obvious solutions do not work.
@DrrDom, correct me if there is any mistake in what I said.
The error for ncpu > 1 is strange. This means that you do not have a detectable GPU and use exclusively `multiprocessing`. I have never met such an error with `multiprocessing`.
Wrong protonations may occur; every protonation tool is incorrect to some extent. The publicly available `pkasolver` model was trained on single-center molecules, so predictions for complex molecules with multiple protonation centers may be incorrect. That is why the applicability of different protonation tools should be studied more thoroughly. Meanwhile we may use `pkasolver` as an alternative to chemaxon.
I updated `master` with the most recent changes. I'll keep the issue open, because I believe we will return to it in the future.
Thanks a lot to everybody who helped with that!
Great! Some tiny things:
1) In the README, the line for the pip installation of torch_spline_conv contains an additional ''' at the end.
2) It seems that `pip install cairosvg svgutils` is needed. And, at last, the installation may need to include `pip install .` at $easydock_home.
3) Thus, when using CPU-based pkasolver for protonation, one needs to set `-c 1`? If so, include it in the README?
Thank you!
Installation of `easydock` was described. Your suggestion is not relevant for ordinary users; this is mainly for developers who have a clone of the repository. I'll update the PyPI package soon. I expect to close another PR before officially updating the version.

I agreed with @Samuel-gwb on his 2nd point. Don't we need `pip install cairosvg svgutils` to run the pkasolver? At least on my side it gave the import error for cairosvg.
Edit: never mind, I think I got your point, my bad.
You were right) Thanks for pointing it out. I indeed missed adding these packages (cairosvg, svgutils) to the list of required ones. I'll do that.
maybe switch to Dimorphite-DL: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0336-9
I am in academia and I don't even have a chemaxon license anymore... Software vendors always change their license terms one day or another...