ci-lab-cz / easydock

BSD 3-Clause "New" or "Revised" License
35 stars 12 forks

dependence on chemaxon is annoying #17

Open UnixJunkie opened 5 months ago

UnixJunkie commented 5 months ago

maybe switch to Dimorphite-DL: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0336-9

I am in academia and I don't even have a chemaxon license anymore... Software vendors always change their license terms one day or another...

DrrDom commented 5 months ago

Agreed, ChemAxon's change to the academic license is inappropriate. I am now considering different options. I previously tested Dimorphite-DL, pkasolver and ChemAxon, and at first glance ChemAxon performed better in complex cases. However, I would prefer a more systematic study; at the very least it would be reasonable to compare Dimorphite-DL and pkasolver. Unfortunately, this will take time, so it will not be fast, but we will try different alternatives. Actually, this could be an opportunity to develop something new.

DrrDom commented 5 months ago

It is still possible to disable chemaxon protonation (--no_protonation) and use some external tools to protonate structures before docking

UnixJunkie commented 5 months ago

as another hackish alternative, there would be obabel's -p option (Add Hydrogens appropriate for pH)

DrrDom commented 3 months ago

I implemented an easy way to integrate protonation tools and added Dimorphite as an example in the separate branch dimorphite-example.

What I dislike about Dimorphite:

I would prefer that Dimorphite be converted into a package that can be installed with pip and run with a Pool on multiple cores. Ideally, the bug where Dimorphite's argparser catches the main program's arguments when used through its API would also be fixed.

If someone can help with that, it will be faster to integrate Dimorphite.

@Feriolet, would you be able to help with that?

Feriolet commented 3 months ago

Yes, these have been some of the pain points of using Dimorphite-DL for me as well. I tried to run 250k compounds, but init_db ran slowly (around 7 hours). Dimorphite-DL may be the cause, but I have not tested it.

It is also annoying that Dimorphite's argparse is problematic, so it can't be used as a package; the current solution I have uses subprocess.

I can try to fix the bug in Dimorphite-DL.

Feriolet commented 3 months ago

I have found the reason for the argparse error in dimorphite_dl. Dimorphite catches the main program's arguments because of the line args = vars(parser.parse_args()): called with no argument, .parse_args() collects the arguments given on the terminal by default (i.e. the main run_dock arguments). To override this and use the desired arguments, we have to pass a list ['--argument', value, '--argument', value] to .parse_args(list).

Since the **kwargs arrive as a dict, I have made a temporary function that converts the dict into a flat list:

def convert_kwargs_to_sys_argv(params: dict) -> list:
    sys_argv_list = []
    for key, value in params.items():
        # arguments defined with store_true only need the flag itself, not a 'True' value
        if value is True:
            sys_argv_list.append('--'+key)
        elif value is False:
            continue
        else:
            sys_argv_list.append('--'+key)
            sys_argv_list.append(str(value))

    return sys_argv_list
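
For illustration, calling it with a dict of Dimorphite-style options (the key names here are just examples) gives a flat argv-style list:

convert_kwargs_to_sys_argv({'min_ph': 6.4, 'max_ph': 8.4, 'silent': True, 'smiles': 'NCC(=O)O'})
# -> ['--min_ph', '6.4', '--max_ph', '8.4', '--silent', '--smiles', 'NCC(=O)O']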

Then, the main() function of the dimorphite_dl is modified to:

def main(params=None):
    """The main definition run when you call the script from the commandline.

    :param params: The parameters to use. Entirely optional. If absent,
                   defaults to None, in which case arguments will be taken from
                   those given at the command line.
    :type params: dict, optional
    :return: Returns a list of the SMILES strings if the return_as_list
             parameter is True. Otherwise, returns None.
    """
    sys_argv_list = convert_kwargs_to_sys_argv(params)
    parser = ArgParseFuncs.get_args()
    args = vars(parser.parse_args(sys_argv_list))
    print(args)
    if not args["silent"]:
        print_header()

    # Add in any parameters in params.
    if params is not None:
        for k, v in params.items():
            args[k] = v

    # If being run from the command line, print out all parameters.
    if __name__ == "__main__":
        if not args["silent"]:
            print("\nPARAMETERS:\n")
            for k in sorted(args.keys()):
                print(k.rjust(13) + ": " + str(args[k]))
            print("")

    if args["test"]:
        # Run tests.
        TestFuncs.test()
    else:
        # Run protonation
        if "output_file" in args and args["output_file"] is not None:
            # An output file was specified, so write to that.
            with open(args["output_file"], "w") as file:
                for protonated_smi in Protonate(args):
                    file.write(protonated_smi + "\n")
        elif "return_as_list" in args and args["return_as_list"] == True:
            return list(Protonate(args))
        else:
            # No output file specified. Just print it to the screen.
            for protonated_smi in Protonate(args):
                print(protonated_smi)
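
With this change, the function can be called from Python with a plain dict. In the example below the flag names follow Dimorphite-DL's command-line options and should be treated as assumptions:

main({
    'smiles_file': 'input.smi',      # flag names assumed from Dimorphite-DL's CLI
    'output_file': 'protonated.smi',
    'min_ph': 6.4,
    'max_ph': 8.4,
    'silent': True,
})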

DrrDom commented 3 months ago

Good first step. Thank you!

I checked the Dimorphite code with parallelization in mind, and I would say it may not be the most difficult part. However, the code of the generator class looks quite complex, and I do not understand whether this complexity is necessary or not.

Feriolet commented 3 months ago

Yes, it will take me some time to understand the code, as it also uses multiple classes in large chunks.

Feriolet commented 3 months ago

I noticed that the Protonate() class is an iterator with __next__() and __iter__(). Is it even possible to use multiprocessing with this type of class? Calling Protonate(args) may immediately loop over the smi list before we can multiprocess it.

EDIT: never mind, it seems that a for loop calls the next() function of an iterator. I'll see if I can call multiprocessing within/outside the class.

DrrDom commented 3 months ago

In theory it should be possible, but I do not have experience with such classes. I usually implement this kind of thing in a different way.

A very simple example:

from multiprocessing import Pool

with Pool(ncpu) as pool:
    with open(smi_filename) as f:
        # f yields lines lazily, so the whole file is never loaded into memory
        for mol_id, protonated_smi in pool.imap_unordered(Protonate, f):
            if protonated_smi:
                update_db(mol_id, protonated_smi)

where Protonate would take a line and return mol_id and protonated_smi. f is actually a generator, which avoids memory overflow.

There are some examples of generators working with multiprocessing - https://codereview.stackexchange.com/questions/259029/compute-the-outputs-of-a-generator-in-parallel, but those implementations are also simpler.

Feriolet commented 3 months ago

I have tried to understand how the iterator class works and I think I have a few ideas on what to do. The Dimorphite class needs to take the input filename, and the iterator is generated by the LoadSMIFile() class and stored in args["smiles_and_data"] when ArgParseFuncs.clean_args() is called. Because of the nature of the iterator class, which relies on next(iterator), the input must be an iterator, not just an iterable.

The above code probably won't work, because Protonate() will treat f as the args input, which is not what we want.

My idea is that we can either change how Protonate() treats the input by separating the args["smiles_and_data"] code from args, which means adding another argument, Protonate.__init__(self, args, smi_iterator). But I don't know whether that would break how Dimorphite intends to run the code. Another, rather ridiculous, way is to split the smi file into n separate smi files (e.g., 250k smi would give 250k separate files), which could easily be multiprocessed without modifying the code too much. But again, generating 250k files is kind of ridiculous.

If we plan to change the code, we can probably make Protonate() take an input of the form iter([iter(smi), iter(smi), ...]), where smi represents one SMILES string, using the following function:

def convert_list_to_iter_in_iter(input_list: list):
    # wrap each SMILES string in its own iterator and return an iterator over these iterators
    iter_list = [iter(str(element)) for element in input_list]
    return iter(iter_list)

Then we can use pool.imap_unordered(Protonate, iter(iter_list)) so the input is treated as an iterator rather than a list/string.

DrrDom commented 3 months ago

The idea of multiple input files is very nice, because we will store all these data on disk anyway, and it does not matter in how many files.

We can create 2-3x as many files as the number of CPUs requested by the user on the command line (argument -c), propagate this ncpu argument to the add_protonation function and use it to process the input files with a pool and subprocess, as sketched below. Not a perfect solution, but fast and working.

If the speed turns out to be too slow, we may consider using dask for this task, but I hope this will be reasonably good.
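
A rough sketch of that split-and-run approach (the file names, the chunking helper and the exact dimorphite_dl invocation are assumptions for illustration only):

import subprocess
from itertools import cycle
from multiprocessing import Pool

def split_smi_file(smi_fname: str, n_chunks: int) -> list:
    # distribute the lines of the input .smi file round-robin over n_chunks temporary files
    chunk_fnames = [f'{smi_fname}.chunk{i}.smi' for i in range(n_chunks)]
    handles = [open(fname, 'w') for fname in chunk_fnames]
    with open(smi_fname) as f:
        for line, h in zip(f, cycle(handles)):
            h.write(line)
    for h in handles:
        h.close()
    return chunk_fnames

def protonate_chunk(chunk_fname: str) -> str:
    # run dimorphite_dl on one chunk via subprocess; script name and flags are assumptions
    out_fname = chunk_fname + '.protonated'
    subprocess.run(['python', 'dimorphite_dl.py', '--smiles_file', chunk_fname,
                    '--output_file', out_fname], check=True)
    return out_fname

def protonate_in_chunks(smi_fname: str, ncpu: int) -> list:
    chunk_fnames = split_smi_file(smi_fname, n_chunks=2 * ncpu)
    with Pool(ncpu) as pool:
        return list(pool.imap_unordered(protonate_chunk, chunk_fnames))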

DrrDom commented 3 months ago

You may create a package of dimorphite-dl under your account. Uploading it to PyPI is not necessary; we may specify in the README to install dimorphite directly from your repository.

Feriolet commented 3 months ago

By installing, do you mean through a git clone of the repo?

Btw, protonating 250k compounds on 1 CPU takes 1037 s. So I guess using ncpu cores would be sufficient if we want to scale up the protonation.

DrrDom commented 3 months ago

I mean pip install git+.... So I suggest making it similar to easydock. Then we will be able to import it from easydock and run it directly from Python even without subprocess (since we will solve the issue with command-line argument parsing).

Samuel-gwb commented 3 months ago

Is Dimorphite ready to use in the branch?

Feriolet commented 3 months ago

I mean pip install git+.... So I suggest making it similar to easydock. Then we will be able to import it from easydock and run it directly from Python even without subprocess (since we will solve the issue with command-line argument parsing).

Oh okay, but how is it different from using the dimorphite_example branch currently? I am forking that branch to modify the code and implement the current feature. I will try to make a pull request soon to fix the dimorphite implementation.

DrrDom commented 3 months ago

@Samuel-gwb, yes, but it uses one core, and for large sets this will take some time: 250k molecules ~ 20 min.

@Feriolet, the difference will be that we can completely remove the directory and files of dimorphite from the easydock package. Dimorphite will be installed as an individual package and imported similarly to the current import (only the name of the package will change). If you need more details, please ask.

Feriolet commented 3 months ago

But we do still need to change the dimorphite source code, right? I feel it is inconvenient for the user to install it as a separate package and then alter it themselves. Or is there something I missed about installing it in a different directory from easydock?

Edit: never mind, I didn't read that separating them will solve the argparse issue. Will do that!

DrrDom commented 3 months ago

Yes, the source code should be modified to enable multiprocessing and correct treatment of command-line arguments. A separate package will not solve the issue with command-line arguments, so this should be fixed explicitly.

Installation as a package is preferable and convenient:

  1. No license mixture is needed.
  2. Maintaining a separate package is simpler. It may also be developed further independently, if someone finds these changes useful.
  3. A user installs a number of packages anyway; this will only require copy-pasting one line from the README.

Feriolet commented 3 months ago

I can't install dimorphite through pip.

ERROR: git+https://github.com/durrantlab/dimorphite_dl does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.

DrrDom commented 3 months ago

Yes, of course. Therefore, it should be converted into a Python package to enable installation with pip.

This is quite easy. You may check the structure of the easydock repository. You have to add setup.cfg, setup.py, LICENSE.txt, README.md and MANIFEST.in to the root dir of your fork of the repository, move the other files into a subdirectory named, for example, dimorphite, and add an __init__.py file (it may be empty) to that subdirectory.

In MANIFEST.in you should list the files which are not Python scripts (*.py) but which should be shipped with the package. In Dimorphite this will be the file with the patterns (at least at first glance). Example of such a file: https://github.com/DrrDom/pmapper/blob/master/MANIFEST.in
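
For illustration, a minimal setup.py could look roughly like the sketch below; the package name, version, dependency list and the name of the patterns file are placeholders, not the actual values:

from setuptools import setup, find_packages

setup(
    name='dimorphite_dl',
    version='1.0',
    description='Dimorphite-DL packaged for pip installation',
    packages=find_packages(),
    include_package_data=True,  # ship the non-.py files listed in MANIFEST.in
    package_data={'dimorphite': ['site_substructures.smarts']},  # patterns file name is an assumption
    install_requires=['rdkit'],  # dependency list is an assumption
)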

Feriolet commented 3 months ago

Huh, I tried creating the setup.cfg and setup.py for dimorphite, but it does not create a library in the conda environment (unless you consider an egg-link to be one). I made the dimorphite directory in my Desktop directory, then installed it with pip install -e .

It works for now, though. I have tried re-running it and it works. I forgot what min and max pH I used for my initial run, so I can't test whether the code in my pull request actually affects the docking score. Since I can't reproduce the same protonated smi and docking score with my code, maybe you can check whether the code I sent in the pull request is good enough.

Also, since we are creating the setup.cfg and setup.py, are we going to include them in our repository as well, or do we ask the user to do it on their own in the README file?

DrrDom commented 3 months ago

Installing with pip install -e . does not transfer files to the conda environment but creates links. This is almost like an ordinary installation. So, everything is OK.

You have to create your fork on GitHub with the dimorphite code and edit it to fix the problems which interfere with easydock. Afterwards you can make a PR to the original repository. If the maintainers accept it, this will be excellent. If not, we can mention in the easydock README to install dimorphite with pip install git+https://github.com/Feriolet/dimorphite_dl. In the latter case you will become the maintainer of this specific version of dimorphite. So, no dimorphite files should be included in the easydock repo.

Samuel-gwb commented 3 months ago

It seems that dimorphite_dl can be used successfully via "pip install git+https://github.com/Feriolet/dimorphite_dl". Great work! Will the official code be updated according to the branches?

DrrDom commented 3 months ago

Yes, the tests passed successfully. I merged all changes to master, but I'll postpone releasing a new version, because there is one critical issue which should be addressed somehow.

I noticed that Dimorphite does not behave well with nitrogen centers and with molecules having more than a single protonation center. Below are several test examples. In cases 3 and 4 the nitrogen atom was not protonated as expected.

smi         smi_protonated
Oc1ccccn1       [O-]c1ccccn1
O=c1cccc[nH]1       O=c1cccc[n-]1
NCC(=O)O        NCC(=O)[O-]
C[C@@H]1CCCN(C)C1   C[C@@H]1CCCN(C)C1

Dimorphite was designed to enumerate a list of possible forms at a given pH; it cannot predict the most favorable one. The logic behind it was to enumerate all reasonable forms and dock all of them. Since we limit the number of protonated forms to 1, Dimorphite takes the first one from the list and returns it.

Therefore, we have two options:

Alternative solution - pkasolver. Half a year ago I adapted pkasolver to predict the most stable protonation state: I took the form with the pKa closest to 7.4. pkasolver also has some drawbacks and errors in the assignment of protonation states. However, to some extent it can treat molecules with multiple centers (though the publicly available model was trained on single-center molecules only). I have to adapt the previous implementation to the new program architecture and can make a PR to enable discussion of all solutions.
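
For illustration, the selection heuristic is roughly the following; predicted_states is assumed here to be a list of (pKa, mol) pairs produced by pkasolver, and the real pkasolver API may differ:

def pick_form_closest_to_ph(predicted_states, ph: float = 7.4):
    # keep the protonation form whose predicted pKa is closest to the target pH
    pka, mol = min(predicted_states, key=lambda pair: abs(pair[0] - ph))
    return mol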

That is why, in the beginning, I mentioned that it would be good to study and compare all available solutions (free or open-source) in order to choose the most appropriate one(s).

Feriolet commented 3 months ago

@Samuel-gwb sorry for the late reply, but yes, the code should have been merged. We will also update the README to include the dimorphite_dl fork.

While the latter seems ideal, it may not work well with small molecules with multiple protonation states, as you mentioned. I tried in the past to protonate the molecule CCCC[C@@H](CN[C@@H](CCCC)C(=O)[N-][C@@H](CCC([NH-])=O)C(=O)[N-][C@@H](CCCNC(N)=[NH2+])C([NH-])=O)[N-]C(=O)[C@@H]([N-]C(=O)[C@@H]([N-]C(C)=O)[C@@H](C)O)[C@@H](C)CC, and dimorphite_dl produced over 128 protonated molecules (the default limit, if I remember correctly), which may significantly and unnecessarily increase the computational time and resources.

Yes, it would be great if we could implement multiple solutions in the future. Regarding pkasolver, did you mean https://github.com/mayrf/pkasolver ? From a brief look, it seems that we would need to use all possible protonated molecules as the input and return a single molecule with the pKa closest to 7.

DrrDom commented 3 months ago

Yes, this is the pkasolver repo. It takes a single molecule and enumerates multiple protonation forms with predicted pKa, and I chose the one closest to 7.4. I'll try to adapt it quickly and make a PR to show the idea and test it. With the new architecture it should be relatively easy to do.

You are right, I missed the issue of too many forms for a single molecule. It may really increase docking time substantially and may not be necessary. This is another con.

Samuel-gwb commented 3 months ago

Thanks for your quick response @DrrDom @Feriolet, and I appreciate your thinking and efforts very much! I also like easydock very much - it makes screening really easy :-) Hoping for an efficient solution.

DrrDom commented 3 months ago

I uploaded the branch pkasolver2. It is expected to work; however, I could not install the dependencies for pkasolver, so I did not test it, but previously almost the same code worked. You may play with it. I also remember some strange behavior. To run it, just specify run_dock ... --protonation pkasolver

Samuel-gwb commented 3 months ago

Got it! thanks

Feriolet commented 3 months ago

In case you are testing the code very soon, do change line 11 of protonation.py from from read_input import read_input to from easydock.read_input import read_input to prevent the import error.

I am unable to test pkasolver because of cudatoolkit at the moment; I will update when I have the chance to do so.

DrrDom commented 3 months ago

Thank you, fixed! If someone tests it, please report whether and how it works.

I cannot create an environment from the yml file, and manual installation requires specific versions of packages, because not all recent versions can be installed on my old system.

Samuel-gwb commented 3 months ago

After downloading the pkasolver2 branch and conda activate easydock_pka, I installed the dependencies as follows:

1) pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
2) pip install torch-geometric==2.0.1 torch-sparse torch-scatter molvs chembl_webresource_client
3) conda install -c conda-forge matplotlib pytest-cov codecov svgutils cairosvg ipython
4) python setup.py install

Then: run_dock ... --protonation pkasolver -c 1 --sdf

Errors appear:

Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/multiprocessing/queues.py", line 367, in get
    return _ForkingPickler.loads(res)
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 120, in rebuild_cuda_tensor
    torch.cuda._lazy_init()
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/site-packages/torch/cuda/__init__.py", line 217, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

The code in torch/cuda/__init__.py that raises this RuntimeError is:

def _lazy_init():
    global _initialized, _queued_calls
    if is_initialized() or hasattr(_tls, 'is_initializing'):
        return
    with _initialization_lock:
        # We be double-checked locking, boys!  This is OK because
        # the above test was GIL protected anyway.  The inner test
        # is for when a thread blocked on some other thread which was
        # doing the initialization; when they get the lock, they will
        # find there is nothing left to do.
        if is_initialized():
            return
        # It is important to prevent other threads from entering _lazy_init
        # immediately, while we are still guaranteed to have the GIL, because some
        # of the C calls we make below will release the GIL
        if _is_in_bad_fork():
            raise RuntimeError(
                "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
                "multiprocessing, you must use the 'spawn' start method")

Samuel-gwb commented 3 months ago

Is it possible to carry out pkasolver before run_dock? Something like: ensemble.smi --> ensemble_pka.smi, and then run_dock ensemble_pka.smi ...

Here ensemble.smi --> ensemble_pka.smi means that, due to pkasolver, one input entry may give several protonated entries, e.g.:

example_1 --> example_1
example_2 --> example_2_01, example_2_02, ...

This is mainly to avoid involving pkasolver in a sub-/multiprocess, which seems to lead to the error.

Feriolet commented 3 months ago

Can I ask how many GPUs you have on your machine?

It is possible to protonate your SMILES first with a separate script and then run run_dock with the --no_protonation argument.

Samuel-gwb commented 3 months ago

2 GPUs. Need specification?

UnixJunkie commented 3 months ago

Depending on a GPU might also be annoying. It would be nice if the code can run on a regular laptop, which might not have any decent GPU. Some cluster nodes have very powerful CPUs but no GPU, etc.


Feriolet commented 3 months ago

Yes, that is true. I have not installed pkasolver because of a CUDA compatibility issue right now, but may I ask whether pkasolver uses the CPU or the GPU to run? If possible, we can force pkasolver to use the CPU instead of the GPU to make our code consistent.

Samuel-gwb commented 3 months ago

Yes. By default, pkasolver uses the GPU, as indicated by nvidia-smi.

Feriolet commented 3 months ago

@Samuel-gwb can you provide me with the full torch and CUDA versions? I tried to use the versions mentioned above, but I get a segmentation fault (core dumped) error.

Feriolet commented 3 months ago

The QueryModel() class seems to accommodate both GPU and CPU, as can be seen in this code:

            if torch.cuda.is_available() == False:  # If only CPU is available
                checkpoint = torch.load(
                    f"{base_path}/trained_model_without_epik/best_model_{i}.pt",
                    map_location=torch.device("cpu"),
                )
            else:
                checkpoint = torch.load(
                    f"{base_path}/trained_model_without_epik/best_model_{i}.pt"
                )

Samuel-gwb commented 3 months ago

Please check the attachment

environment_easydock_pka.txt

Samuel-gwb commented 3 months ago

It seems that running on the CPU also has a problem. I created a new environment with the CPU version of torch. However, the following errors appear, using the previous run_dock ... --protonation pkasolver:

File "/home/gwb/miniconda3/envs/easydock_pkaCPU/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( RuntimeError: Error(s) in loading state_dict for GINPairV1: Missing key(s) in state_dict: "GIN_p.convs.0.nn.lins.0.weight", "GIN_p.convs.0.nn.lins.0.bias", "GIN_p.convs.0.nn.lins.1.weight", "GIN_p.convs.0.nn.lins.1.bias", "GIN_p.convs.1.nn.lins.0.weight", "GIN_p.convs.1.nn.lins.0.bias", "GIN_p.convs.1.nn.lins.1.weight", "GIN_p.convs.1.nn.lins.1.bias", "GIN_p.convs.2.nn.lins.0.weight", "GIN_p.convs.2.nn.lins.0.bias", "GIN_p.convs.2.nn.lins.1.weight", "GIN_p.convs.2.nn.lins.1.bias", "GIN_p.convs.3.nn.lins.0.weight", "GIN_p.convs.3.nn.lins.0.bias", "GIN_p.convs.3.nn.lins.1.weight", "GIN_p.convs.3.nn.lins.1.bias", "GIN_d.convs.0.nn.lins.0.weight", "GIN_d.convs.0.nn.lins.0.bias", "GIN_d.convs.0.nn.lins.1.weight", "GIN_d.convs.0.nn.lins.1.bias", "GIN_d.convs.1.nn.lins.0.weight", "GIN_d.convs.1.nn.lins.0.bias", "GIN_d.convs.1.nn.lins.1.weight", "GIN_d.convs.1.nn.lins.1.bias", "GIN_d.convs.2.nn.lins.0.weight", "GIN_d.convs.2.nn.lins.0.bias", "GIN_d.convs.2.nn.lins.1.weight", "GIN_d.convs.2.nn.lins.1.bias", "GIN_d.convs.3.nn.lins.0.weight", "GIN_d.convs.3.nn.lins.0.bias", "GIN_d.convs.3.nn.lins.1.weight", "GIN_d.convs.3.nn.lins.1.bias". Unexpected key(s) in state_dict: "GIN_p.lin.weight", "GIN_p.lin.bias", "GIN_p.convs.0.nn.0.weight", "GIN_p.convs.0.nn.0.bias", "GIN_p.convs.0.nn.1.weight", "GIN_p.convs.0.nn.1.bias", "GIN_p.convs.0.nn.1.running_mean", "GIN_p.convs.0.nn.1.running_var", "GIN_p.convs.0.nn.1.num_batches_tracked", "GIN_p.convs.0.nn.3.weight", "GIN_p.convs.0.nn.3.bias", "GIN_p.convs.1.nn.0.weight", "GIN_p.convs.1.nn.0.bias", "GIN_p.convs.1.nn.1.weight", "GIN_p.convs.1.nn.1.bias", "GIN_p.convs.1.nn.1.running_mean", "GIN_p.convs.1.nn.1.running_var", "GIN_p.convs.1.nn.1.num_batches_tracked", "GIN_p.convs.1.nn.3.weight", "GIN_p.convs.1.nn.3.bias", "GIN_p.convs.2.nn.0.weight", "GIN_p.convs.2.nn.0.bias", "GIN_p.convs.2.nn.1.weight", "GIN_p.convs.2.nn.1.bias", "GIN_p.convs.2.nn.1.running_mean", "GIN_p.convs.2.nn.1.running_var", "GIN_p.convs.2.nn.1.num_batches_tracked", "GIN_p.convs.2.nn.3.weight", "GIN_p.convs.2.nn.3.bias", "GIN_p.convs.3.nn.0.weight", "GIN_p.convs.3.nn.0.bias", "GIN_p.convs.3.nn.1.weight", "GIN_p.convs.3.nn.1.bias", "GIN_p.convs.3.nn.1.running_mean", "GIN_p.convs.3.nn.1.running_var", "GIN_p.convs.3.nn.1.num_batches_tracked", "GIN_p.convs.3.nn.3.weight", "GIN_p.convs.3.nn.3.bias", "GIN_d.lin.weight", "GIN_d.lin.bias", "GIN_d.convs.0.nn.0.weight", "GIN_d.convs.0.nn.0.bias", "GIN_d.convs.0.nn.1.weight", "GIN_d.convs.0.nn.1.bias", "GIN_d.convs.0.nn.1.running_mean", "GIN_d.convs.0.nn.1.running_var", "GIN_d.convs.0.nn.1.num_batches_tracked", "GIN_d.convs.0.nn.3.weight", "GIN_d.convs.0.nn.3.bias", "GIN_d.convs.1.nn.0.weight", "GIN_d.convs.1.nn.0.bias", "GIN_d.convs.1.nn.1.weight", "GIN_d.convs.1.nn.1.bias", "GIN_d.convs.1.nn.1.running_mean", "GIN_d.convs.1.nn.1.running_var", "GIN_d.convs.1.nn.1.num_batches_tracked", "GIN_d.convs.1.nn.3.weight", "GIN_d.convs.1.nn.3.bias", "GIN_d.convs.2.nn.0.weight", "GIN_d.convs.2.nn.0.bias", "GIN_d.convs.2.nn.1.weight", "GIN_d.convs.2.nn.1.bias", "GIN_d.convs.2.nn.1.running_mean", "GIN_d.convs.2.nn.1.running_var", "GIN_d.convs.2.nn.1.num_batches_tracked", "GIN_d.convs.2.nn.3.weight", "GIN_d.convs.2.nn.3.bias", "GIN_d.convs.3.nn.0.weight", "GIN_d.convs.3.nn.0.bias", "GIN_d.convs.3.nn.1.weight", "GIN_d.convs.3.nn.1.bias", 
"GIN_d.convs.3.nn.1.running_mean", "GIN_d.convs.3.nn.1.running_var", "GIN_d.convs.3.nn.1.num_batches_tracked", "GIN_d.convs.3.nn.3.weight", "GIN_d.convs.3.nn.3.bias".

Feriolet commented 3 months ago

It seems that this will be quite troublesome.

I have finally solved the seg fault error by using a different version of torch, installed with the following command:

conda install pytorch==1.11.0 pytorch-cuda=11.7 pytorch_scatter pytorch_sparse pytorch_cluster pytorch_spline_conv pyg -c pytorch -c nvidia -c pyg

The RuntimeError: Error(s) in loading state_dict error mentioned by @Samuel-gwb can be solved by including the strict=False option in the .load_state_dict() call in the query.py file. However, I am not sure whether this affects the result of the protonation. Note that I only allow "cpu" and block the "cuda" device, so that only the CPU is used to protonate the smi.

class QueryModel:
    def __init__(self):

        self.models = []

        for i in range(25):
            model_name, model_class = "GINPair", GINPairV1
            model = model_class(
                num_node_features, num_edge_features, hidden_channels=96
            )
            base_path = path.dirname(__file__)
            # if torch.cuda.is_available() == False:  # If only CPU is available
            checkpoint = torch.load(
                    f"{base_path}/trained_model_without_epik/best_model_{i}.pt",
                    map_location=torch.device("cpu"),
                )
            # else:
            #     checkpoint = torch.load(
            #         f"{base_path}/trained_model_without_epik/best_model_{i}.pt"
            #     )

            model.load_state_dict(checkpoint["model_state_dict"], strict=False)
            model.eval()
            model.to(device=DEVICE)
            self.models.append(model)

I have found that using Pool with pkasolver can cause a multiprocessing error with CUDA, since pkasolver selects the GPU or CPU via torch.device(). The error is RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method. This can be solved by calling torch.multiprocessing.set_start_method() to use the spawn method.

from torch import multiprocessing as mp
def protonate_pkasolver(input_fname: str, output_fname: str, ncpu: int = 1):
    mp.set_start_method('spawn', force=True)
    from pkasolver.query import QueryModel
    model = QueryModel()
    pool = mp.Pool(ncpu)
    with open(output_fname, 'wt') as f:
        for smi, name in pool.imap_unordered(__protonate_pkasolver,
                                             ((mol, mol_name, model) for (mol, mol_name) in read_input(input_fname))):
            f.write(f'{smi}\t{name}\n')

However, there are other errors raised after calling torch.multiprocessing.Pool, which may be due to memory allocation:

[W CudaIPCTypes.cpp:92] Producer process tried to deallocate over 1000 memory blocks referred by consumer processes. Deallocation might be significantly slowed down. We assume it will never going to be the case, but if it is, please file but to https://github.com/pytorch/pytorch
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

I am not sure how to deal with this.

Another funny thing is that the dimorphite_dl inside pkasolver does not seem to store the output of the protonated SMILES. When running run_dock, it gives the following output:

CCCCC
<rdkit.Chem.rdchem.Mol object at 0x7fab1e8c6a50>

For help, use: python dimorphite_dl.py --help

If you use Dimorphite-DL in your research, please cite:
Ropp PJ, Kaminsky JC, Yablonski S, Durrant JD (2019) Dimorphite-DL: An
open-source program for enumerating the ionization states of drug-like small
molecules. J Cheminform 11:14. doi:10.1186/s13321-019-0336-9.

CCCCCCC
<rdkit.Chem.rdchem.Mol object at 0x7f6d1d82fa50>

For help, use: python dimorphite_dl.py --help

If you use Dimorphite-DL in your research, please cite:
Ropp PJ, Kaminsky JC, Yablonski S, Durrant JD (2019) Dimorphite-DL: An
open-source program for enumerating the ionization states of drug-like small
molecules. J Cheminform 11:14. doi:10.1186/s13321-019-0336-9.

We may need to modify and fix these errors if we still want to pursue pkasolver as our alternative.

DrrDom commented 3 months ago

Finally, I found an old environment with pkasolver installed. Since pkasolver uses dimorphite as the backend for enumerating protonation states (and includes it within the package as is), it has the same issue with command-line arguments, where dimorphite catches all input arguments of the main program. So, the pkasolver package should be modified, either to use the external custom dimorphite package or by fixing this issue inside the pkasolver package. I'll try the first option and modify the pkasolver package. This will result in a custom pkasolver version, which should be installed from my repo.

It is strange that this error did not occur for you.

I do not have a GPU on my laptop, so I cannot test how it works with a GPU. Before, I did not have errors due to multiprocessing, but maybe just because I do not have a GPU.

I agree that switching to CPU mode for pkasolver is a good decision. We may further customize the pkasolver package to give the user this control, for example by adding an argument to QueryModel.__init__, as sketched below.
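
A possible sketch of such a change in pkasolver's query.py, based on the QueryModel code quoted earlier; this is an assumption about how it could look, not the actual pkasolver API (the surrounding imports are kept as in query.py):

class QueryModel:
    def __init__(self, device: str = "cpu"):
        # let the caller decide where the models live; default to CPU so no GPU is required
        self.device = torch.device(device)
        self.models = []
        for i in range(25):
            model = GINPairV1(num_node_features, num_edge_features, hidden_channels=96)
            checkpoint = torch.load(
                f"{path.dirname(__file__)}/trained_model_without_epik/best_model_{i}.pt",
                map_location=self.device,  # load the weights directly onto the requested device
            )
            model.load_state_dict(checkpoint["model_state_dict"])
            model.eval()
            model.to(self.device)
            self.models.append(model)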

Feriolet commented 3 months ago

Yes, I guess it can be done.

I have finally installed the CPU-only pytorch, and it has issues similar to the GPU version. Using torch 2.2.0 (the latest version) is sufficient, since there is no CUDA limitation: https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html

Apparently, the memory allocation issue does not occur on the CPU, which is a good thing.

I forgot to mention that there is also this issue, which is odd; maybe the pkasolver code does not catch the dimorphite output:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/miniforge3/envs/easydock/bin/run_dock", line 8, in <module>
    sys.exit(main())
  File "/home/user/miniforge3/envs/easydock/lib/python3.9/site-packages/easydock/run_dock.py", line 207, in main
    add_protonation(args.output, program=args.protonation, tautomerize=not args.no_tautomerization, ncpu=args.ncpu)
  File "/home/user/miniforge3/envs/easydock/lib/python3.9/site-packages/easydock/database.py", line 348, in add_protonation
    protonate_func(input_fname=tmp.name, output_fname=output)
  File "/home/user/miniforge3/envs/easydock/lib/python3.9/site-packages/easydock/protonation.py", line 107, in protonate_pkasolver
    for smi, name in pool.imap_unordered(__protonate_pkasolver,
  File "/home/user/miniforge3/envs/easydock/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
TypeError: '<' not supported between instances of 'Mol' and 'Mol'

Feriolet commented 3 months ago

It seems that running on the CPU also has a problem. I created a new environment with the CPU version of torch. [...]

RuntimeError: Error(s) in loading state_dict for GINPairV1:
    Missing key(s) in state_dict: "GIN_p.convs.0.nn.lins.0.weight", ...
    Unexpected key(s) in state_dict: "GIN_p.lin.weight", "GIN_p.convs.0.nn.0.weight", ...

Btw, I just realised that the only difference between these two sets of keys is the "lins" part. We probably should not immediately use strict=False, since it would then ignore most of the keys.

DrrDom commented 2 months ago

OK, I got why I had different issues: I had previously modified dimorphite within pkasolver to solve the issue below.

Originally, pkasolver runs dimorphite via subprocess. This implementation is not compatible with multiprocessing, because every molecule is processed separately and the result is saved to a file with the same name, test.pkl, for all molecules. Therefore, it can be overwritten or corrupted at any time. I tried to use the Python interface of dimorphite, but there is the issue with command-line arguments, which for some reason I did not observe previously.

Trying to import the necessary function from the custom Feriolet/dimorphite package does not work either. There is an argument return_as_list which is not available among the command-line args but is used by the authors of dimorphite.

Finally:

Current installation:
pip install git+https://github.com/Feriolet/dimorphite_dl.git (after the PR is accepted)
pip install git+https://github.com/DrrDom/pkasolver.git@noprints
pip install git+https://github.com/ci-lab-cz/easydock.git@pkasolver2

Now it works for me. However, there are still errors in structures.

Example:
input:  O=C(N1CCN(CC1)C(=O)C=2C=CC=C(C#CC3CC3)C2)C=4NN=C5CCCC45
output: O=C(c1cccc(C#CC2CC2)c1)N1CC[NH](C(=O)c2[nH]nc3c2CCC3)CC1

It adds a hydrogen to a nitrogen in the piperazine ring but does not add a charge. RDKit cannot parse this molecule and treats it as incorrect (a radical). Moreover, this protonation is definitely wrong, because this nitrogen cannot be protonated at pH 7.4. Such molecules are skipped from docking.
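
For illustration, this is roughly how such structures end up being skipped (a minimal sketch, not the actual easydock code):

from rdkit import Chem

protonated_smi = 'O=C(c1cccc(C#CC2CC2)c1)N1CC[NH](C(=O)c2[nH]nc3c2CCC3)CC1'
mol = Chem.MolFromSmiles(protonated_smi)
if mol is None:
    # sanitization fails here: the uncharged [NH] in the ring has an invalid valence,
    # so the molecule is dropped before docking
    print('skipping molecule with an invalid protonated form')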

DrrDom commented 2 months ago

Please check whether this implementation is compatible with computers with GPUs, or whether there are still issues that should be solved.