ci-lab-cz / easydock

BSD 3-Clause "New" or "Revised" License

Database initialization takes too much time due to stereoisomer enumeration #38

Open DrrDom opened 4 months ago

DrrDom commented 4 months ago

For the --init process, yes, I noticed a long time ago that compounds are initialized very slowly, because some molecules take a long time during isomer generation. To speed this up, I multiply ncpu by the cpu value from the config.yml used for docking (I hardcoded this since I did not want to add another argument to --init_db at that time). If I remember correctly, this speeds up the process roughly linearly: it takes around 3 hours to initialize ~600k compounds, including isomers, with 150 CPUs.

Originally posted by @Feriolet in https://github.com/ci-lab-cz/easydock/issues/35#issuecomment-2122012253

DrrDom commented 4 months ago

Currently, the init_db function takes an ncpu argument, which comes from the command-line argument ncpu. The issue is that this command-line argument has a different meaning depending on whether docking is launched on a single server or with dask on multiple servers. In dask mode, it is the number of CPUs used for all processing other than docking. When docking on a single server, it is additionally the number of molecules docked in parallel.

The obvious solution is to set ncpu in all functions other than docking to the number of available CPUs (e.g. multiprocessing.cpu_count()). The user would then lose control over those parts of the program, and only control over docking would remain. I am not sure this is the best solution, but I do not see another option currently.
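A minimal sketch of that idea, not the easydock implementation: the helper name `resolve_ncpu` is hypothetical, and it assumes that preprocessing steps would fall back to all available cores unless a value is passed explicitly.

```python
import multiprocessing


def resolve_ncpu(requested=None):
    """Return the worker count for non-docking steps (hypothetical helper).

    If no value is requested, use every CPU reported by the OS.
    Otherwise clamp the request to the range [1, available CPUs].
    """
    available = multiprocessing.cpu_count()
    if requested is None:
        return available
    return max(1, min(requested, available))
```

Keeping an optional `requested` value is a possible compromise: the default uses all cores, but a caller on a shared machine could still cap the number of workers.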

DrrDom commented 4 months ago

Another slowdown is caused by the post-processing of molecules after protonation (in add_protonation), which is not parallelized. If molecules were submitted as 3D structures, there is an additional, time-consuming step of assigning correct bond orders. This can also be addressed in the context of this issue. I have a draft implementation to solve this, but I have not tested it yet.
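For illustration only, the general pattern for parallelizing such a per-molecule step could look like the sketch below. This is not the draft mentioned above: `fix_bond_orders` is a hypothetical stand-in that only strips whitespace, and `postprocess_parallel` is an assumed name; the real step would operate on molecule objects.

```python
from multiprocessing import Pool


def fix_bond_orders(mol_block):
    """Hypothetical stand-in for the per-molecule bond-order assignment."""
    # Placeholder work: the real step would parse the 3D structure and
    # assign bond orders; here we only strip surrounding whitespace.
    return mol_block.strip()


def postprocess_parallel(mol_blocks, ncpu=1, chunksize=100):
    """Apply fix_bond_orders to every item, preserving input order."""
    if ncpu <= 1:
        # Serial fallback avoids process start-up cost for small inputs.
        return [fix_bond_orders(m) for m in mol_blocks]
    with Pool(ncpu) as pool:
        # imap yields results in input order; chunksize batches items
        # to reduce inter-process communication overhead.
        return list(pool.imap(fix_bond_orders, mol_blocks, chunksize=chunksize))
```

A call such as `postprocess_parallel(blocks, ncpu=8)` would distribute the items over 8 worker processes. On platforms that start workers with spawn, the call should sit under an `if __name__ == "__main__":` guard.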