datamol-io / datamol

Molecular Processing Made Easy.
https://docs.datamol.io
Apache License 2.0
452 stars 47 forks

Lazy loading #181

Closed hadim closed 1 year ago

hadim commented 1 year ago

This is an attempt to dramatically reduce datamol import time. The approach is inspired by (if not completely copy/pasted from) openff-toolkit (https://github.com/openforcefield/openff-toolkit/blob/b52879569a0344878c40248ceb3bd0f90348076a/openff/toolkit/__init__.py#L44).

On my laptop, it reduces the import time from 950 ms to 5 ms. Subsequent calls to datamol functions do not seem slow (more tests needed here).

Lazy loading is only a first optimization to be done. Another one that could be tackled independently of this PR is to review and refactor the datamol submodules. For example, .mol depends on .convert and .convert also depends on .mol. Ideally, a sane tree hierarchy should be defined.
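For readers unfamiliar with the technique, here is a minimal, self-contained sketch of the PEP 562 module-level `__getattr__` pattern that the openff-toolkit code linked above relies on. The package name `fakepkg` and its submodule list are hypothetical stand-ins; in datamol the equivalent hook would live in `datamol/__init__.py` with its real submodules.

```python
import importlib
import sys
import types

# Hypothetical stand-ins: in a real package these would be the package's
# own submodules (e.g. "convert", "mol"), not stdlib modules.
_lazy_submodules = {"json", "csv"}

mod = types.ModuleType("fakepkg")

def _module_getattr(name):
    # PEP 562: Python calls a module-level __getattr__ when normal attribute
    # lookup on the module fails -- which is exactly the lazy-loading hook.
    if name in _lazy_submodules:
        submodule = importlib.import_module(name)  # real code: f"fakepkg.{name}"
        setattr(mod, name, submodule)  # cache it so __getattr__ isn't hit again
        return submodule
    raise AttributeError(f"module 'fakepkg' has no attribute {name!r}")

mod.__getattr__ = _module_getattr
sys.modules["fakepkg"] = mod

import fakepkg

print(fakepkg.json.dumps({"ok": True}))  # the submodule is imported only here
```

The top-level import becomes nearly free because no submodule body runs until the first attribute access.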

codecov[bot] commented 1 year ago

Codecov Report

Merging #181 (fc6587d) into main (624bab1) will increase coverage by 0.55%. The diff coverage is 75.00%.

@@            Coverage Diff             @@
##             main     #181      +/-   ##
==========================================
+ Coverage   91.02%   91.57%   +0.55%     
==========================================
  Files          48       46       -2     
  Lines        3801     3622     -179     
==========================================
- Hits         3460     3317     -143     
+ Misses        341      305      -36     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 91.57% <75.00%> (+0.55%) :arrow_up: |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| datamol/fragment/__init__.py | 100.00% <ø> (ø) |
| datamol/fragment/_assemble.py | 80.40% <50.00%> (+7.99%) :arrow_up: |
| datamol/convert.py | 93.08% <100.00%> (+0.07%) :arrow_up: |

... and 1 file with indirect coverage changes


hadim commented 1 year ago

Sorry to ping more people than usual but given that this PR might have unwanted side effects, I wanted to gather multiple opinions here.

Everything is supposed to be 100% backward compatible.

(please don't feel obligated to review here)

hadim commented 1 year ago

You can compare the loading time with:

# slow old mode
export DATAMOL_DISABLE_LAZY_LOADING=1
python scripts/datamol_duration.py

# lazy loading, new default mode
export DATAMOL_DISABLE_LAZY_LOADING=0
python scripts/datamol_duration.py
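The contents of scripts/datamol_duration.py are not shown in this thread; a minimal sketch of what such a measurement could look like follows, using a stdlib module as a stand-in so it runs anywhere. The helper name `time_import` is an assumption, not the script's actual API.

```python
import importlib
import sys
import time

def time_import(module_name: str) -> float:
    """Return wall-clock seconds spent importing `module_name` cold-ish."""
    # Dropping the cached entry is only an approximation of a cold start:
    # for honest numbers, run each measurement in a fresh interpreter.
    sys.modules.pop(module_name, None)
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

# Stand-in module; replace "json" with "datamol" to reproduce the comparison.
elapsed = time_import("json")
print(f"import took {elapsed * 1000:.2f} ms")
```

In practice `python -X importtime -c "import datamol"` also gives a per-module breakdown of import cost without writing any script.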
maclandrol commented 1 year ago

Was blazing fast at first try (-12s).

[screenshot of timing results]

Then I tried again:

[screenshot]

and again:

[screenshot]

Then I tested with some concrete scenarios:

For example this:

import datamol as dm; mol = dm.to_mol("C1CCC(O)CC1"); mol = dm.reorder_atoms(mol); dm.to_image(mol)
[screenshot]

I will investigate further (especially datamol functions in some dm.parallelized setups).

On another note, I also tried testing the effect on the plugin system for molfeat, with the different scenarios of using molfeat-padel. I got inconclusive results: 3s off in some settings, 2s up in others (~15s on average for initialization).

hadim commented 1 year ago

Thanks for testing it!

Indeed, I am not expecting this PR to fix cases like import datamol as dm; mol = dm.to_mol("C1CCC(O)CC1"); mol = dm.reorder_atoms(mol); dm.to_image(mol), because with those 3 functions alone, pretty much all the datamol modules get loaded (at least the ones that are time-consuming to load).

So, another effort that should be done in a separate PR is to review how all the datamol modules are organized, get rid of all the circular imports, and refactor the module layout to make it more granular and efficient to load (a concrete example would be to isolate dm.to_mol(), since it's widely used and potentially quite an independent function).

So in the end, this PR improves import performance "in appearance" only: as soon as your code starts calling a couple of functions, you may still have to pay the import overhead at some point.
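This deferred-cost behavior (cheap import, full price on first use) can be demonstrated with the stdlib's `importlib.util.LazyLoader`, which follows the same idea as this PR's mechanism, though it is not the implementation used here:

```python
import importlib.util
import sys

def lazy_import(name):
    """Import `name` lazily: the module body runs on first attribute access."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # registers lazy execution; body not run yet
    return module

statistics = lazy_import("statistics")  # cheap: module body not executed
print(statistics.mean([1, 2, 3]))       # first access triggers the real import
```

This is the recipe from the `importlib` documentation; the overhead does not disappear, it just moves from `import` time to the first attribute access.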

maclandrol commented 1 year ago

Overall it's still better compared to the old system, so I would say this is a win. I will run the additional tests in multiprocessing/threading a bit later today.

maclandrol commented 1 year ago

For the following:


import datamol as dm

def run_method(m):
    m = dm.to_mol(m)
    m = dm.keep_largest_fragment(m)
    mass = dm.descriptors.mw(m)
    img = dm.to_image(m)
    return mass

data = dm.freesolv().iloc[:100]["smiles"].values
out = dm.parallelized(run_method, data, n_jobs=-1, scheduler="processes") # or "threads"
|           | Lazy Loading Disabled | Lazy Loading Enabled |
| --------- | --------------------- | -------------------- |
| processes | 14.64±2.02s           | 9.45±0.65s           |
| threads   | 4.59±0.08s            | 4.18±0.03s           |

Seems pretty good IMO
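The larger gain on the "processes" row is expected: each worker process re-imports the library on startup, so the import cost is paid once per process, while threads share the parent's already-imported modules. A pure-stdlib sketch of the benchmark's shape (the `parallelized` helper below is a stand-in mimicking `dm.parallelized`'s `scheduler` argument, not datamol's implementation):

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def run_method(x: int) -> int:
    # Stand-in for the to_mol / keep_largest_fragment / to_image pipeline above.
    return x * x

def parallelized(fn, data, scheduler="processes", n_jobs=2):
    """Map `fn` over `data` with either a process or a thread pool."""
    pool_cls = ProcessPoolExecutor if scheduler == "processes" else ThreadPoolExecutor
    with pool_cls(max_workers=n_jobs) as pool:
        return list(pool.map(fn, data))

if __name__ == "__main__":
    # Threads shown here; "processes" additionally pays a per-worker
    # interpreter startup and import cost, which lazy loading reduces.
    print(parallelized(run_method, range(5), scheduler="threads"))
```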

FNTwin commented 1 year ago

I tested it on some MD parametrization routines and, as Manu said, this is indeed a win.

| Lazy Loading Disabled | Lazy Loading Enabled |
| --------------------- | -------------------- |
| 6.08±0.97s            | 4.78±0.05s           |

hadim commented 1 year ago

> Is this worth a major bump?

I don't think it deserves bumping to 1.x.x, but definitely yes to bumping to 0.10.x.

> Should this live as an opt-in feature for a while before making it the default?

It could be an option indeed. On my side, I prefer to simply enable it by default, so we can catch potential bugs faster (at the cost of possibly introducing unwanted bugs in users' workloads, for sure). But I am confident enough that it's a safe bet.

hadim commented 1 year ago

> I have no issues with the changes. I think it's a good thing to have in, and I would set it as the default behavior to make sure we catch problems faster.
>
> Maybe this is naive, but it does feel to me, though, that the problem is brought on by the pattern of making almost every module/class available at the root module level, effectively coupling everything in the library together. Doing it this way, while hiding the internal module structure, means we're losing one of the big advantages of Python as a dynamic language: loading only what is actually needed by the client program.
>
> So could we solve the problem instead by not doing that?
>
> I'm asking because this approach, by shifting the load latency to the function invocation rather than to import time, makes the timing of that function invocation more variable. I'm a big fan of the concept of constant work as a tool to increase reliability in software systems. In that approach, I'd prefer to pay the load latency up front, so that my function calls can be constant, rather than having to handle variable latency at invoke time.
>
> Anyway, this is probably good for now. :smile:

Thanks a lot Julien! As discussed earlier today, the goal of having all the functions available in a single import was to mimic the numpy and pandas API. Those two libraries do not suffer from such long import time, and it was clearly naive to think we could simply do the same for datamol. That being said, I think it's really an asset of datamol compared to rdkit to be able to browse and use almost 100% of the API with a single import.

So instead of reverting it back to more focused imports (but many of them), I feel it's worth trying to reduce and optimize the import time.

As said above, this PR is only a first step toward faster import times; the next step will be to rethink the internal datamol module structure, so that when you ask for something like dm.to_mol, only the required functions are loaded, without pulling in many different modules.

I don't think we'll ever get import times as fast as numpy and pandas, due to the nature of the lib and its dependencies, but I am quite sure we can improve even further upon this PR.
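For users who share the constant-work preference voiced above, one way to reconcile it with lazy loading is to pay the latency explicitly at startup by touching the attributes the program will need. The `warm_up` helper below is a hypothetical user-side pattern, not part of datamol's API; with the real library it would be called as e.g. `warm_up(datamol, ["to_mol", "to_image"])`.

```python
import importlib

def warm_up(module, attribute_names):
    """Force a lazily-loaded module to resolve the given attributes now."""
    for name in attribute_names:
        # getattr triggers the module's lazy-loading hook (e.g. a PEP 562
        # __getattr__), so the load cost is paid here, not mid-request.
        getattr(module, name)

# Stdlib stand-in for demonstration purposes.
json_mod = importlib.import_module("json")
warm_up(json_mod, ["dumps", "loads"])
print("warm-up complete")
```

This keeps fast imports for scripts that use little of the library, while long-running services can opt back into predictable, up-front loading.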

hadim commented 1 year ago

Thank you so much everyone for your inputs and testing this PR!

I will free the beast :-D