Closed hadim closed 1 year ago
Merging #181 (fc6587d) into main (624bab1) will increase coverage by 0.55%. The diff coverage is 75.00%.
```diff
@@            Coverage Diff             @@
##             main     #181      +/-   ##
==========================================
+ Coverage   91.02%   91.57%   +0.55%
==========================================
  Files          48       46       -2
  Lines        3801     3622     -179
==========================================
- Hits         3460     3317     -143
+ Misses        341      305      -36
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 91.57% <75.00%> (+0.55%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| datamol/fragment/__init__.py | 100.00% <ø> (ø) | |
| datamol/fragment/_assemble.py | 80.40% <50.00%> (+7.99%) | :arrow_up: |
| datamol/convert.py | 93.08% <100.00%> (+0.07%) | :arrow_up: |

... and 1 file with indirect coverage changes
Sorry to ping more people than usual but given that this PR might have unwanted side effects, I wanted to gather multiple opinions here.
Everything is supposed to be 100% backward compatible.
(Please don't feel obligated to review here.)
You can compare the loading time with:
```bash
# slow old mode
export DATAMOL_DISABLE_LAZY_LOADING=1
python scripts/datamol_duration.py

# lazy loading, new default mode
export DATAMOL_DISABLE_LAZY_LOADING=0
python scripts/datamol_duration.py
```
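The contents of `scripts/datamol_duration.py` aren't shown in this thread, but a minimal stand-in might look like the sketch below: time a cold import in a fresh interpreter so already-cached modules don't skew the measurement. `json` is used as a placeholder module here so the snippet runs anywhere; with datamol installed you would pass `"datamol"` instead.

```python
# Hypothetical stand-in for scripts/datamol_duration.py (actual contents
# not shown in this thread). Times a cold import in a fresh interpreter
# so already-cached modules don't skew the result.
import subprocess
import sys
import time

def cold_import_seconds(module_name):
    """Return wall-clock seconds to import `module_name` in a fresh process."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module_name}"], check=True)
    return time.perf_counter() - start

if __name__ == "__main__":
    # Replace "json" with "datamol" to reproduce the numbers in this thread.
    print(f"import took {cold_import_seconds('json') * 1000:.1f} ms")
```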
Was blazing fast at first try (-12s). Then I tried again, and again.
Then I tested with some concrete scenarios. For example this:

```python
import datamol as dm

mol = dm.to_mol("C1CCC(O)CC1")
mol = dm.reorder_atoms(mol)
dm.to_image(mol)
```
I will investigate further (especially `datamol` functions in some `dm.parallelized` setups).
On another note, I also tried testing the effect on the plugin system for molfeat with the different scenarios of using molfeat-padel. I get inconclusive results: 3s off in some settings, 2s up in others (~15s on average for initialization).
Thanks for testing it!
Indeed, I am not expecting this PR to fix things like `import datamol as dm; mol = dm.to_mol("C1CCC(O)CC1"); mol = dm.reorder_atoms(mol); dm.to_image(mol)`, because with those 3 functions alone, pretty much all the datamol modules are loaded (at least the ones that are time-consuming to load).
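A quick way to check how much a single import actually pulls in is to diff `sys.modules` around it. This is a generic sketch, with stdlib `ftplib` standing in for datamol (which isn't assumed to be available here):

```python
# Sketch: see which modules a given import actually pulls in by diffing
# sys.modules. "ftplib" stands in for datamol here.
import importlib
import sys

before = set(sys.modules)
importlib.import_module("ftplib")  # with datamol: importlib.import_module("datamol")
newly_loaded = sorted(set(sys.modules) - before)
print(f"{len(newly_loaded)} new modules loaded")
```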
So another effort, which should be done in a separate PR, is to review how all the datamol modules are organized, get rid of all the circular imports, and refactor the module layout to make it more granular and efficient to load (a concrete example would be to isolate `dm.to_mol()`, since it's widely used and potentially quite an independent function).
So in the end this PR improves import performance "in appearance", but as soon as your code starts importing a couple of functions, you may still have to pay the import overhead at some point.
Overall it's still better compared to the old system, so I would say this is a win. I will run the additional tests in multiprocessing/threading a bit later today.
For the following:

```python
import datamol as dm

def run_method(m):
    m = dm.to_mol(m)
    m = dm.keep_largest_fragment(m)
    mass = dm.descriptors.mw(m)
    img = dm.to_image(m)
    return mass

data = dm.freesolv().iloc[:100]["smiles"].values
out = dm.parallelized(run_method, data, n_jobs=-1, scheduler="processes")  # or "threads"
```
| | Lazy Loading Disabled | Lazy Loading Enabled |
|---|---|---|
| processes | 14.64±2.02s | 9.45±0.65s |
| threads | 4.59±0.08s | 4.18±0.03s |
Seems pretty good IMO
I tested it on some MD parametrization routines and, as Manu said, this is indeed a win.
| Lazy Loading Disabled | Lazy Loading Enabled |
|---|---|
| 6.08±0.97s | 4.78±0.05s |
Is this worth a major bump?
I don't think it deserves bumping to `1.x.x`, but definitely yes to bumping to `0.10.x`.
Should this live as an opt-in feature for a while before making it the default?
It could be an option indeed. On my side, I prefer to simply enable it by default, so we can surface potential bugs faster (at the cost of potentially introducing unwanted bugs in users' workloads, for sure). But I am confident enough that it's a safe bet.
I have no issues with the changes. I think it's a good thing to have in, and I would set it as the default behavior to make sure we catch problems faster.
Maybe this is naive, but it does feel to me though, that the problem is brought upon by the pattern of making almost every module/class available at the root module level, effectively coupling everything in the library together. Doing it this way, while hiding the internal module structure, means we're losing one of the big advantages of Python as a dynamic language: load only what is actually needed by the client program.
So could we solve the problem instead by not doing that?
I'm asking because this approach, by shifting the load latency from import time to the function invocation, makes the timing of that function invocation more variable. I'm a big fan of the concept of constant work as a tool to increase reliability in software systems. In that approach, I'd prefer to pay the load latency up front, so that my function calls can be constant, rather than having to handle variable latency at invoke time.
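For clients that want this constant-work behavior even with lazy loading enabled, one pragmatic option is to pay the load cost once at startup by touching the attributes that will be used later. The `warmup` helper below is hypothetical (not part of datamol); it simply forces each lazy attribute to resolve:

```python
import importlib

def warmup(module_name, attr_names):
    """Eagerly resolve lazily-loaded attributes so later calls have
    constant latency. Hypothetical helper, not part of datamol."""
    mod = importlib.import_module(module_name)
    for name in attr_names:
        getattr(mod, name)  # triggers the module-level __getattr__ if lazy
    return mod

# Demo with a stdlib module; with datamol installed this would be e.g.
# warmup("datamol", ["to_mol", "to_image", "reorder_atoms"])
math_mod = warmup("math", ["sqrt", "pi"])
```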
Anyway, this is probably good for now. :smile:
Thanks a lot Julien! As discussed earlier today, the goal of having all the functions available in a single import was to mimic the numpy and pandas API. Those two libraries do not suffer from such long import time, and it was clearly naive to think we could simply do the same for datamol. That being said, I think it's really an asset of datamol compared to rdkit to be able to browse and use almost 100% of the API with a single import.
So instead of reverting it back to more focused imports (but many of them), I feel it's worth trying to reduce and optimize the import time.
As said above, this PR is only a first step toward faster import time; the following step will be to rethink the internal datamol module structure, so when you ask for something like `dm.to_mol`, only the required functions are loaded, without pulling in many different modules.
I don't think we'll ever get import times as fast as numpy's and pandas' due to the nature of the lib and its dependencies, but I am quite sure we can improve even more upon this PR.
Thank you so much everyone for your inputs and testing this PR!
I will free the beast :-D
Checklist:

- Add a news entry (copy `news/TEMPLATE.rst` to `news/my-feature-or-branch.rst`) and edit it.

This is an attempt to dramatically reduce datamol import time. The approach has been inspired (if not completely copy/pasted) from openff-toolkit (https://github.com/openforcefield/openff-toolkit/blob/b52879569a0344878c40248ceb3bd0f90348076a/openff/toolkit/__init__.py#L44).
On my laptop, it reduces the import time from 950 ms to 5 ms. Subsequent calls to datamol functions do not seem slow (more tests are needed here).
Lazy loading is only a first optimization. Another one that could be tackled independently of this PR is to review and refactor the datamol submodules. For example, `.mol` depends on `.convert` and `.convert` also depends on `.mol`. Ideally, a sane tree hierarchy should be defined.
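One standard way to break such a cycle is to defer one side's import to call time. The demo below is self-contained: it synthesizes two stand-in modules (`pkg.mol` and `pkg.convert`, mirroring but not using datamol's actual `.mol` and `.convert`) to show the pattern:

```python
# Self-contained demo: breaking a circular dependency by moving one
# import inside a function. "pkg.mol" and "pkg.convert" are stand-ins
# for datamol's .mol and .convert modules.
import sys
import types

pkg = types.ModuleType("pkg")
pkg.__path__ = []  # mark it as a package
sys.modules["pkg"] = pkg

convert_src = """
def to_smiles(mol):
    # Deferred import: by the time this runs, pkg.mol is fully loaded,
    # so the module-level cycle never happens.
    from pkg.mol import make_mol
    return mol["smiles"]
"""

mol_src = """
from pkg.convert import to_smiles  # safe: convert has no top-level import of mol

def make_mol(smiles):
    return {"smiles": smiles}
"""

# Load convert first (it has no top-level dependencies), then mol.
for name, src in [("pkg.convert", convert_src), ("pkg.mol", mol_src)]:
    mod = types.ModuleType(name)
    sys.modules[name] = mod
    exec(src, mod.__dict__)

from pkg.mol import make_mol
from pkg.convert import to_smiles
print(to_smiles(make_mol("CCO")))  # → CCO
```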