dwhswenson / contact_map

Contact map analysis for biomolecules; based on MDTraj
GNU Lesser General Public License v2.1

DaskContactTrajectory #101

Closed · sroet closed this 3 years ago

sroet commented 4 years ago

This PR includes:

Original WIP comment below:

Things required:

~@dwhswenson There are two fast options on how to fix the loading issue:~ ~1) load outside of dask and scatter~ ~2) Load the whole trajectory as a pure task in dask (which understands that this load task would always return the same data, so it does not repeat) and do the slicing as a separate task~

nvm, we already solved this issue for DaskContactFrequency: it tries to load 1 chunk per worker.

Longer term we might need to think about how to avoid the requirement that the whole trajectory fits into memory (e.g. using the skip argument in a single call to mdtraj.iterload()).
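The "1 chunk per worker" idea can be sketched as follows (a hypothetical helper, not code from this PR; it only computes the frame slices, not the actual mdtraj loading):

```python
def chunk_slices(n_frames, n_workers):
    """Split n_frames into one contiguous slice per worker.

    Hypothetical illustration of the "1 chunk per worker" loading:
    each worker would load and process exactly one of these slices.
    """
    base, extra = divmod(n_frames, n_workers)
    slices = []
    start = 0
    for i in range(n_workers):
        # the first `extra` workers take one leftover frame each
        size = base + (1 if i < extra else 0)
        slices.append(slice(start, start + size))
        start += size
    return slices
```

Each worker would then do something like `md.load(traj)[chunk]` for its own slice, which is one load task per worker instead of one per frame.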

sroet commented 4 years ago

To indicate why the loading is bad: [screenshot: Dask dashboard, 2020-11-04 22-38-24]

It is also clearly visible from the task stream (purple are 101 md.load(traj)[slice] calls)

[screenshot: Dask task stream, 2020-11-04 22-40-22]

sroet commented 4 years ago

It is better now: [screenshot: Dask dashboard, 2020-11-11 23-27-38]

With the following task stream for loading: [screenshot: Dask task stream, 2020-11-11 23-29-32]

dwhswenson commented 4 years ago

Looks pretty good so far. One thing to keep in mind is that the use case for this is really going to be multiple nodes -- for same-node parallelization, MDTraj already uses OpenMM, and I think we can eventually figure out how to get numba to handle the loops in ContactObject._contact_map. So per-frame, we'll have good parallelism in shared memory. But the loop over frames can be spread over nodes.
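The per-frame loop mentioned above might look roughly like this (a toy illustration, not the actual ContactObject._contact_map implementation; the function name and plain squared-distance criterion are simplifying assumptions):

```python
import itertools

def toy_contact_pairs(coords, cutoff):
    """Return pairs of atom indices closer than `cutoff` in one frame.

    `coords` is a sequence of (x, y, z) tuples. This plain double loop
    is the kind of code numba's @njit could compile; the real logic in
    ContactObject._contact_map differs in detail.
    """
    pairs = []
    for i, j in itertools.combinations(range(len(coords)), 2):
        # squared distance avoids the sqrt inside the hot loop
        d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
        if d2 <= cutoff ** 2:
            pairs.append((i, j))
    return pairs
```

Since each frame's loop is independent, frames can be farmed out across nodes while a jit-compiled (or OpenMP-backed) kernel handles the inner loop on each node.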

> Longer term we might need to think about how to avoid the requirement that the whole trajectory fits into memory (e.g. using the skip argument in a single call to mdtraj.iterload()).

Agreed. One thing I was thinking about is that it also might be nice to allow multiple files. In practice, this is what I see people do with very large trajectories. For example, a DESRES trajectory I've been playing with is 100 files with 1000 frames each. A list of filenames instead of just a single filename would be reasonable input here.
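Accepting either a single filename or a list could be normalized with something like this (a hypothetical sketch; the PR currently takes a single filename):

```python
def as_filename_list(trajectories):
    """Normalize a single filename or an iterable of filenames to a list.

    Hypothetical helper: downstream code can then always loop over files,
    e.g. dispatching one load task per file across the cluster.
    """
    if isinstance(trajectories, str):
        return [trajectories]
    return list(trajectories)
```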

sroet commented 4 years ago

> Looks pretty good so far. One thing to keep in mind is that the use case for this is really going to be multiple nodes -- for same-node parallelization, MDTraj already uses OpenMM, and I think we can eventually figure out how to get numba to handle the loops in ContactObject._contact_map. So per-frame, we'll have good parallelism in shared memory. But the loop over frames can be spread over nodes.

I know, but the reason I was hesitant about the first implementation was that its 3x slowdown was way more than the one for DaskContactFrequency. I am pretty happy with the current implementation, and it should not change too much before the WIP label disappears.

> Agreed. One thing I was thinking about is that it also might be nice to allow multiple files. In practice, this is what I see people do with very large trajectories. For example, a DESRES trajectory I've been playing with is 100 files with 1000 frames each. A list of filenames instead of just a single filename would be reasonable input here.

That is a bit out-of-scope for this PR (in my opinion), but I will make an issue to track a smarter loading implementation.

sroet commented 4 years ago

@dwhswenson This is ready for a review

sroet commented 4 years ago

(The first comment now gives an overview of changes and additions in this PR)

codecov[bot] commented 3 years ago

Codecov Report

Merging #101 (9d56908) into master (65cf1de) will increase coverage by 0.01%. The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
+ Coverage   99.52%   99.53%   +0.01%     
==========================================
  Files          13       13              
  Lines        1043     1070      +27     
==========================================
+ Hits         1038     1065      +27     
  Misses          5        5              
| Impacted Files | Coverage Δ |
|---|---|
| contact_map/__init__.py | 100.00% <100.00%> (ø) |
| contact_map/contact_trajectory.py | 100.00% <100.00%> (ø) |
| contact_map/dask_runner.py | 100.00% <100.00%> (ø) |
| contact_map/frequency_task.py | 100.00% <100.00%> (ø) |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 65cf1de...9d56908.

sroet commented 3 years ago

@dwhswenson , friendly monthly ping to make sure it is still on a list somewhere :)

sroet commented 3 years ago

@dwhswenson After #106 I am not too certain what to do with the notebook that is added here (examples/dask_contact_trajectory.ipynb). Do we want to fold that into integrations.ipynb (which seems to already reference functionality from this PR)?

dwhswenson commented 3 years ago

Yeah, I think that makes sense. As that section gets longer, I'm thinking to split that notebook into exporting_data.ipynb and performance.ipynb -- performance might also eventually include numba integration, which may have a force-off switch like atom_slice

dwhswenson commented 3 years ago

(in other words, feel free to make that split)

sroet commented 3 years ago

> (in other words, feel free to make that split)

Will do. In the meantime, I have no clue why codecov suddenly thinks all the dask code is not hit (it claims to be covered locally...)

dwhswenson commented 3 years ago

> I have no clue why codecov suddenly thinks all the dask code is not hit

It may be just slow to catch it. We only run optional integrations in the mdtraj-dev build, which takes longer to install. I had that happen on a PR -- eventually it caught up

sroet commented 3 years ago

> It may be just slow to catch it. We only run optional integrations in the mdtraj-dev build, which takes longer to install. I had that happen on a PR -- eventually it caught up

Nope, it seems to be a similar issue to one already solved on openpathsampling (I can't find that PR anymore, though):

> Issue detecting commit SHA. Please run actions/checkout with fetch-depth > 1 or set to 0,

https://github.com/dwhswenson/contact_map/pull/101/commits/2a2dc38ce57e776a51511a438df392e7274e49b5 seems to fix that issue (and GA complaining about not knowing auto-update-python). I can cherry-pick that one over to its own PR, if you want.
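The fix follows Codecov's own suggestion: give actions/checkout enough git history to resolve the commit SHA. Roughly (an illustrative snippet; the exact workflow file and action version in the linked commit may differ):

```yaml
# illustrative GitHub Actions step, not the repo's exact workflow
- uses: actions/checkout@v2
  with:
    fetch-depth: 0  # full history so Codecov can detect the commit SHA
```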

dwhswenson commented 3 years ago

Sure, please do cherry pick it over. That's a PR I can actually promise to review tonight!

sroet commented 3 years ago

> Sure, please do cherry pick it over. That's a PR I can actually promise to review tonight!

No rush. This is just a generic maintenance evening for me; it is just nice to have these PRs ready to go (again) for whenever you have some time.

sroet commented 3 years ago

> Yeah, I think that makes sense. As that section gets longer, I'm thinking to split that notebook into exporting_data.ipynb and performance.ipynb -- performance might also eventually include numba integration, which may have a force-off switch like atom_slice

I did split them out, added DaskContactTrajectory and updated the doc building.

One thing that I don't like about my local doc build is that the sidebar behaves erratically: if you open one of the notebooks, it permanently hides "Exporting data to other tools" until you click on "Userguide/Examples" again (not the toggle, but the actual link).

sroet commented 3 years ago

Alright, this should be mergeable again

dwhswenson commented 3 years ago

> One thing that I don't like on my local doc build is that the sidebar behaves erratically

After make clean && make html? The default Makefile is pretty conservative in what it changes (to keep build times fast). Regularly messes up the sidebar.

sroet commented 3 years ago

> After make clean && make html? The default Makefile is pretty conservative in what it changes (to keep build times fast). Regularly messes up the sidebar.

That solved the issue, thanks!