sroet closed this pull request 3 years ago
To indicate why the loading is bad: it is also clearly visible from the task stream (purple tasks are the 101 `md.load(traj)[slice]` calls).

[task stream screenshot]
It is better now:

[task stream screenshot]

With the following task loading:

[task-loading snippet]
Looks pretty good so far. One thing to keep in mind is that the use case for this is really going to be multiple nodes -- for same-node parallelization, MDTraj already uses OpenMM, and I think we can eventually figure out how to get numba to handle the loops in `ContactObject._contact_map`. So per-frame, we'll have good parallelism in shared memory. But the loop over frames can be spread over nodes.
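(As an aside, a generic sketch of what jit-compiling such a per-frame loop with numba could look like; the function and toy array below are illustrative, not `ContactObject._contact_map` itself:)

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def count_contacts(xyz, cutoff):
    """Count atom pairs within `cutoff` for one frame, across threads."""
    n = xyz.shape[0]
    total = 0
    for i in prange(n):          # prange: numba spreads this loop over threads
        local = 0
        for j in range(i + 1, n):
            d2 = 0.0
            for k in range(3):
                diff = xyz[i, k] - xyz[j, k]
                d2 += diff * diff
            if d2 < cutoff * cutoff:
                local += 1
        total += local           # numba handles this reduction in parallel loops
    return total

frame = np.random.rand(100, 3)   # toy single-frame coordinates
print(count_contacts(frame, 0.45))
```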
Longer term we might need to think about how to avoid the requirement that the whole trajectory fits into memory (like with a `skip` and a single call to `mdtraj.iterload()`).
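(A minimal sketch of that chunked-loading idea with `mdtraj.iterload`; the file names are placeholders and the loop body stands in for the real per-chunk contact computation:)

```python
import mdtraj as md

# Iterate in chunks of 100 frames, skipping the first 50, so only one
# chunk ever needs to fit in memory at a time.
n_frames = 0
for chunk in md.iterload("traj.xtc", top="top.pdb", chunk=100, skip=50):
    n_frames += chunk.n_frames  # stand-in for per-chunk contact computation
print(n_frames)
```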
Agreed. One thing I was thinking about is that it also might be nice to allow multiple files. In practice, this is what I see people do with very large trajectories. For example, a DESRES trajectory I've been playing with is 100 files with 1000 frames each. A list of filenames instead of just a single filename would be reasonable input here.
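(A minimal sketch of the multiple-files idea, assuming all files share one topology; the file names are made up:)

```python
import mdtraj as md

# md.load accepts a list of file names and concatenates them in order
filenames = [f"run-{i:03d}.dcd" for i in range(100)]  # e.g. 100 files x 1000 frames
traj = md.load(filenames, top="system.pdb")
```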
> Looks pretty good so far. One thing to keep in mind is that the use case for this is really going to be multiple nodes -- for same-node parallelization, MDTraj already uses OpenMM, and I think we can eventually figure out how to get numba to handle the loops in `ContactObject._contact_map`. So per-frame, we'll have good parallelism in shared memory. But the loop over frames can be spread over nodes.
I know, but the reason I was hesitant about the first implementation was that the 3x slowdown was way more than the one for `DaskContactFrequency`. I am pretty happy with the current implementation, and it should not change too much until the WIP disappears.
> Agreed. One thing I was thinking about is that it also might be nice to allow multiple files. In practice, this is what I see people do with very large trajectories. For example, a DESRES trajectory I've been playing with is 100 files with 1000 frames each. A list of filenames instead of just a single filename would be reasonable input here.
That is a bit out-of-scope for this PR (in my opinion), but I will make an issue to track a smarter loading implementation.
@dwhswenson This is ready for review. (The first comment now gives an overview of the changes and additions in this PR.)
Merging #101 (9d56908) into master (65cf1de) will increase coverage by 0.01%. The diff coverage is 100.00%.
```diff
@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
+ Coverage   99.52%   99.53%   +0.01%
==========================================
  Files          13       13
  Lines        1043     1070      +27
==========================================
+ Hits         1038     1065      +27
  Misses          5        5
```
| Impacted Files | Coverage Δ |
|---|---|
| contact_map/__init__.py | 100.00% <100.00%> (ø) |
| contact_map/contact_trajectory.py | 100.00% <100.00%> (ø) |
| contact_map/dask_runner.py | 100.00% <100.00%> (ø) |
| contact_map/frequency_task.py | 100.00% <100.00%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 65cf1de...9d56908.
@dwhswenson , friendly monthly ping to make sure it is still on a list somewhere :)
@dwhswenson After #106 I am not too certain what to do with the notebook that is added here (`examples/dask_contact_trajectory.ipynb`): do we want to wrap that into `integrations.ipynb` (which seems to already reference functionality from this PR)?
Yeah, I think that makes sense. As that section gets longer, I'm thinking to split that notebook into `exporting_data.ipynb` and `performance.ipynb` -- performance might also eventually include numba integration, which may have a force-off switch like `atom_slice`.

(in other words, feel free to make that split)
> (in other words, feel free to make that split)
Will do. In the meantime, I have no clue why codecov suddenly thinks all the dask code is not hit (it claims to be covered locally...).
> I have no clue why codecov suddenly thinks all the dask code is not hit
It may be just slow to catch it. We only run optional integrations in the mdtraj-dev build, which takes longer to install. I had that happen on a PR -- eventually it caught up.
> It may be just slow to catch it. We only run optional integrations in the mdtraj-dev build, which takes longer to install. I had that happen on a PR -- eventually it caught up.
Nope, seems to be a similar issue as was already solved on openpathsampling (can't find the PR back, however):

> Issue detecting commit SHA. Please run actions/checkout with fetch-depth > 1 or set to 0
https://github.com/dwhswenson/contact_map/pull/101/commits/2a2dc38ce57e776a51511a438df392e7274e49b5 seems to fix that issue (and GA complaining about not knowing `auto-update-python`). I can cherry-pick that one over to its own PR, if you want.
Sure, please do cherry pick it over. That's a PR I can actually promise to review tonight!
> Sure, please do cherry pick it over. That's a PR I can actually promise to review tonight!
No rush. This is just a generic maintenance evening for me; it is just nice to have these PRs ready to go (again) for whenever you have some time.
> Yeah, I think that makes sense. As that section gets longer, I'm thinking to split that notebook into `exporting_data.ipynb` and `performance.ipynb` -- performance might also eventually include numba integration, which may have a force-off switch like `atom_slice`
I did split them out, added `DaskContactTrajectory`, and updated the doc building.

One thing that I don't like on my local doc build is that the sidebar behaves erratically (if you open one of the notebooks, it permanently hides "Exporting data to other tools" until you click on "Userguide/Examples" again (not the toggle, but the actual link)).
Alright, this should be mergeable again.
> One thing that I don't like on my local doc build is that the sidebar behaves erratically
After `make clean && make html`? The default `Makefile` is pretty conservative in what it changes (to keep build times fast). It regularly messes up the sidebar.
> After `make clean && make html`? The default `Makefile` is pretty conservative in what it changes (to keep build times fast). It regularly messes up the sidebar.
That solved the issue, thanks!
This PR includes:

- [non-public API break] `ContactTrajectory._build_contacts()` now returns one iterator of `n_frames` items with `(frame_n.atom_contacts, frame_n.residue_contacts)` for each frame, instead of `(list of atom_contacts, list of residue_contacts)` (this was necessary for the least amount of duplicate code between the dask implementation and `ContactTrajectory`, but can be reverted if required; see the sketch after this list).
- [feature] Added `DaskContactTrajectory` (the `dask` implementation of `ContactTrajectory`) and some related convenience functions.
- [misc] Add cluster shutdown to the `DaskContactFrequency` example.
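(A toy sketch of the new per-frame return shape described in the first bullet; `FrameContacts` and `build_contacts` are illustrative stand-ins, not the actual contact_map implementation:)

```python
from collections import namedtuple

# stand-in for whatever per-frame object contact_map produces
FrameContacts = namedtuple("FrameContacts", ["atom_contacts", "residue_contacts"])

def build_contacts(frames):
    # new shape: one (atom_contacts, residue_contacts) pair per frame,
    # yielded lazily, instead of two pre-built lists
    for frame in frames:
        yield frame.atom_contacts, frame.residue_contacts

frames = [FrameContacts({"a0-a1": 1}, {"r0-r1": 1}),
          FrameContacts({"a0-a2": 1}, {"r0-r2": 1})]

# recover the old (list of atom_contacts, list of residue_contacts) shape if needed
atom_lists, residue_lists = map(list, zip(*build_contacts(frames)))
```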
Original WIP comment below:

Things required:

- ~fix the loading (currently loads n+1 times for n slices)~
- ~`ContactTrajectory`~

~@dwhswenson There are two fast options on how to fix the loading issue:~

~1) load outside of `dask` and `scatter`~

~2) load the whole trajectory as a `pure` task in dask (which understands that this `load` task would always return the same data, so it does not repeat) and do the slicing as a separate task~

nvm, we solved this issue already for `DaskContactFrequency`: it tries to load in 1 chunk per worker.

Longer term we might need to think about how to avoid the requirement that the whole trajectory fits into memory (like with a `skip` and a single call to `mdtraj.iterload()`).
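(For illustration, a minimal sketch of option 2 above using `dask.delayed`; the file names and slicing scheme are made up, and this is not the code used in the PR:)

```python
import dask
from dask import delayed
import mdtraj as md

# pure=True marks the call as deterministic, so dask can deduplicate it
load = delayed(md.load, pure=True)
traj = load("traj.xtc", top="top.pdb")   # a single, shared load task

def take_slice(trajectory, sl):
    return trajectory[sl]

# the slicing runs as separate tasks that all depend on the one load task
slices = [slice(i, i + 10) for i in range(0, 100, 10)]
chunks = [delayed(take_slice)(traj, sl) for sl in slices]

results = dask.compute(*chunks)          # load executes once, slices fan out
```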