Closed dwhswenson closed 4 years ago
Passes tests and maintains coverage. Ready for review.
~~Another memory reducing step that might be possible later on (which is implemented in the current version of the code):
We know sliced indices are continuous from 0
to len(all_atoms)-1
so you don't have to save the dict mapping both ways.~~
You can instead have a a = tuple(all_atoms)
where using a[sliced_idx]
gives you the real one and a.index(real_idx)
gives you the sliced one.
This should be done in a different PR though, and is no reason to block this one
Nevermind, looking at the performance numbers this should only be done one way (index
on tuples
is slow for big tuples
)
As I mentioned when atom slicing was first introduced, the extra code for atom slicing showed that there's a separation of concerns issue in the design -- we really need a separate class to manage indexing. In fact, I think the even better thing is to move all the caching of indices into a separate class just for that. This means also moving the
residue_ignore_atom_idxs
andresidue_query_atom_idxs
out of the mainContactObject
class, too.Overall, this PR removes a lot of things from the
ContactObject
API. However, all of these things will still be available undercontact_object.indexer
(with a few minor changes to usage):Changes to the public API (as documented in, e.g., the
ContactFrequency
docs)idx_to_s_idx()
: Now a dict calledindexer.sliced_idx
s_idx_to_idx()
: Now a dict calledindexer.real_idx
convert_atom_contacts()
=>indexer.convert_atom_contacts()
slice_trajectory()
=>indexer.slice_trajectory()
residue_query_atom_idxs
=>indexer.residue_query_atom_idxs
residue_ignore_atom_idxs
=>_residue_ignore_atom_idxs
(in theContactObject
, but private API)These aren't really user-facing in the same way that, e.g.,
atom_contacts
is. They're more about the internals (although they can be useful for sanity checks). I think this makes a more coherent and readable API. Similarly, I think it would be good to movecontact_map
private, but that's for another PR.This does have some effect on storage: currently we store and reload some of these dicts. I think that can just be removed without problem, though. Everything in the indexer can be derived from things we will still store (
query
,haystack
, andall_atoms
).Performance
I wanted to ensure that there wasn't any significant performance issue introduced by this approach. Here's a little script I wrote, based on the atom slice notebook:
performance.py
```python import timeit import cProfile import mdtraj as md from contact_map import ContactFrequency, ContactTrajectory, version traj = md.load("5550217/kras.xtc", top="5550217/kras.pdb")[:50] topology = traj.topology atoms = list(range(topology.n_atoms)) used_atoms = atoms[1:] heavies = topology.select("protein and element != H") print(f" Current commit: {version.full_version}") n_tries = 10 freq_test = "ContactFrequency(traj[0], query=used_atoms, haystack=used_atoms)" traj_test = "ContactTrajectory(traj, query=heavies, haystack=heavies)" ContactFrequency._class_use_atom_slice = True timing = timeit.Timer(freq_test, globals=globals()) times = timing.repeat(number=n_tries) t_sliced = min(times) print(f" Freq: Atom slice enabled: {t_sliced / n_tries}") cProfile.run(freq_test, 'with_atomslice.pstats') ContactFrequency._class_use_atom_slice = False timing = timeit.Timer(freq_test, globals=globals()) times = timing.repeat(number=n_tries) t_unsliced = min(times) print(f"Freq: Atom slice disabled: {t_unsliced / n_tries}") cProfile.run(freq_test, 'no_atomslice.pstats') ContactTrajectory._class_use_atom_slice = True timing = timeit.Timer(traj_test, globals=globals()) times = timing.repeat(number=5) t_unsliced = min(times) print(f" Traj: Atom slice enabled: {t_unsliced / 5}") cProfile.run(traj_test, 'traj.pstats') ```And output from a few commits:
For
ContactFrequency
, with or without atom slicing, that timing difference is within the errorbars. There's a small improvement inContactTrajectory
(although it was a more significant problem in a Real Research™ project of mine, where this PR also massively reduces memory usage). This is because I added the ability to provide the indexer to thefrom_contacts
constructor, which prevents it from being recreated for each frame. The overhead cost for atom slicing is entirely in theconvert_contacts
function, which is only called once perContactFrequency
(and I don't see any way to make it significantly faster).