dwhswenson / contact_map

Contact map analysis for biomolecules; based on MDTraj
GNU Lesser General Public License v2.1

IndexManagers for atom slicing #87

Closed dwhswenson closed 4 years ago

dwhswenson commented 4 years ago

As I mentioned when atom slicing was first introduced, the extra code for atom slicing showed that there's a separation-of-concerns issue in the design -- we really need a separate class to manage indexing. In fact, I think the even better move is to pull all of the index caching into a separate class just for that. This also means moving `residue_ignore_atom_idxs` and `residue_query_atom_idxs` out of the main `ContactObject` class.

Overall, this PR removes a lot of things from the `ContactObject` API. However, all of these things are still available under `contact_object.indexer` (with a few minor changes to usage).
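For example, code that previously read these mappings off the contact object itself would now go through the indexer. A minimal sketch, assuming the attribute names carry over unchanged (the "minor changes to usage" may mean slightly different spellings):

```python
import mdtraj as md
from contact_map import ContactFrequency

traj = md.load("5550217/kras.xtc", top="5550217/kras.pdb")
contacts = ContactFrequency(traj[0])

# Before this PR, these mappings lived directly on the contact object:
#   contacts.residue_query_atom_idxs
#   contacts.residue_ignore_atom_idxs
# After this PR, the same information is reached through the indexer:
residue_query = contacts.indexer.residue_query_atom_idxs
residue_ignore = contacts.indexer.residue_ignore_atom_idxs
```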

**Changes to the public API** (as documented in, e.g., the `ContactFrequency` docs)

These aren't really user-facing in the same way that, e.g., `atom_contacts` is. They're more about the internals (although they can be useful for sanity checks). I think this makes for a more coherent and readable API. Similarly, I think it would be good to make `contact_map` private, but that's for another PR.

This does have some effect on storage: currently we store and reload some of these dicts. I think that can just be removed without causing problems, though. Everything in the indexer can be derived from things we will still store (`query`, `haystack`, and `all_atoms`).
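As a rough illustration of why nothing is lost, a residue-to-query-atom mapping like `residue_query_atom_idxs` could be rebuilt from the stored `query` and the topology. This is a sketch only, not the actual reload code; the real indexer may use different names and index conventions:

```python
from collections import defaultdict

import mdtraj as md

def rebuild_residue_query_atom_idxs(topology, query):
    """Map residue index -> set of query atom indices in that residue."""
    mapping = defaultdict(set)
    for atom_idx in query:
        res_idx = topology.atom(int(atom_idx)).residue.index
        mapping[res_idx].add(int(atom_idx))
    return dict(mapping)

topology = md.load("5550217/kras.pdb").topology
query = topology.select("protein and element != H")
residue_query_atom_idxs = rebuild_residue_query_atom_idxs(topology, query)
```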

Performance

I wanted to make sure this approach didn't introduce any significant performance regression. Here's a little script I wrote, based on the atom slice notebook:

`performance.py`:

```python
import timeit
import cProfile

import mdtraj as md

from contact_map import ContactFrequency, ContactTrajectory, version

traj = md.load("5550217/kras.xtc", top="5550217/kras.pdb")[:50]
topology = traj.topology
atoms = list(range(topology.n_atoms))
used_atoms = atoms[1:]
heavies = topology.select("protein and element != H")

print(f"     Current commit: {version.full_version}")

n_tries = 10
freq_test = "ContactFrequency(traj[0], query=used_atoms, haystack=used_atoms)"
traj_test = "ContactTrajectory(traj, query=heavies, haystack=heavies)"

ContactFrequency._class_use_atom_slice = True
timing = timeit.Timer(freq_test, globals=globals())
times = timing.repeat(number=n_tries)
t_sliced = min(times)
print(f" Freq: Atom slice enabled: {t_sliced / n_tries}")
cProfile.run(freq_test, 'with_atomslice.pstats')

ContactFrequency._class_use_atom_slice = False
timing = timeit.Timer(freq_test, globals=globals())
times = timing.repeat(number=n_tries)
t_unsliced = min(times)
print(f"Freq: Atom slice disabled: {t_unsliced / n_tries}")
cProfile.run(freq_test, 'no_atomslice.pstats')

ContactTrajectory._class_use_atom_slice = True
timing = timeit.Timer(traj_test, globals=globals())
times = timing.repeat(number=5)
t_unsliced = min(times)
print(f" Traj: Atom slice enabled: {t_unsliced / 5}")
cProfile.run(traj_test, 'traj.pstats')
```

And output from a few commits:

```
     Current commit: 0.6.0+g2c8d437  # release 0.6.0
 Freq: Atom slice enabled: 0.15540173510000005
Freq: Atom slice disabled: 0.12252338769999974
 Traj: Atom slice enabled: 1.7406494088000002

     Current commit: 0.6.1.dev0+g0f5201b  # this PR
 Freq: Atom slice enabled: 0.1577422553
Freq: Atom slice disabled: 0.12046521950000014
 Traj: Atom slice enabled: 1.675931647400001
```

For `ContactFrequency`, with or without atom slicing, the timing difference is within the error bars. There's a small improvement in `ContactTrajectory` (although it was a more significant problem in a Real Research™ project of mine, where this PR also massively reduces memory usage). This is because I added the ability to provide the indexer to the `from_contacts` constructor, which prevents it from being recreated for each frame. The overhead cost for atom slicing is entirely in the `convert_contacts` function, which is only called once per `ContactFrequency` (and I don't see any way to make it significantly faster).
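Conceptually, the saving is just setup work that no longer repeats per frame. The toy sketch below shows the pattern; these are not contact_map's real classes, and the exact `from_contacts` keyword in this PR may be spelled differently:

```python
class ToyIndexer:
    """Stand-in for the (relatively expensive) index caches."""
    def __init__(self, query, haystack):
        # imagine the residue/atom index maps being built here
        self.query = frozenset(query)
        self.haystack = frozenset(haystack)


class ToyFrameContacts:
    def __init__(self, contacts, indexer):
        self.contacts = contacts
        self.indexer = indexer

    @classmethod
    def from_contacts(cls, contacts, indexer=None, *, query=(), haystack=()):
        # reuse a precomputed indexer if one is given; only rebuild otherwise
        if indexer is None:
            indexer = ToyIndexer(query, haystack)
        return cls(contacts, indexer)


# build the indexer once and share it across frames
# (previously the equivalent setup happened once per frame)
shared = ToyIndexer(query=range(100), haystack=range(100))
frames = [ToyFrameContacts.from_contacts({}, indexer=shared) for _ in range(1000)]
```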

dwhswenson commented 4 years ago

Passes tests and maintains coverage. Ready for review.

sroet commented 4 years ago

~~Another memory-reducing step that might be possible later on (on top of what is implemented in the current version of the code): we know sliced indices are contiguous from 0 to `len(all_atoms)-1`, so you don't have to save the dict mapping both ways.~~

You could instead have `a = tuple(all_atoms)`, where `a[sliced_idx]` gives you the real index and `a.index(real_idx)` gives you the sliced one.

This should be done in a different PR, though, and is no reason to block this one.

Never mind: looking at the performance numbers, this should only be done one way (`.index()` on tuples is slow for big tuples).
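For reference, the one-direction version of that idea looks roughly like this:

```python
# real (topology) atom indices, in slice order
all_atoms = [3, 7, 12, 40]

# sliced -> real: a tuple is enough, since sliced indices are just 0..len-1
sliced_to_real = tuple(all_atoms)
assert sliced_to_real[2] == 12   # O(1) lookup

# real -> sliced: tuple.index() is O(n) per lookup, so for big systems this
# direction still wants a dict
real_to_sliced = {real: sliced for sliced, real in enumerate(all_atoms)}
assert real_to_sliced[12] == 2   # O(1) lookup
```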