dwhswenson / contact_map

Contact map analysis for biomolecules; based on MDTraj
GNU Lesser General Public License v2.1

IndexManagers for atom slicing #87

Closed dwhswenson closed 4 years ago

dwhswenson commented 4 years ago

As I mentioned when atom slicing was first introduced, the extra code for atom slicing showed that there's a separation-of-concerns issue in the design -- we really need a separate class to manage indexing. In fact, I think the even better move is to pull all of the index caching into a separate class just for that. This also means moving `residue_ignore_atom_idxs` and `residue_query_atom_idxs` out of the main `ContactObject` class.

Overall, this PR removes a lot of things from the `ContactObject` API. However, all of these things are still available under `contact_object.indexer` (with a few minor changes to usage).
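For example, code that previously read these mappings off the contact object itself would now go through the indexer. A minimal sketch, assuming the attribute names carry over unchanged (the "minor changes to usage" may mean slightly different spellings):

```python
import mdtraj as md
from contact_map import ContactFrequency

traj = md.load("5550217/kras.xtc", top="5550217/kras.pdb")
contacts = ContactFrequency(traj[0])

# Before this PR, these mappings lived directly on the contact object:
#   contacts.residue_query_atom_idxs
#   contacts.residue_ignore_atom_idxs
# After this PR, the same information is reached through the indexer:
residue_query = contacts.indexer.residue_query_atom_idxs
residue_ignore = contacts.indexer.residue_ignore_atom_idxs
```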

**Changes to the public API** (as documented in, e.g., the `ContactFrequency` docs)

These aren't really user-facing in the same way that, e.g., `atom_contacts` is. They're more about the internals (although they can be useful for sanity checks). I think this makes for a more coherent and readable API. Similarly, I think it would be good to make `contact_map` private, but that's for another PR.

This does have some effect on storage: currently we store and reload some of these dicts. I think that can just be removed without causing problems, though. Everything in the indexer can be derived from things we will still store (`query`, `haystack`, and `all_atoms`).
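As a rough illustration of why nothing is lost, a residue-to-query-atom mapping like `residue_query_atom_idxs` could be rebuilt from the stored `query` and the topology. This is a sketch only, not the actual reload code; the real indexer may use different names and index conventions:

```python
from collections import defaultdict

import mdtraj as md

def rebuild_residue_query_atom_idxs(topology, query):
    """Map residue index -> set of query atom indices in that residue."""
    mapping = defaultdict(set)
    for atom_idx in query:
        res_idx = topology.atom(int(atom_idx)).residue.index
        mapping[res_idx].add(int(atom_idx))
    return dict(mapping)

topology = md.load("5550217/kras.pdb").topology
query = topology.select("protein and element != H")
residue_query_atom_idxs = rebuild_residue_query_atom_idxs(topology, query)
```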

Performance

I wanted to make sure this approach didn't introduce any significant performance regression. Here's a little script I wrote, based on the atom slice notebook:

`performance.py`:

```python
import timeit
import cProfile

import mdtraj as md

from contact_map import ContactFrequency, ContactTrajectory, version

traj = md.load("5550217/kras.xtc", top="5550217/kras.pdb")[:50]
topology = traj.topology
atoms = list(range(topology.n_atoms))
used_atoms = atoms[1:]
heavies = topology.select("protein and element != H")

print(f"     Current commit: {version.full_version}")

n_tries = 10
freq_test = "ContactFrequency(traj[0], query=used_atoms, haystack=used_atoms)"
traj_test = "ContactTrajectory(traj, query=heavies, haystack=heavies)"

ContactFrequency._class_use_atom_slice = True
timing = timeit.Timer(freq_test, globals=globals())
times = timing.repeat(number=n_tries)
t_sliced = min(times)
print(f" Freq: Atom slice enabled: {t_sliced / n_tries}")
cProfile.run(freq_test, 'with_atomslice.pstats')

ContactFrequency._class_use_atom_slice = False
timing = timeit.Timer(freq_test, globals=globals())
times = timing.repeat(number=n_tries)
t_unsliced = min(times)
print(f"Freq: Atom slice disabled: {t_unsliced / n_tries}")
cProfile.run(freq_test, 'no_atomslice.pstats')

ContactTrajectory._class_use_atom_slice = True
timing = timeit.Timer(traj_test, globals=globals())
times = timing.repeat(number=5)
t_unsliced = min(times)
print(f" Traj: Atom slice enabled: {t_unsliced / 5}")
cProfile.run(traj_test, 'traj.pstats')
```

And output from a few commits:

```
     Current commit: 0.6.0+g2c8d437  # release 0.6.0
 Freq: Atom slice enabled: 0.15540173510000005
Freq: Atom slice disabled: 0.12252338769999974
 Traj: Atom slice enabled: 1.7406494088000002

     Current commit: 0.6.1.dev0+g0f5201b  # this PR
 Freq: Atom slice enabled: 0.1577422553
Freq: Atom slice disabled: 0.12046521950000014
 Traj: Atom slice enabled: 1.675931647400001
```

For `ContactFrequency`, with or without atom slicing, the timing difference is within the error bars. There's a small improvement in `ContactTrajectory` (although it was a more significant problem in a Real Research™ project of mine, where this PR also massively reduces memory usage). This is because I added the ability to provide the indexer to the `from_contacts` constructor, which prevents it from being recreated for each frame. The overhead cost for atom slicing is entirely in the `convert_contacts` function, which is only called once per `ContactFrequency` (and I don't see any way to make it significantly faster).
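Conceptually, the saving is just setup work that no longer repeats per frame. The toy sketch below shows the pattern; these are not contact_map's real classes, and the exact `from_contacts` keyword in this PR may be spelled differently:

```python
class ToyIndexer:
    """Stand-in for the (relatively expensive) index caches."""
    def __init__(self, query, haystack):
        # imagine the residue/atom index maps being built here
        self.query = frozenset(query)
        self.haystack = frozenset(haystack)


class ToyFrameContacts:
    def __init__(self, contacts, indexer):
        self.contacts = contacts
        self.indexer = indexer

    @classmethod
    def from_contacts(cls, contacts, indexer=None, *, query=(), haystack=()):
        # reuse a precomputed indexer if one is given; only rebuild otherwise
        if indexer is None:
            indexer = ToyIndexer(query, haystack)
        return cls(contacts, indexer)


# build the indexer once and share it across frames
# (previously the equivalent setup happened once per frame)
shared = ToyIndexer(query=range(100), haystack=range(100))
frames = [ToyFrameContacts.from_contacts({}, indexer=shared) for _ in range(1000)]
```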

dwhswenson commented 4 years ago

Passes tests and maintains coverage. Ready for review.

sroet commented 4 years ago

~~Another memory-reducing step that might be possible later on (on top of what is implemented in the current version of the code): we know sliced indices are contiguous from 0 to `len(all_atoms)-1`, so you don't have to save the dict mapping both ways.~~

You could instead have `a = tuple(all_atoms)`, where `a[sliced_idx]` gives you the real index and `a.index(real_idx)` gives you the sliced one.

This should be done in a different PR, though, and is no reason to block this one.

Never mind: looking at the performance numbers, this should only be done one way (`.index()` on tuples is slow for big tuples).
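For reference, the one-direction version of that idea looks roughly like this:

```python
# real (topology) atom indices, in slice order
all_atoms = [3, 7, 12, 40]

# sliced -> real: a tuple is enough, since sliced indices are just 0..len-1
sliced_to_real = tuple(all_atoms)
assert sliced_to_real[2] == 12   # O(1) lookup

# real -> sliced: tuple.index() is O(n) per lookup, so for big systems this
# direction still wants a dict
real_to_sliced = {real: sliced for sliced, real in enumerate(all_atoms)}
assert real_to_sliced[12] == 2   # O(1) lookup
```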