Work on Topology system

GoogleCodeExporter commented 9 years ago

Overhauled the Topology system I'd originally introduced.
Previous system had AtomGroup.bondDict which returned a TopologyDict which 
itself returned a TopologyGroup, which was confusing even to me.
Now everything works using TopologyGroup only, TopologyDict is instead a part 
of the TopologyGroup and is mostly invisible to users.

Other cool new feature is you can select Bonds/Angles/Torsions based on an 
AtomGroup.

Universe has .bonds .angles and .torsions attributes which are TopologyGroup 
objects which act as master lists of the topology objects
!! These are created on Universe creation and are the only change which affects 
people that don't want to use any of these changes.
!! This changes previous behaviour where universe.bonds was a list of Bond 
instances

AtomGroups now have .bonds .angles and .torsions attributes which are generated 
lazily into cache from the universe master lists.
These are TopologyGroups, which is designed to be a container analogous to 
AtomGroup.

TopologyGroup has:
 - shortcut to the Cython calculations calc_bonds calc_angles calc_torsions through .bonds() .angles() and .torsions() methods
 - selectBonds method to retrieve bonds based either on
   - type tuple eg ag.bonds.selectBonds(('C', 'H')) to select all C-H bonds from an AtomGroup
   - Atom instance, ag.bonds.selectBonds(Atom) retrieves all bonds which feature this atom
   - AtomGroup ag.bonds.selectBonds(AtomGroup) retrieves all bonds which are fully contained by the AtomGroup
 - __getitem__ shortcuts to the .selectBonds method
Examples of using this are given below.

Regarding performance, I benchmarked loading a Universe with & without topology 
information before & after the changes I've made:
This is a system of 18000 atoms with 18000 bonds, 35000 angles and 8000 
torsions.

I'm going to be looking into improving the performance of this while it's still 
a draft.

BEFORE:
With topology:
In [2]: %timeit -n10 a = mda.Universe(psffile, trjfile)
10 loops, best of 3: 380 ms per loop
Without topology:
In [1]: %timeit -n10 a = mda.Universe('atom_notop.psf', trjfile)
10 loops, best of 3: 79.3 ms per loop

AFTER:
With topology:
In [7]: %timeit -n10 a = mda.Universe(psffile, trjfile)
10 loops, best of 3: 356 ms per loop
Without topology:
In [1]: %timeit -n10 a = mda.Universe('atom_notop.psf', trjfile)
10 loops, best of 3: 77.8 ms per loop

Would like to hear if this makes sense and any suggested changes or complaints!

Example of using new system:

In [1]: u
Out[1]: <Universe with 18360 atoms and 18336 bonds>

# u.bonds now a special container for bonds
In [2]: u.bonds
Out[2]: < TopologyGroup containing 18336 bonds >

# can view which bonds my universe has
# this is done lazily as it's slow to create
# the names of these types are based on the .type attribute of the atoms
In [4]: u.bonds.types()
Out[4]: 
[('4', '5'),
 ('4', '6'),
 ('3', '1'),
 ('5', '2'),
 ('3', '3'),
 ('5', '3'),
 ('3', '4')]

# make a selection based on type
In [5]: tg = u.bonds[('4', '5')]

In [6]: tg
Out[6]: < TopologyGroup containing 960 bonds >  

# length of all bonds as a method of the TopologyGroup
In [7]: tg.bonds()
Out[7]: 
array([ 1.32992741,  1.35190573,  1.36695271,  .....

# Create an AtomGroup
In [8]: ag = u.atoms[:100]

# View the angles contained in this AtomGroup
# this creates a smaller TopologyGroup from the universe
# it's a little slow currently, but it's a cached attribute
In [9]: ag.angles
Out[9]: < TopologyGroup containing 184 angles >

# View which angles I have
In [12]: ag.angles.types()
Out[12]: 
[('4', '5', '3'),
 ('5', '3', '1'),
 ('1', '3', '4'),
 ('3', '3', '3'),
 ('4', '5', '2'),
 ('5', '4', '3'),
 ('1', '3', '1'),
 ('5', '3', '3'),
 ('3', '3', '4'),
 ('5', '4', '6'),
 ('3', '5', '2'),
 ('1', '3', '3'),
 ('6', '4', '3')]

# Select again, note that the selection criteria this time is 'backwards' 
compared to what was reported by .types(), 
# this doesn't matter as an angle of type ('H', 'C', 'C') is treated 
identically to ('C', 'C', 'H')
In [13]: tg2 = ag.angles[('3', '4', '6')]

In [14]: tg2
Out[14]: < TopologyGroup containing 5 angles >

# angle in radians of all these angles (done using Cython again)
In [15]: tg2.angles()
Out[15]: array([ 2.06366533,  2.05299721,  2.05026943,  2.04115112,  
2.05988778])

# Retrieve all bonds of a certain atom
In [16]: atom = u.atoms[25]

In [17]: tg3 = u.bonds[atom]

In [18]: tg3
Out[18]: < TopologyGroup containing 4 bonds >

In [19]: list(tg3)
Out[19]: 
[< Bond between: Atom 26 (Ch of polyami 1 None) and Atom 28 (H of polyami 1 
None), length 1.10 A>,
 < Bond between: Atom 23 (Ch of polyami 1 None) and Atom 26 (Ch of polyami 1 None), length 1.52 A>,
 < Bond between: Atom 26 (Ch of polyami 1 None) and Atom 27 (H of polyami 1 None), length 1.09 A>,
 < Bond between: Atom 26 (Ch of polyami 1 None) and Atom 29 (Ch of polyami 1 None), length 1.50 A>]

# Finally, can retrieve based on AtomGroup
In [20]: ag
Out[20]: <AtomGroup with 100 atoms>

# select all torsions from within this AtomGroup
# this selects only torsions where all 4 atoms are within the AtomGroup
In [21]: tg4 = u.torsions[ag]

In [22]: tg4
Out[22]: < TopologyGroup containing 39 torsions >

In [23]: ag2 = u.atoms[500:600]

In [24]: tg5 = u.torsions[ag2]

In [25]: tg5
Out[25]: < TopologyGroup containing 39 torsions >

# can combine similar to AtomGroups
In [26]: tg6 = tg4 + tg5

In [27]: tg6
Out[27]: < TopologyGroup containing 78 torsions >

** Other changes
Bond Angle and Torsion instances now use __slots__ in the class, seemed to 
improve performance (Universe creation time) for me.

** Further possible changes
-- Could remove .bonds .angles and .torsions attributes from Atom instances to 
save on memory space now that this information can be retrieved through 
TopologyGroup.selectBonds

Original issue reported on code.google.com by richardjgowers on 30 Aug 2014 at 11:34

GoogleCodeExporter commented 9 years ago

Hi Richard,

Overall your new topology framework sounds good. I particularly like how one 
can query topology informations from AtomGroup objects (but I am also confused, 
see below (2)). That fits really well into the overall object-oriented design. 
Hiding TopologyDict is also a Good Thing.

Concerns:

1) I really don't want to make loading a universe slower than it is already. 
Have you tried this with a 200,000 atom system with lots of water? How does it 
handle typical water models in particular? Could we make the generation of the 
master lists lazy?

2) I don't understand why you need to index u.bonds[ag] in order to get all the 
bonds that involve the atoms of ag. How is this different from ag.bonds ? If 
the difference is that the latter is only bonds WITHIN the group then I rather 
add a method such as ag.get_bonds() that returns either ag.bonds or 
u.bonds[ag], maybe based on a keyword internal=True|False. Similar get_angles() 
and get_torsions().

3) We should agree on using either degrees or radians in MDAnalysis; I think 
most functions and methods use degrees and this should all be consistent. (If 
speed is an issue and you don't want the deg2rad conversion, add a keyword such 
as unit="degrees" | "radians" or radians=False.)

4) What current code and tests do these changes break?

Don't add this code in develop right away but open a feature branch.

Oliver

Original comment by orbeckst on 31 Aug 2014 at 5:20

GoogleCodeExporter commented 9 years ago

Hi Oliver, thanks for the feedback.

To quickly clarify

re 1) Universe loading isn't longer with the changes, but loading topology in 
general does seem to take longer than it should. 
I'm going to profile and see where all our time is going, making master lists 
lazy is definitely possible.

2) ag.bonds and u.bonds[ag] are identical (this is how ag builds .bonds)
I think the problem here is that all AtomGroups build from u.bonds so if I did

ag = u.atoms[:100]
ag2 = ag[:10]

ag.bonds[ag2] 
would be much faster than
ag2.bonds 
as this would search u.bonds rather than the shorter ag.bonds.

This is another performance thing I'm looking into, ie is there some way to 
remember the parent AtomGroup so I can search through a smaller list.

3) When I wrote these I assumed that most other trig functions seemed to return 
rads, (eg np.sin).  I'll look through and see what the norm is within MDA, 
adding flags/keywords where necessary.

4) I did a git grep to try and see if this breaks anything else in the package, 
I don't think it does, but I'll double check this.

It will break anything old code that used the previous system (ag.bondDict 
ag.angleDict etc)
Also any code that used u.bonds as a list, this is easily fixed by using either 
list(u.bonds) or u._bondlist

Original comment by richardjgowers on 31 Aug 2014 at 9:07

GoogleCodeExporter commented 9 years ago

Ok pushed a new branch 'feature-improvedtopology'

Bond, angle and torsion information is built lazily into Universe, so the only 
difference in speed for building universes is only from parsing the extra lines 
in the .psf file.  This change should also make loading Universes much faster 
than it currently is.

I reworked fetching bonds based on AtomGroups, it's now called 
tg.atomgroup_intersection(ag).  This accepts a keyword 'strict' to make it only 
return bonds which are entirely in the AtomGroup.

ie

ag = u.atoms[20:30] # region of interest

tg = u.torsions[('C', 'Ca', 'C', 'C')] # torsion of interest

tg.bond_intersection(ag) # returns any torsions from this tg which at least 1 
atom from ag features in 

tg.bond_intersection(ag, strict=True) # returns any torsions where all 4 atoms 
are in ag

I don't think this breaks anything, I've updated+expanded tests; everything 
should pass.

Original comment by richardjgowers on 1 Sep 2014 at 11:40

GoogleCodeExporter commented 9 years ago

Original comment by richardjgowers on 17 Sep 2014 at 9:33

Changed state: Fixed

djay0529 / mdanalysis

Work on Topology system #194