Andrey-Tokarev closed this 9 years ago
Great stuff.
I wrote something similar for my molecular viewers, jolecule and pyball, to detect bonds in a PDB structure. I ended up using a spacehash and a one-dimensional Python array.array instead of a 3-tuple-based hash in a dictionary. I wonder what the speed differences are?
Parallelization seems interesting, but I think it's pushing the bounds of how I envisage the library. It's meant to be a simple single-threaded library that can be deployed easily (I use it on a cluster).
What's the paper you got published with it?
Bosco
Thanks Bosco,
Regarding your first question: how did you implement the spacehash, or is it a Python package? And how do you use array.array for the spacehash? My implementation is very lazy. One way to speed it up would be to construct a hash function based on the three coordinates rather than relying on the default key hashing of a Python dictionary. That might be a significant improvement, given that the default hashing (as I understand it) operates on strings, while this one would operate on numbers, which is much faster. In both cases, of course, the complexity stays the same, O(n); the optimization only changes the constant in C*n, which is still very good, especially in practice.
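To see whether integer keys actually beat tuple keys, a micro-benchmark is easy to write. This is only a sketch: the grid size and key formula are illustrative, not from the ASA code, and in pure Python the arithmetic to compute the integer key can offset the cheaper hash.

```python
import timeit

# Hypothetical micro-benchmark: dictionary lookups keyed by 3-tuples of
# grid coordinates vs. lookups keyed by a single flattened integer.
n = 20
tuple_keys = {(i, j, k): 0 for i in range(n) for j in range(n) for k in range(n)}
int_keys = {i * n * n + j * n + k: 0 for i in range(n) for j in range(n) for k in range(n)}

def lookup_tuples():
    for i in range(n):
        for j in range(n):
            for k in range(n):
                tuple_keys[(i, j, k)]

def lookup_ints():
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # the flattening arithmetic itself costs Python bytecodes
                int_keys[i * n * n + j * n + k]

t_tuple = timeit.timeit(lookup_tuples, number=10)
t_int = timeit.timeit(lookup_ints, number=10)
print("tuple keys: %.4fs, int keys: %.4fs" % (t_tuple, t_int))
```

Which side wins depends on the Python version and the key mix, so timing it on the real atom data is the only reliable answer.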
My paper using the ASA code has been submitted to the journal; I am waiting for the reviewers' response. You may find it interesting that I use ASA as a factor in the probability of a molecule adsorbing onto the atoms of a platinum catalyst nanoparticle: the more available surface, the higher the probability of adsorption (the other factors are adsorption energy and temperature). When it is published, I will send you a link / manuscript if you are interested.
Andrey
Correction: after looking at the code, it is clear that optimizing the hash function (either constructing our own, or using another one, like your spacehash) affects only one line before the loop over all atoms ("for i, atom_i in enumerate(atoms):") in calculate_asa_optimized; in particular, it affects only the adjacency_list() procedure. Given that this loop is the bottleneck of calculate_asa_optimized(), such an optimization would not really save time :(
I prototyped it in Python, which is included below.
This approach definitely saves memory - array.array is better than a list.
It depends on whether calculating a hash and looking up one number is better than using a tuple as the dictionary key. Do you think that'll make much difference?
import array
import math


class SpaceHash(object):
    """Hashes 3D vertices into a regular grid of cubic cells of width div."""

    def __init__(self, vertices, div=5.3, padding=0.05):
        self.vertices = vertices
        self.div = div
        self.inv_div = 1.0 / self.div
        self.padding = padding

        zero3 = lambda: [0.0] * 3
        self.minima = zero3()
        self.maxima = zero3()
        self.spans = zero3()
        self.sizes = zero3()

        for i in range(3):
            self.minima[i] = min([v[i] for v in self.vertices])
            self.maxima[i] = max([v[i] for v in self.vertices])
            self.minima[i] -= self.padding
            self.maxima[i] += self.padding
            self.spans[i] = self.maxima[i] - self.minima[i]
            self.sizes[i] = int(math.ceil(self.spans[i] * self.inv_div))

        self.size1_size2 = self.sizes[1] * self.sizes[2]
        self.size2 = self.sizes[2]

        # map each cell's flattened integer hash to an array of vertex indices
        self.cells = {}
        self.spaces = []
        for i_vertex, vertex in enumerate(self.vertices):
            space = self.vertex_to_space(vertex)
            self.spaces.append(space)
            space_hash = self.space_to_hash(space)
            cell = self.cells.setdefault(space_hash, array.array('L'))
            cell.append(i_vertex)

    def vertex_to_space(self, v):
        return [int((v[i] - self.minima[i]) * self.inv_div) for i in range(3)]

    def space_to_hash(self, s):
        return s[0] * self.size1_size2 + s[1] * self.size2 + s[2]

    def neighbourhood(self, space):
        def neighbourhood_in_dim(space, i_dim):
            i = max(0, space[i_dim] - 1)
            j = min(self.sizes[i_dim], space[i_dim] + 2)
            return range(i, j)

        for s0 in neighbourhood_in_dim(space, 0):
            for s1 in neighbourhood_in_dim(space, 1):
                for s2 in neighbourhood_in_dim(space, 2):
                    yield [s0, s1, s2]

    def close_pairs(self):
        n_vertex = len(self.vertices)
        for i_vertex0 in range(n_vertex):
            space0 = self.spaces[i_vertex0]
            for space1 in self.neighbourhood(space0):
                hash1 = self.space_to_hash(space1)
                for i_vertex1 in self.cells.get(hash1, []):
                    # i_vertex0 < i_vertex1 ensures each pair is yielded once
                    if i_vertex0 < i_vertex1:
                        yield i_vertex0, i_vertex1
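To make the grid indexing concrete, here is a small self-contained sketch of the same math, with the constants (minima, sizes) chosen purely for illustration: a vertex maps to integer grid coordinates, which flatten into a single dictionary key, and two points within div of each other land in the same (or an adjacent) cell.

```python
# Illustrative standalone version of the grid math (constants are
# hypothetical, not taken from any real structure).
div = 5.3
inv_div = 1.0 / div
minima = [0.0, 0.0, 0.0]
sizes = [4, 4, 4]

def vertex_to_space(v):
    # map a 3D point to integer grid coordinates
    return [int((v[i] - minima[i]) * inv_div) for i in range(3)]

def space_to_hash(s):
    # flatten 3 grid coordinates into one integer key
    return s[0] * sizes[1] * sizes[2] + s[1] * sizes[2] + s[2]

a = vertex_to_space([1.0, 6.0, 11.0])   # -> [0, 1, 2]
b = vertex_to_space([2.0, 7.0, 12.0])   # nearby point, same cell
print(a, space_to_hash(a), b, space_to_hash(b))
```

Because a and b flatten to the same key, close_pairs() would pull them out of the same cell bucket as candidate neighbours.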
Hi Bosco,
I agree with you that array.array uses less memory than a list. This could be implemented in the ASA calculation: neighbor_indices and the dictionary values could be realized via array.array. Because of the Python garbage collector, it will indeed affect instantaneous memory usage; it may also affect peak memory usage if this part of the code is where memory use peaks (the bottleneck).
I looked at your prototype code and liked it. I am sure that "calculating a hash and looking up one number is better than using a tuple as dictionary key". But, as I said, it will not really affect the performance of the ASA calculation, because in the bottleneck loop I am using neighbor_list, which is a list, not a dictionary. Of course, the best proof is an experiment (timing).
So, I think that with your optimizations (1. array.array instead of list; 2. hash function instead of tuple) in this particular code (the ASA calculation) you can save memory (1st optimization), but not time (2nd optimization).
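The memory saving from array.array is easy to demonstrate. A rough sketch (exact byte counts vary by Python version and platform): a list holds pointers to boxed int objects, so the objects' own sizes must be added on top of sys.getsizeof of the list itself, while array.array('L') stores raw machine integers inline.

```python
import array
import sys

# Rough memory comparison between a list of indices and array.array('L').
# sys.getsizeof on a list counts only its pointer table, so the boxed
# int objects are summed separately. Numbers are platform-dependent.
n = 100000
as_list = list(range(n))
as_array = array.array('L', range(n))

list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
array_bytes = sys.getsizeof(as_array)
print("list: ~%d bytes, array.array: ~%d bytes" % (list_bytes, array_bytes))
```

On a typical 64-bit build the list plus its boxed ints is several times larger than the array, which is what matters for the peak-memory case discussed above.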
Thanks for the analysis. Good fun to discuss. I am currently working on a protein secondary-structure analyser using Spacehash, which I will put into pdbremix.
By the way, I thought about it, and I could implement these optimizations some day. Even if they are not that significant overall, it is still good to have better code. Right?
That'd be great. I just added spacehash.py to the library. If you want, you can adapt that to pdbasa. If you think it's reliable, then you can make it the default option.
In terms of implementing the 'best' code: the best thing I can advise is to first work out whether the effort would be worth it for your most important goals. Haha.
Hi Bosco,
I was busy these last few days, but I have finally implemented your suggestions: snake case and a new name for the function, "calculate_asa_optimized". There is another possible huge optimization: parallelizing this function. As I see it, it is almost linearly scalable. It could be done, for example, with the "multiprocessing" package: https://docs.python.org/2/library/multiprocessing.html. If you are interested, I can do it the next time I have some free time.
Regards, Andrey