boscoh / pdbremix

analyse PDB files, run molecular-dynamics & analyse trajectories
MIT License

Andrey Tokarev ASA optimized #5

Closed. Andrey-Tokarev closed this pull request 9 years ago.

Andrey-Tokarev commented 9 years ago

Hi Bosco,

I was busy these last few days, but I have finally implemented your suggestions: snake case and the new name for the function, calculate_asa_optimized. There is another possible huge optimization: parallelizing this function. As I see it, it is almost linearly scalable. It could be done, for example, with the multiprocessing package: https://docs.python.org/2/library/multiprocessing.html. If you are interested, I can do it the next time I have some free time.
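
For what it's worth, a minimal sketch of how that parallelization might look, assuming a per-atom worker; asa_for_atom and its placeholder body are hypothetical stand-ins, not the actual pdbremix code:

import multiprocessing

def asa_for_atom(i_atom):
  # Hypothetical worker: the real version would compute the accessible
  # surface area of atom i_atom against its neighbours. A placeholder
  # value stands in for that per-atom loop body here.
  return 0.0

def calculate_asa_parallel(n_atoms, n_workers=4):
  # Each atom's ASA is independent of the others, so the outer loop
  # over atoms can be farmed out to a process pool almost linearly.
  pool = multiprocessing.Pool(n_workers)
  try:
    areas = pool.map(asa_for_atom, range(n_atoms))
  finally:
    pool.close()
    pool.join()
  return areas

if __name__ == '__main__':
  print(sum(calculate_asa_parallel(1000)))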

Regards, Andrey

boscoh commented 9 years ago

Great stuff.

I wrote something similar for my molecular viewers, jolecule and pyball, to detect bonds in a PDB structure. I ended up using a spacehash and a one-dimensional Python array.array instead of a 3-tuple-based hash in a dictionary. I wonder what the speed differences are?

Parallelization seems interesting, but I think it's pushing the bounds of how I envisage the library. It's meant to be a simple single-threaded library that can be deployed easily (I use it on a cluster).

What's the paper you got published with it?

Bosco

Andrey-Tokarev commented 9 years ago

Thanks Bosco,

Regarding your first question: how did you implement the spacehash, or is it a Python package? And how do you use array.array for the spacehash? My implementation is very lazy. One way to speed it up would be to construct a hash function from the three coordinates rather than relying on the default key hashing in a Python dictionary. That might be a significant improvement, since the default hashing (as I understand it) operates on strings, while this one would operate on numbers, which is much faster. In both cases the complexity stays the same, O(n), of course; such an optimization only changes the constant C in C*n, which is still very useful in practice.
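
As an aside, the tuple-key versus integer-key question is easy to time directly; the micro-benchmark below is illustrative only and not part of pdbremix. (For what it's worth, CPython hashes tuples of ints natively rather than via strings, so the gap may be modest.)

import timeit

setup = """
import random
random.seed(0)
coords = [(random.randint(0, 50), random.randint(0, 50), random.randint(0, 50))
          for _ in range(10000)]
tuple_dict = dict((c, i) for i, c in enumerate(coords))
int_dict = dict((c[0]*51*51 + c[1]*51 + c[2], i) for i, c in enumerate(coords))
"""

# Look up every key once per repetition under each scheme.
t_tuple = timeit.timeit("for c in coords: tuple_dict[c]",
                        setup=setup, number=100)
t_int = timeit.timeit("for c in coords: int_dict[c[0]*51*51 + c[1]*51 + c[2]]",
                      setup=setup, number=100)

print("tuple keys: %.3f s" % t_tuple)
print("int keys:   %.3f s" % t_int)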

My paper using the ASA code has been submitted to a journal; I am waiting for the reviewers' response. It may interest you that I am using ASA as a factor in the probability of a molecule adsorbing onto the atoms of a platinum catalyst nanoparticle: the more available surface, the higher the probability of adsorption (the other factors are adsorption energy and temperature). When it is published, I will send you a link / the manuscript if you are interested.

Andrey

Andrey-Tokarev commented 9 years ago

Correction: after looking at the code, I see that optimizing the hash function (either constructing our own, or using another one, like your spacehash) affects only one line before the loop over all atoms ("for i, atom_i in enumerate(atoms):") in calculate_asa_optimized; in particular, it affects only the adjacency_list() procedure. Given that this loop is the bottleneck of calculate_asa_optimized(), such an optimization would not really save time :(

boscoh commented 9 years ago

I prototyped it in Python, which is included below.

This approach definitely saves memory: array.array is more compact than a list.

It depends on whether calculating a hash and looking up one number is better than using a tuple as the dictionary key. Do you think that'll make much difference?

import array
import math

class SpaceHash(object):
  """Hashes vertices into a cubic grid so that close pairs can be
  found by scanning only neighbouring cells, not all n^2 pairs."""

  def __init__(self, vertices, div=5.3, padding=0.05):
    self.vertices = vertices
    self.div = div  # width of a grid cell
    self.inv_div = 1.0/self.div
    self.padding = padding

    zero3 = lambda: [0.0]*3
    self.minima = zero3()
    self.maxima = zero3()
    self.spans = zero3()
    self.sizes = zero3()

    # Padded bounding box of the vertices, divided into
    # sizes[i] cells along each axis.
    for i in range(3):
      self.minima[i] = min([v[i] for v in self.vertices])
      self.maxima[i] = max([v[i] for v in self.vertices])
      self.minima[i] -= self.padding
      self.maxima[i] += self.padding
      self.spans[i] = self.maxima[i] - self.minima[i]
      self.sizes[i] = int(math.ceil(self.spans[i]*self.inv_div))

    self.size1_size2 = self.sizes[1]*self.sizes[2]
    self.size2 = self.sizes[2]

    # Map each vertex to its cell; cells are keyed by a single integer
    # hash, and each cell stores its vertex indices in a compact
    # 1-dimensional array.array.
    self.cells = {}
    self.spaces = []
    for i_vertex, vertex in enumerate(self.vertices):
      space = self.vertex_to_space(vertex)
      self.spaces.append(space)
      space_hash = self.space_to_hash(space)
      cell = self.cells.setdefault(space_hash, array.array('L'))
      cell.append(i_vertex)

  def vertex_to_space(self, v):
    # Integer (i, j, k) grid coordinates of a vertex.
    return [int((v[i] - self.minima[i])*self.inv_div) for i in range(3)]

  def space_to_hash(self, s):
    # Flattens (i, j, k) into a single integer key, row-major.
    return s[0]*self.size1_size2 + s[1]*self.size2 + s[2]

  def neighbourhood(self, space):
    # Yields the up-to-27 cells surrounding (and including) the given
    # cell, clipped to the grid boundaries.
    def neighbourhood_in_dim(space, i_dim):
      i = max(0, space[i_dim]-1)
      j = min(self.sizes[i_dim], space[i_dim] + 2)
      return range(i, j)
    for s0 in neighbourhood_in_dim(space, 0):
      for s1 in neighbourhood_in_dim(space, 1):
        for s2 in neighbourhood_in_dim(space, 2):
          yield [s0, s1, s2]

  def close_pairs(self):
    # Yields candidate pairs of vertices that share a cell or sit in
    # adjacent cells; i_vertex0 < i_vertex1 avoids double-counting.
    n_vertex = len(self.vertices)
    for i_vertex0 in range(n_vertex):
      space0 = self.spaces[i_vertex0]
      for space1 in self.neighbourhood(space0):
        hash1 = self.space_to_hash(space1)
        for i_vertex1 in self.cells.get(hash1, []):
          if i_vertex0 < i_vertex1:
            yield i_vertex0, i_vertex1
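
For reference, a minimal usage sketch of the prototype above; the random coordinates are made up for illustration:

import random

random.seed(0)
# 100 random points in a 30 x 30 x 30 box, standing in for atom positions.
points = [[random.uniform(0.0, 30.0) for _ in range(3)] for _ in range(100)]

hasher = SpaceHash(points)
n_pairs = sum(1 for pair in hasher.close_pairs())
print("%d candidate close pairs" % n_pairs)

Note that close_pairs() yields candidates from neighbouring cells only; each pair still needs an explicit distance check against the actual cutoff.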


Andrey-Tokarev commented 9 years ago

Hi Bosco,

I agree with you that array.array uses less memory than a list. This could be applied to the ASA calculation: neighbor_indices and the dictionary values could be stored as array.array. Because of the Python garbage collector this will indeed affect the instantaneous memory usage; it may also reduce the peak memory usage, if this part of the code is where memory consumption peaks (the bottleneck).

I looked at your prototype code and liked it. I am sure that "calculating a hash and looking up one number is better than using a tuple as the dictionary key". But as I said, it will not really affect the performance of the ASA calculation, because in the loop (the bottleneck) I am using neighbor_list, which is a list, not a dictionary. Of course, the best proof is an experiment (timing).

So I think that with your optimizations (1. array.array instead of a list; 2. a hash function instead of a tuple) in this particular code (the ASA calculation) you can save memory (the 1st optimization) but not time (the 2nd optimization).
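
A rough way to check the container-level memory difference; an illustrative snippet, not a profile of the ASA code itself:

import array
import sys

n = 100000
indices_list = list(range(n))
indices_array = array.array('L', range(n))

# getsizeof reports only the container; the list additionally holds a
# separate Python int object per element, so the real gap is larger.
print("list:  %d bytes" % sys.getsizeof(indices_list))
print("array: %d bytes" % sys.getsizeof(indices_array))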

boscoh commented 9 years ago

Thanks for the analysis. Good fun to discuss. I am currently working on a protein secondary-structure analyser using SpaceHash, which I will put into pdbremix.


Andrey-Tokarev commented 9 years ago

By the way, I have thought about it, and I can implement these optimizations some day. Even if they are not that effective overall, it is still good to have better code. Right?

boscoh commented 9 years ago

That'd be great. I just added spacehash.py to the library. If you want, you can adapt that to pdbasa. If you think it's reliable, then you can make it the default option.

In terms of implementing the 'best' code: the best advice I can give is to first work out whether the effort would be worth it for your most important goals. Haha.
