dimchris / mdanalysis

Automatically exported from code.google.com/p/mdanalysis

multi-threaded distance_array() #80

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
It would be nice to have a multi-threaded version of the distance_array() and 
self_distance_array() functions to speed up this time consuming calculation.

A possible starting point could be 

  Multi-threaded vectorized distance matrix computation on the CELL/BE and x86/SSE2 architectures. Adrianto Wirawan, Chee Keong Kwoh and Bertil Schmidt. Bioinformatics (2010) 26 (10): 1368-1369. doi: 10.1093/bioinformatics/btq135

  http://bioinformatics.oxfordjournals.org/content/26/10/1368.full

Code is available at https://sourceforge.net/projects/distmatcomp/

Does anyone want to take this on as a nice little project in parallel programming? 
:-)

Original issue reported on code.google.com by orbeckst on 1 Oct 2011 at 9:15

GoogleCodeExporter commented 9 years ago
Now that's a cool ticket!

Were you thinking of following that particular implementation (i.e. C-based, 
exploiting the hardware details)? Anything I could write up in Python using the 
multiprocessing module would probably give a boost, but it would be nowhere near 
the low level described in the paper.

(I have to confess I follow the decomposition procedure only up to the stage of 
the minor diagonal ;))

Original comment by jan...@gmail.com on 9 Oct 2011 at 11:11

GoogleCodeExporter commented 9 years ago
I haven't really thought very hard about the implementation apart from the fact 
that it should be doable and that there's already optimised code out there to 
do something similar. If one can take the code by Wirawan et al, customize it, 
and add it in then that might be sufficient.

> And besides: is the main difference between distance_array() and 
> self_distance_array() in the shape of the returned array?

self_distance_array() only computes and stores N*(N-1)/2 distances instead of 
N*M. See 
http://mdanalysis.googlecode.com/svn/trunk/doc/html/documentation_pages/analysis/distances.html#MDAnalysis.analysis.distances.self_distance_array
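
To make the shape difference concrete, here is a minimal NumPy sketch (function names are illustrative, not the MDAnalysis API): the full N×M matrix versus the condensed N*(N-1)/2 form for the self case.

```python
import numpy as np

def full_distance_array(a, b):
    # N x M matrix of all pairwise distances between two coordinate sets
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def condensed_self_distances(a):
    # Only the N*(N-1)/2 unique pairs with i < j, as a flat array
    i, j = np.triu_indices(len(a), k=1)
    return np.sqrt(((a[i] - a[j]) ** 2).sum(axis=-1))

coords = np.random.rand(100, 3).astype(np.float32)
full = full_distance_array(coords, coords)    # shape (100, 100)
condensed = condensed_self_distances(coords)  # shape (4950,)
```

The condensed form stores roughly half the data and skips the zero diagonal, which is why self_distance_array() is worth having as a separate routine.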

Original comment by orbeckst on 10 Oct 2011 at 2:12

GoogleCodeExporter commented 9 years ago

Original comment by orbeckst on 15 Nov 2011 at 2:13

GoogleCodeExporter commented 9 years ago

Original comment by orbeckst on 15 Nov 2011 at 2:13

GoogleCodeExporter commented 9 years ago
Does that mean the ticket is up for grabs?

Original comment by jan...@gmail.com on 15 Nov 2011 at 2:19

GoogleCodeExporter commented 9 years ago
Yes, I realised that the system automatically assigns tickets when a developer 
files one, and that is not always an accurate reflection of who's actually 
taking care of things :-). If you came up with something faster for 
distance_array() then that would be much appreciated.

Original comment by orbeckst on 15 Nov 2011 at 5:36

GoogleCodeExporter commented 9 years ago
What are the sizes of the coordinate arrays we're aiming for, in most cases? 

Original comment by jan...@gmail.com on 16 Nov 2011 at 2:35

GoogleCodeExporter commented 9 years ago
I've coded up a possible solution, but for systems between 1e2 and 1e4 
coordinates it is definitely slower than what we have now (serial C). This 
could be either because the multiprocessing is slow or because I've messed up 
(guess which one is more likely).

My approach to parallelization is as simple as it gets. For two coordinate 
arrays
  coordA (100,3)
  coordB (100,3)
I take the number of cores available and slice coordA into pieces, such 
that coordA gets divided almost equally between the cores; I took care to share 
the memory occupied by the slices, not copy it. Then a scipy.weave (C-like, 
precompiled) method is called to calculate the distances between a slice of 
coordA and all of coordB.
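
A rough sketch of that slicing scheme (not the attached code, which used scipy.weave for the inner kernel; here a NumPy kernel and multiprocessing.pool.ThreadPool stand in for illustration — a process-based Pool would be the drop-in variant):

```python
import numpy as np
from multiprocessing.pool import ThreadPool  # multiprocessing.Pool for real process parallelism

def _block_distances(args):
    # Distances between one slice of coordA and all of coordB
    a_slice, b = args
    diff = a_slice[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def parallel_distance_array(coord_a, coord_b, n_workers=4):
    # Split coordA into near-equal slices, one per worker, then stack the rows back
    slices = np.array_split(coord_a, n_workers)
    with ThreadPool(n_workers) as pool:
        blocks = pool.map(_block_distances, [(s, coord_b) for s in slices])
    return np.vstack(blocks)

coordA = np.random.rand(100, 3).astype(np.float32)
coordB = np.random.rand(100, 3).astype(np.float32)
d = parallel_distance_array(coordA, coordB)  # shape (100, 100)
```

The row-block decomposition keeps each worker's write region disjoint, so no locking is needed when assembling the result.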

The code and example performance check attached.

Maybe dividing the calculation in some other way (eg equal size squares) could 
help?

Original comment by jan...@gmail.com on 16 Nov 2011 at 3:18

Attachments:

GoogleCodeExporter commented 9 years ago
Maybe you can profile the code (import cProfile...) and see where it spends 
time. Perhaps there's substantial setup/communication involved with 
multiprocessing and that kills it. Then I'd look at writing it in C using 
threads. 
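
A minimal profiling sketch along those lines (the pure-Python kernel here is just a stand-in for whatever function is being timed):

```python
import cProfile
import io
import pstats
import numpy as np

def distance_array_py(a, b):
    # Stand-in kernel to profile
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

a = np.random.rand(500, 3)
b = np.random.rand(500, 3)

profiler = cProfile.Profile()
profiler.enable()
distance_array_py(a, b)
profiler.disable()

# Print the five most expensive calls by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

If the report shows most of the time in worker setup or data serialization rather than the distance kernel itself, that would confirm the multiprocessing-overhead hypothesis.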

Original comment by orbeckst on 16 Nov 2011 at 2:00

GoogleCodeExporter commented 9 years ago
(Jan, I set status to 'Started' and Jan as issue owner, feel free to change 
again to None if you get bored of it.)

Original comment by orbeckst on 16 Nov 2011 at 2:02

GoogleCodeExporter commented 9 years ago
I'm afraid I won't have time to look into this any more, sorry.

Original comment by jan...@gmail.com on 17 Dec 2011 at 2:10

GoogleCodeExporter commented 9 years ago
(up for grabs, glory awaits)

Original comment by orbeckst on 6 Feb 2012 at 5:09

GoogleCodeExporter commented 9 years ago
If anybody is still interested or looking for a starting point, the 
cython.parallel module looks pretty simple and easy to use: 
http://docs.cython.org/src/userguide/parallelism.html
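
For reference, a prange-based kernel might look roughly like this (an untested Cython sketch, names illustrative; the module must be compiled with OpenMP flags):

```cython
# distance_parallel.pyx -- build with e.g.
#   extra_compile_args=['-fopenmp'], extra_link_args=['-fopenmp']
import numpy as np
cimport cython
cimport numpy as np
from cython.parallel import prange
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def distance_array(np.float32_t[:, :] a, np.float32_t[:, :] b):
    cdef Py_ssize_t i, j, n = a.shape[0], m = b.shape[0]
    cdef np.float32_t dx, dy, dz
    cdef np.float32_t[:, :] d = np.empty((n, m), dtype=np.float32)
    # Each row of the result is independent, so the outer loop
    # can be distributed across OpenMP threads without the GIL.
    for i in prange(n, nogil=True, schedule='static'):
        for j in range(m):
            dx = a[i, 0] - b[j, 0]
            dy = a[i, 1] - b[j, 1]
            dz = a[i, 2] - b[j, 2]
            d[i, j] = sqrt(dx * dx + dy * dy + dz * dz)
    return np.asarray(d)
```

Variables assigned inside the prange body (dx, dy, dz, j) are automatically thread-private in Cython, which keeps the loop free of data races.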

Original comment by jan...@gmail.com on 13 Jul 2012 at 3:46

GoogleCodeExporter commented 9 years ago
Putatively resolved in

http://code.google.com/p/mdanalysis/source/detail?r=390b6e838ebf87967bc8f2a78574e73c39c24e0f&name=develop

Benchmarking timeit script attached. The parallel cython implementation is 
fastest, the serial cython implementation is slowest; the parallel cython 
implementation gets faster as the matrix dimensions increase (explained below).

A. For coord((1000,3), float32), ~2x improvement
Performance for serial pure C distance_array
1.55131959915
Performance for cython distance_array (serial)
2.40466713905
Performance for cython distance_array (parallel)
0.63991060257

B. For coord((10000,3), float32), ~4x improvement
Performance for serial pure C distance_array
11.014444828
Performance for cython distance_array (serial)
13.3849320412
Performance for cython distance_array (parallel)
2.80142903328

The performance is improved as the matrix dimensions increase.
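
The attached script is not reproduced here, but a timeit-based comparison along these lines (with a NumPy kernel standing in for the implementations under test) would produce numbers in the format shown above:

```python
import timeit
import numpy as np

def distance_array_numpy(a, b):
    # Stand-in for one of the implementations being benchmarked
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

coord = np.random.rand(1000, 3).astype(np.float32)

# Repeat the measurement and report the best time, as timeit recommends,
# to reduce noise from other processes on the machine.
t = min(timeit.repeat(lambda: distance_array_numpy(coord, coord),
                      repeat=3, number=1))
print("Performance for numpy distance_array")
print(t)
```

Timing each implementation on the same coordinate array at several sizes (1e3, 1e4, ...) makes the scaling of the parallel speedup visible.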

Original comment by jan...@gmail.com on 15 Jul 2012 at 3:30

Attachments:

GoogleCodeExporter commented 9 years ago
Hi Jan – looking good.

I moved it elsewhere (MDAnalysis.core.parallel.distances) instead of 
MDAnalysis.analysis.cython.distances. Other accelerated/parallel algorithms 
could be put under parallel as well. In the future we could use core.flags to 
switch between serial and parallel versions. I also made a separate UnitTest 
file test_parallel.py (which passes – I really appreciated your nicely 
written test!)

We don't need a separate "cython" directory (after all, that's just one way of 
accelerating code). I also moved the pxd file to src/numtools (all non-py code 
lives under src/) and integrated it fully with the cython logic in setup.py (so 
that the C-code is used when cython is not available). 

Your parallel.distances.distance_array() does not yet support the same 
arguments as the serial version (box for PBC and array to be filled) – if you 
add this then we could offer the parallel version as a drop-in replacement for 
the serial core.distances.distance_array() one.

Good stuff!

Thanks,
Oliver

Original comment by orbeckst on 17 Jul 2012 at 11:11

GoogleCodeExporter commented 9 years ago
Hi Oli, 

Actually I started with the package structure you've proposed, but then I 
thought it might make things look too committed ;p

Thanks for all the kind words - it's a pleasure. Not often does one find a 
group putting emphasis on testing. 

The flag idea sounds great; additionally, I'll conform to the serial 
distance_array() 'signature'. 

Thanks!
Jan

Original comment by jan...@gmail.com on 3 Aug 2012 at 3:17

GoogleCodeExporter commented 9 years ago
Jan,

once your code supports the same call signature as the serial code and passes 
the tests, please close this ticket. Open another one for implementing a flag 
that switches between serial and parallel code. This sounds like an interesting 
direction for MDAnalysis 0.8.

Thanks,
Oli

Original comment by orbeckst on 3 Aug 2012 at 6:46

GoogleCodeExporter commented 9 years ago
Oli, quick question: calc_distances is using doubles for positions and floats 
for box/half_box. Is this a convention I should adhere to, or not?

Original comment by jan...@gmail.com on 4 Aug 2012 at 4:09

GoogleCodeExporter commented 9 years ago
> calc_distances is using doubles for positions and floats for box/half_box. Is 
> this a convention I should adhere to, or not?

I can't remember if there was a good reason. Keep it the same for the time 
being, at least then we know that the interface is exactly the same. If we 
decide to change it then we change it everywhere at once.

Original comment by orbeckst on 4 Aug 2012 at 5:21

GoogleCodeExporter commented 9 years ago
Cancel that: core.distances.distance_array takes float32 coords and box - 
otherwise it throws an error. I must have looked at an unused or wrong bit of 
code.

Anyhow, my code now also expects float32 coordinates and box.

Added the tests and corrected the signatures, can close now. 

Original comment by jan...@gmail.com on 7 Aug 2012 at 1:43

GoogleCodeExporter commented 9 years ago

Original comment by orbeckst on 8 Aug 2012 at 5:11