bccp / nbodykit

Analysis kit for large-scale structure datasets, the massively parallel way
http://nbodykit.rtfd.io
GNU General Public License v3.0

Is there any way to parallelize the computation of multiple convolutions? #683

Open WangYun1995 opened 1 year ago

WangYun1995 commented 1 year ago

Hi. First, I have a 3D density mesh (Nmesh=1536^3) obtained with nbodykit and stored as a bigfile on my disk. I then want to convolve the mesh with a band-pass filter at multiple scales, which is equivalent to a multiplication in the Fourier domain. Finally, at each scale, I compute a binned statistic, env_WPS, of the convolved mesh. My code snippet is shown below.

import numpy as np
from scipy import stats
from nbodykit.lab import BigFileMesh

def bandpass(scale):
   '''The band-pass filter in the Fourier space
   '''
   def filter( k, v ):
      cn       = 4.0*np.sqrt( np.sqrt(np.pi**3)/(9.0+55.0*np.e) )
      k2       = k[0]**2 + k[1]**2 + k[2]**2
      k2      /= scale**2 
      k        = np.sqrt(k2)
      filter_  = cn*( np.exp(k-0.5*k2)*(k2-k) + np.exp(-0.5*k**2-k)*(k2+k) )
      filter_ /= np.sqrt(scale**3)
      return v*filter_
   return filter

# Load the density mesh
path_mesh  = '/.../.../dens_field.bigfile'
dens_mesh  = BigFileMesh(path_mesh, 'Field')

# Parameters
Lbox  = dens_mesh.attrs['BoxSize'][0]    # Unit: Mpc/h
Nmesh = dens_mesh.attrs['Nmesh'][0]     # Integer
kf = 2*np.pi/Lbox
kNyq  = Nmesh*np.pi/Lbox
scales  = np.geomspace(kf, 0.4*kNyq, 25)
dens_bins = np.geomspace(0.1,100,7,endpoint=True) # The density bins

# Compute the volume fraction of each density environment
dens_m     = dens_mesh.compute(mode='real', Nmesh=Nmesh)
dens_m_    = np.ravel(dens_m)
dens_bins_ = np.pad( dens_bins, (1, 1), 'constant', constant_values=(0,np.amax(dens_m)) )
Nvol, _, _ = stats.binned_statistic(dens_m_, dens_m_, 'count', bins=dens_bins_)
fvol       = Nvol/Nmesh**3

# Initialize the env_WPS and global-WPS
env_WPS    = np.zeros( (len(scales), len(dens_bins)+1) )
global_WPS = np.zeros_like( scales )
# Perform the CWT and compute the env-WPS and global-WPS
for i, scale in enumerate(scales):
   cwt_mesh           = dens_mesh.apply(bandpass(scale), mode='complex', kind='wavenumber')
   cwt_rmesh          = cwt_mesh.compute(mode='real', Nmesh=Nmesh)
   cwt2               = cwt_rmesh**2
   env_WPS[i,:], _, _ = stats.binned_statistic(dens_m_, np.ravel(cwt2), 'mean', bins=dens_bins_)
   global_WPS[i]      = np.sum( env_WPS[i,:]*fvol )

Since the above code uses only one core, it is inefficient for a large Nmesh. How can I parallelize it in the context of nbodykit?

rainwoodman commented 1 year ago
  1. Run the code under mpirun. Each MPI process will get a part of the full density field.
  2. I think stats.binned_statistic would only compute the partial statistics on the part of the full density field that is local to that process. You will likely need to call MPI.COMM_WORLD.allreduce(...) on the result of stats.binned_statistic to add up the partial statistics (undo the mean first), and then recompute the mean; see the sketch below.
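
A minimal sketch of that reduction, assuming mpi4py is available; the helper name mpi_binned_mean and the use of the 'sum' and 'count' statistics are illustrative, not nbodykit API:

import numpy as np
from mpi4py import MPI
from scipy import stats

def mpi_binned_mean(x, values, bins, comm=MPI.COMM_WORLD):
   '''Binned mean of values over data scattered across MPI ranks.

   Each rank passes only its local slab; the partial sums and counts
   are combined with allreduce before the mean is recomputed.
   '''
   # per-rank partial sums and counts in each bin (this "undoes the mean")
   local_sum, _, _ = stats.binned_statistic(x, values, 'sum', bins=bins)
   local_cnt, _, _ = stats.binned_statistic(x, values, 'count', bins=bins)

   # add up the partial statistics across all ranks
   total_sum = comm.allreduce(local_sum, op=MPI.SUM)
   total_cnt = comm.allreduce(local_cnt, op=MPI.SUM)

   # recompute the mean from the global sums, guarding empty bins
   mean = np.where(total_cnt > 0, total_sum / np.maximum(total_cnt, 1), 0.0)
   return mean, total_cnt

Inside the loop above, env_WPS[i,:] would then come from mpi_binned_mean(dens_m_, np.ravel(cwt2), dens_bins_), and fvol from the reduced counts divided by Nmesh**3.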
WangYun1995 commented 1 year ago

Thank you for your nice reply. I will try what you said.

WangYun1995 commented 1 year ago

Hi Yu, the method you suggest does work. However, the parallelization is not very efficient. For example, I used a mesh with Nmesh = 768**3 as a test. With only one core, the time taken is 1407 seconds; with 64 cores, 722 seconds; with 96 cores, 611 seconds.
Obviously, turning on parallelization only halves the time, rather than reducing it to 1407/64 or 1407/96 seconds. Why is this the case?

rainwoodman commented 1 year ago

Do you know which part of your program is taking most time?

Usually, if your application is already bound by IO, then the speed will depend on how much bandwidth the file system can provide. Adding some prints of the timings can help us find out.

Less critical is making sure the CPU cores are not oversubscribed, e.g. by limiting the number of MKL or OpenMP threads per task.
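
A minimal sketch of such instrumentation; the timing helper and the environment-variable settings are illustrative and should be adapted to your setup:

import os
# Limit BLAS/OpenMP threads before numpy is imported, so the MPI ranks
# do not oversubscribe the CPU cores.
os.environ.setdefault('OMP_NUM_THREADS', '1')
os.environ.setdefault('MKL_NUM_THREADS', '1')

import time
from contextlib import contextmanager

@contextmanager
def timed(label, comm=None):
   '''Print the wall-clock time spent in a block (rank 0 only, if comm is given).'''
   t0 = time.time()
   yield
   if comm is None or comm.rank == 0:
      print('%s: %.1f s' % (label, time.time() - t0))

# Usage, e.g.:
# with timed('paint density'):
#     dens_m = dens_mesh.compute(mode='real', Nmesh=Nmesh)
# with timed('one scale'):
#     cwt_rmesh = dens_mesh.apply(bandpass(scale), mode='complex',
#                                 kind='wavenumber').compute(mode='real', Nmesh=Nmesh)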

WangYun1995 commented 1 year ago

The operations inside the for loop consume the most time.

rainwoodman commented 1 year ago

Perhaps it is because the computation and disk IO that lead to dens_m are repeated for each iteration?

I recall there is a way to create a Mesh from dens_m (FieldMesh? ArrayMesh? I cannot recall the name right away). Perhaps replace the density mesh object used in the loop with that, as in the sketch below?
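
A minimal sketch of that change, assuming the class in question is FieldMesh from nbodykit.lab: paint the density field once, wrap the in-memory result, and run the loop on the wrapped mesh so the bigfile is only read once.

from nbodykit.lab import FieldMesh

# Paint the density once; this is the only place the bigfile is read.
dens_m  = dens_mesh.compute(mode='real', Nmesh=Nmesh)
dens_m_ = np.ravel(dens_m)

# Wrap the in-memory field; the loop below no longer touches the disk.
mem_mesh = FieldMesh(dens_m)

for i, scale in enumerate(scales):
   cwt_rmesh          = mem_mesh.apply(bandpass(scale), mode='complex',
                                       kind='wavenumber').compute(mode='real')
   cwt2               = cwt_rmesh**2
   env_WPS[i,:], _, _ = stats.binned_statistic(dens_m_, np.ravel(cwt2),
                                               'mean', bins=dens_bins_)
   global_WPS[i]      = np.sum(env_WPS[i,:]*fvol)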

WangYun1995 commented 1 year ago

This really improves the efficiency. I will acknowledge you in my new paper.

rainwoodman commented 1 year ago

Glad to hear it worked. Thank you!
