bccp / nbodykit

Analysis kit for large-scale structure datasets, the massively parallel way
http://nbodykit.rtfd.io
GNU General Public License v3.0

Resampling a very big file with nbodykit #646

Open boryanah opened 3 years ago

boryanah commented 3 years ago

Hi Yu Feng,

I am trying to resample a mesh stored in a bigfile from 2304^3 (40 GB) to 1152^3 and save it again to a bigfile mesh using a parallel script (on NERSC).

So far I have tried the following, giving the job 512 GB spread across 16 nodes, but I got an out-of-memory error. Here is the script:

from nbodykit.lab import BigFileMesh

mesh = BigFileMesh(dens_dir + "density_%d.bigfile" % N_dim, mode='real', dataset='Field')
print("loaded bigfile")
# Nmesh is accepted here only because of the small modification to save() described below
mesh.save(dens_dir + "density_%d.bigfile" % N_dim_new, mode='real', dataset='Field', Nmesh=N_dim_new)

(The failure happens after loading the bigfile. Also, I have very slightly modified the save function in the source code to let the user specify the new downsampled mesh size Nmesh_new: I pass Nmesh through to the self.compute(mode, Nmesh=Nmesh) call inside save, and I have verified on a simple example that this works.)

The failure of this line makes me suspect that the save step is not parallelized, or that the self._paint_XXX function is not parallelized when resampling (i.e. when Nmesh differs from the Nmesh of the original mesh).

A new approach I intend to pursue is to load the bigfile mesh, convert it to a complex field through the paint command (which I believe uses pfft), zero all modes above pi * Nmesh_new / Lbox, and then revert to real space through the paint command (and save the downsampled field to a bigfile). Is that approach parallelized? If not, which functions can I use to make it so?
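
Concretely, I imagine the mode-zeroing step looking roughly like this (a sketch only; the box size and paths are placeholders for my actual values):

import numpy as np
from nbodykit.lab import BigFileMesh

N_dim, N_dim_new = 2304, 1152
dens_dir = "/path/to/density/"     # placeholder
Lbox = 2000.0                      # placeholder box size in Mpc/h

kcut = np.pi * N_dim_new / Lbox    # Nyquist frequency of the target 1152^3 mesh

def lowpass(k, v):
    # keep only modes below the new Nyquist frequency along every axis
    keep = (abs(k[0]) < kcut) & (abs(k[1]) < kcut) & (abs(k[2]) < kcut)
    return v * keep

mesh = BigFileMesh(dens_dir + "density_%d.bigfile" % N_dim, mode='real', dataset='Field')
filtered = mesh.apply(lowpass, mode='complex', kind='wavenumber')
# painting filtered back to real space and saving it to bigfile would then follow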

My nbodykit version is 0.3.15

rainwoodman commented 3 years ago

Hi,

The current resampling algorithm in pmesh does more or less what you want to do. It calls into https://github.com/rainwoodman/pmesh/blob/f8f1d78eb22027fbcf71345d382477cea25ab9e3/pmesh/pm.py#L479

What you are doing appears to be correct, but the memory is indeed a bit tight.

The resampling algorithm does use a lot of memory -- it was never heavily optimized. I think it was originally written to handle 128/256 meshes under extremely low memory load (in the over-decomposed regime).

Scanning the code in pmesh, I see scratch space for r2c (1x), for the indices (2 to 4 long integers per element, i.e. 4x or 8x of the floats), for global sorting (roughly 2x), for c2r (1x), and for the output (1x), plus some masks that can probably be ignored.

So the 512 GB appears to be a bit tight (as 13 * 40 GB = 520 GB).
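
Spelling out that tally (a rough upper bound: it takes the higher end of the index estimate and assumes one full mesh costs about its 40 GB on-disk size in memory):

base = 40  # GB: one full 2304^3 mesh held in memory
scratch = {
    'r2c': 1,
    'indices': 8,       # upper end of the 4x-8x estimate above
    'global sort': 2,
    'c2r': 1,
    'output': 1,
}
print(base * sum(scratch.values()))  # 13 * 40 GB = 520 GB, just over the 512 GB allocation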

More nodes usually also means faster processing, so if you can get 32 nodes, give it a try. We can also try to curb the memory usage of resample; it is likely that some of the intermediate objects can be freed eagerly with del, for example.

Another way to run into OOM is to use so many ranks that 1152x1152 is not evenly divided by the process mesh; in that case some ranks will not receive any part of the mesh. This seems unlikely, as you only have 16 nodes (and thus potentially 16 * 64 = 1024 = 32x32 ranks; how many ranks did you run the script with?), and 1152 = 128 * 9.
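
As a rough check of that for 64 ranks (a sketch only; pmesh may choose its process grid differently than MPI's default):

from mpi4py import MPI

# a balanced 2D grid for 64 ranks is 8 x 8, and 1152 is divisible by 8,
# so every rank gets an equal 144 x 144 x 1152 pencil and nobody sits idle
Nmesh_new = 1152
np1, np2 = MPI.Compute_dims(64, 2)
print(np1, np2, Nmesh_new // np1, Nmesh_new // np2)  # 8 8 144 144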

boryanah commented 3 years ago

Thank you -- that's good to know!

I am currently using, I believe, 64 ranks (2 tasks per node on 32 nodes):

#!/bin/bash -l
#SBATCH --tasks-per-node=2
#SBATCH --nodes=32
#SBATCH --mem=16GB
#SBATCH -q debug
#SBATCH -t 00:05:00
#SBATCH -C haswell
#SBATCH -J test-fields
#SBATCH --mail-user=boryana.hadzhiyska@cfa.harvard.edu
#SBATCH --mail-type=ALL
conda activate desc

which mpirun

echo $SLURM_NTASKS

mpirun -n $SLURM_NTASKS python obtain_fields.py

rainwoodman commented 3 years ago

I am not super familiar with the desc environment.

Could you do a conda list? Is this the environment you built yourself? mpirun on NERSC may not actually go across multiple nodes.

Did you check whether the results of these commands make sense to you?

mpirun -n 64 hostname

and

mpirun -n 64 python -c 'from mpi4py import MPI; print(MPI.COMM_WORLD.rank)'

(A year ago they were giving unexpected results)
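
A combined check along the same lines (save it to a small script and launch it the same way):

from mpi4py import MPI
import socket

# every rank should print a distinct rank id out of the full communicator size,
# and the hostnames together should cover all allocated nodes
comm = MPI.COMM_WORLD
print(comm.rank, comm.size, socket.gethostname())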

When running on NERSC compute nodes with MPI, I usually use the bcast-based environment provided by m3035:

source /usr/common/software/m3035/conda-activate.sh 3.7
# optionally, add
# bcast-pip [path_to_sdist_zip_file_of_additional_package.zip ...]

srun -n 64 python obtain_fields.py

boryanah commented 3 years ago

Hi Yu Feng,

I will try to follow those instructions in more detail if I run into further issues, but for now it seems that mpirun is working properly across nodes and prints different host names!

I was more worried about whether the code is properly parallelized when I do manipulations on a bigfile mesh, e.g. mesh.paint('real'), mesh.apply(filter), ArrayMesh(numpy_array), FFTPower(mesh), and mesh.save(). I am slightly confused by the documentation about which functions are and are not parallelized, but I may also be misreading it!

Cheers, Boryana

rainwoodman commented 3 years ago

Good to know! I have seen correct hostnames but wrong comm.rank values before, so definitely check that too.

When we designed nbodykit, all functions were intended to work on data distributed across multiple ranks (parallelized), unless specifically noted or unless we are hitting a bug / scaling bottleneck.

"preview()" was the only one that does not scale well that pops into my mind(visualization must be single headed). HaloCatalog used to be bad due to dependency on halotools, but that got fixed last year. Can't think of anything else right now..

boryanah commented 3 years ago

Thanks a lot for your help! This is all very useful to know!