GlobalArrays / ga

Partitioned Global Address Space (PGAS) library for distributed arrays
http://hpc.pnl.gov/globalarrays/

Get rid of ga_malloc #330

Closed: bjpalmer closed this issue 2 months ago

bjpalmer commented 4 months ago

Should we get rid of ga_malloc? It is used in a few places inside the GA code but not extensively. The main reason for getting rid of it would be to get rid of obscure errors that occur if you have not allocated a lot of memory using MA_Init. The original justification for using it was to improve performance by using pinned memory, but on the few occasions when I've swapped regular malloc for ga_malloc, I haven't seen any difference. @edoapra, @jeffhammond, what do you think?

jeffhammond commented 4 months ago

Yes. It should go away. I recall I had a PR for this as part of thread safety 6 years ago.

edoapra commented 4 months ago

Unfortunately, ga_malloc() is used in a NWChem routine https://github.com/nwchemgit/nwchem/blob/master/src/util/ga_asymmetr.c

Should I simply replace ga_malloc() with vanilla malloc()?

bjpalmer commented 4 months ago

You could probably still use it, if you want. I'd just propose getting rid of it inside GA. But same comment applies: does it look like it is doing anything for you inside NWChem?

jeffhammond commented 4 months ago

Yes, we should replace with malloc.

jeffhammond commented 4 months ago

One reason to stop using ga_malloc is that it creates a less-than-obvious dependency on MA, as noted in https://github.com/GlobalArrays/ga/pull/211. We either want to use regular malloc or ARMCI_Malloc_local.

bjpalmer commented 3 months ago

I created a branch, feature/malloc, that replaces ga_malloc with malloc. The test suite mostly runs but there may be a few hangs. However, there also appear to be a few hangs in develop, so I need to track those down. I put the changes inside an ifdef preprocessor block, so it should be easy to switch back and forth from ga_malloc to malloc by uncommenting the #define USE_GA_MALLOC line in base.h. If anyone gets a chance, can you try this branch out and see if there are any performance differences?

edoapra commented 3 months ago

I have just run the following tests on deception with OpenMPI 4.1.4 on the feature/malloc branch without any issue using MPI-PR.

srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/mir_perf2.x
srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/perfmod.x
srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/perf.x
srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/perform.x

bjpalmer commented 3 months ago

@edoapra, what compiler are you using? I tried running the mir_perf2.x test standalone and it still looks like it is hanging.

edoapra commented 3 months ago

gcc

bjpalmer commented 3 months ago

Which version? I'm using 8.1.0

edoapra commented 3 months ago

The default, 4.8.5. Let me try 8.

edoapra commented 3 months ago

The OpenMPI installation for the combination with gcc 8.1 is broken on deception.

edoapra commented 3 months ago

srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/mir_perf2.x
[dmi01.local:247150] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dmi01.local:247152] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dmi01.local:247151] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

bjpalmer commented 3 months ago

I thought you were running on constance.

edoapra commented 3 months ago

I didn't know constance is still around

bjpalmer commented 3 months ago

Yep, still is. I tried running on deception with gcc/9.4.0 and openmpi/4.1.4 and got the same hangs. I'll try gcc/4.8.5 and see what happens.

edoapra commented 3 months ago
[edo@constance01 bp3.cons]$ module load gcc/8.1.0
[edo@constance01 bp3.cons]$ module load openmpi
     The combination of compiler and MPI does not appear to be built yet

bjpalmer commented 3 months ago

Try gcc/9.4.0

edoapra commented 3 months ago

gcc/4.4.7 is busted on constance

edoapra commented 3 months ago

Try gcc/9.4.0

On constance or deception?

bjpalmer commented 3 months ago

deception

edoapra commented 3 months ago

Works for me. Just a warning in the initialization phase

module purge
module load gcc/9.4.0
module load openmpi/4.1.4
../ga_feature_malloc/configure --with-mpi-pr
make 
make check-ga
salloc  -N 2
srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/mir_perf2.x
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   dmi03
  Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   dmi01
  Local device: mlx5_0
--------------------------------------------------------------------------
(warning repeated for each of the three ranks on each node)

 Performance of GA get, put & acc for 1-dimensional sections of array[1048576]

                          Local 1-D Array Section  
     section           get               put           accumulate
  bytes    dim     sec      MB/s     sec      MB/s     sec      MB/s
      8      1  .430D-06 .186D+02 .352D-06 .227D+02 .527D-06 .152D+02
     72      9  .421D-06 .171D+03 .359D-06 .200D+03 .523D-06 .138D+03
    128     16  .421D-06 .304D+03 .371D-06 .345D+03 .523D-06 .245D+03
    648     81  .434D-06 .149D+04 .385D-06 .168D+04 .552D-06 .117D+04
   2048    256  .460D-06 .446D+04 .436D-06 .470D+04 .608D-06 .337D+04
   4608    576  .525D-06 .878D+04 .508D-06 .906D+04 .736D-06 .626D+04
   7200    900  .580D-06 .124D+05 .585D-06 .123D+05 .862D-06 .835D+04
  18432   2304  .944D-06 .195D+05 .995D-06 .185D+05 .140D-05 .132D+05
  32768   4096  .141D-05 .233D+05 .144D-05 .228D+05 .205D-05 .160D+05
  66248   8281  .259D-05 .256D+05 .253D-05 .261D+05 .363D-05 .183D+05
 131072  16384  .456D-05 .287D+05 .456D-05 .288D+05 .673D-05 .195D+05
 233928  29241  .796D-05 .294D+05 .871D-05 .269D+05 .119D-04 .197D+05
 524288  65536  .190D-04 .276D+05 .207D-04 .253D+05 .252D-04 .208D+05
 996872 124609  .343D-04 .291D+05 .337D-04 .295D+05 .498D-04 .200D+05
1548800 193600  .529D-04 .293D+05 .523D-04 .296D+05 .714D-04 .217D+05
2097152 262144  .707D-04 .296D+05 .697D-04 .301D+05 .963D-04 .218D+05

                          Remote 1-D Array Section 
     section           get               put           accumulate
  bytes    dim     sec      MB/s     sec      MB/s     sec      MB/s
      8      1  .315D-05 .254D+01 .845D-06 .947D+01 .974D-06 .821D+01
     72      9  .295D-05 .244D+02 .849D-06 .848D+02 .982D-06 .733D+02
    128     16  .303D-05 .422D+02 .853D-06 .150D+03 .989D-06 .129D+03
    648     81  .386D-05 .168D+03 .101D-05 .642D+03 .111D-05 .584D+03
   2048    256  .426D-05 .480D+03 .104D-05 .198D+04 .110D-05 .186D+04
   4608    576  .506D-05 .911D+03 .115D-05 .402D+04 .118D-05 .390D+04
   7200    900  .562D-05 .128D+04 .123D-05 .586D+04 .128D-05 .561D+04
  18432   2304  .786D-05 .234D+04 .177D-05 .104D+05 .182D-05 .101D+05
  32768   4096  .914D-05 .359D+04 .212D-05 .154D+05 .221D-05 .148D+05
  66248   8281  .126D-04 .526D+04 .349D-05 .190D+05 .367D-05 .180D+05
 131072  16384  .186D-04 .706D+04 .566D-05 .232D+05 .595D-05 .220D+05
 233928  29241  .289D-04 .810D+04 .980D-05 .239D+05 .989D-05 .236D+05
 524288  65536  .247D-03 .212D+04 .473D-04 .111D+05 .283D-04 .185D+05
 996872 124609  .236D-03 .423D+04 .657D-04 .152D+05 .554D-04 .180D+05
1548800 193600  .195D-03 .796D+04 .485D-03 .319D+04 .248D-03 .625D+04
2097152 262144  .288D-03 .729D+04 .180D-03 .116D+05 .308D-03 .680D+04
 All tests successful 

edoapra commented 3 months ago

I am using the incline queue on deception.
srun -p incline -N 2

edoapra commented 3 months ago

setenv MCA_btl_openib_allow_ib true gets rid of the OpenFabrics warning

bjpalmer commented 3 months ago

Okay, you must be setting something that I am not. I've loaded gcc/4.8.5 (I've also tried gcc/9.4.0) and openmpi/4.1.4, configured using your configure line and run using the submit script

#!/bin/csh
#SBATCH -t 01:30:00
#SBATCH -A panyaya
#SBATCH -p incline
#SBATCH -N 2
#SBATCH -n 6
#SBATCH -o ./test.out
#SBATCH -e ./test.err
source /etc/profile.d/modules.csh
source ~/set_deception2
module purge
module load gcc/4.8.5
module load openmpi/4.1.4
env | grep PATH
module list
setenv MCA_btl_openib_allow_ib true
srun -N 2 --tasks-per-node=3 mir_perf2.x

and the code still hangs.

bjpalmer commented 3 months ago

I just built the develop branch and am running the test suite. That looks like it works. There appear to have been a lot of changes since I spun off the ga_malloc branch.

edoapra commented 3 months ago

Okay, you must be setting something that I am not. I've loaded gcc/4.8.5 (I've also tried gcc/9.4.0) and openmpi/4.1.4, configured using your configure line and run using the submit script

#!/bin/csh
#SBATCH -t 01:30:00
#SBATCH -A panyaya
#SBATCH -p incline
#SBATCH -N 2
#SBATCH -n 6
#SBATCH -o ./test.out
#SBATCH -e ./test.err
source /etc/profile.d/modules.csh
source ~/set_deception2
module purge
module load gcc/4.8.5
module load openmpi/4.1.4
env | grep PATH
module list
setenv MCA_btl_openib_allow_ib true
srun -N 2 --tasks-per-node=3 mir_perf2.x

and the code still hangs.

This script works for me and I am using the feature/malloc branch.

Do you happen to have some extra OpenMPI configuration under $HOME/.openmpi ?

edoapra commented 3 months ago

Could you try to run the binary I have copied as /tmp/edo_mir_perf2.x on deception03 ?

bjpalmer commented 3 months ago

Okay, I'm an idiot. I didn't pick up changes to feature/malloc that were made recently. It looks like everything is passing for me on deception. I'll see if it works on constance.

bjpalmer commented 3 months ago

The full test suite passes for me on deception. It looks like the two-sided runtime passes on constance.

edoapra commented 3 months ago

I have also been testing this branch with the full QA test suite in NWChem. The results are the same as with 5.8.2

bjpalmer commented 3 months ago

The RMA runtime mostly passes on deception. The only failure is nbtestc.x, which has been consistently failing even with ga_malloc. I'm not sure if that is a problem with the MPI implementation or a problem on our side with how we handle non-blocking handles. At this point, @edoapra, @ajaypanyala, @jeffhammond, do you have an issue with merging this into develop?

edoapra commented 3 months ago

I have no issue against the merge. nbtestc.x works for me on deception with MPI3

bjpalmer commented 3 months ago

There is already a pull request for this (#348). @jeffhammond, if you get a chance can you approve this and we will merge it into develop?

jeffhammond commented 3 months ago

I had nontrivial feedback comments on that PR. I'm supportive but I think it can be improved. I'll make the changes if Edo is too busy.

edoapra commented 3 months ago

I had nontrivial feedback comments on that PR. I'm supportive but I think it can be improved. I'll make the changes if Edo is too busy.

Could you describe the feedback you are talking about? I am not able to find any comment under your name in https://github.com/GlobalArrays/ga/pull/348

jeffhammond commented 2 months ago

I don't know why they aren't visible. When I scroll down on https://github.com/GlobalArrays/ga/pull/348, I see

[four screenshots of the review comments, dated 2024-08-27]

bjpalmer commented 2 months ago

I thought about using MA_Sizeof but didn't so that we wouldn't be adding any additional MA dependencies in case we want to completely divorce GA from MA at some point in the future. I don't know if that is a priority for anyone. It's not a make or break item for me, so if anyone else feels strongly about it, let me know.

jeffhammond commented 2 months ago

In that case, perhaps we can hide the case-switch inside of a GA_SIZEOF macro. I'm not wedded to MA, merely against code duplication.

bjpalmer commented 2 months ago

I modified the code and defined a new function pnga_malloc that has the same signature as ga_malloc but can be switched under the hood to use either ga_malloc or malloc. This function is used everywhere that ga_malloc used to be used and that gets rid of the USE_GA_MALLOC symbol (except to convert between the two implementations of pnga_malloc). Somebody already created a GAsizeof function inside GA, so I just used that to handle GA types.

bjpalmer commented 2 months ago

@ajaypanyala, @edoapra, @jeffhammond, any final comments on this branch? If not, then we can approve the pull request and merge this into develop.

edoapra commented 2 months ago

I modified the code and defined a new function pnga_malloc that has the same signature as ga_malloc but can be switched under the hood to use either ga_malloc or malloc. This function is used everywhere that ga_malloc used to be used and that gets rid of the USE_GA_MALLOC symbol (except to convert between the two implementations of pnga_malloc). Somebody already created a GAsizeof function inside GA, so I just used that to handle GA types.

Where is this commit?

bjpalmer commented 2 months ago

Oh duh, I forgot to check it in. Try it now.

edoapra commented 2 months ago

Works for me

ajaypanyala commented 2 months ago

Looks good to me