Closed bjpalmer closed 2 months ago
Yes. It should go away. I recall I had a PR for this as part of thread safety 6 years ago.
Unfortunately, ga_malloc() is used in a NWChem routine https://github.com/nwchemgit/nwchem/blob/master/src/util/ga_asymmetr.c
Should I simply replace ga_malloc() with vanilla malloc()?
You could probably still use it, if you want. I'd just propose getting rid of it inside GA. But same comment applies: does it look like it is doing anything for you inside NWChem?
Yes, we should replace it with malloc.
One reason to stop using ga_malloc is that it creates a less-than-obvious dependency on MA, as noted in https://github.com/GlobalArrays/ga/pull/211. We either want to use regular malloc or ARMCI_Malloc_local.
I created a branch, feature/malloc, that replaces ga_malloc with malloc. The test suite mostly runs but there may be a few hangs. However, there also appear to be a few hangs in develop, so I need to track those down. I put the changes inside an ifdef preprocessor block, so it should be easy to switch back and forth from ga_malloc to malloc by uncommenting the #define USE_GA_MALLOC line in base.h. If anyone gets a chance, can you try this branch out and see if there are any performance differences?
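The compile-time switch described above could look something like the following minimal sketch. This is illustrative only, assuming a base.h-style header; the macro and function names (GA_ALLOC, GA_FREE, ga_alloc_plain) are hypothetical stand-ins, not the actual code on the feature/malloc branch.

```c
#include <stdlib.h>

/* Hypothetical sketch of the switch: uncommenting the next line would
 * route allocations back through the MA-backed ga_malloc path. */
/* #define USE_GA_MALLOC */

#ifdef USE_GA_MALLOC
/* old path: MA-backed allocator (declaration shape is a guess) */
extern void *ga_malloc(int type, int nelem, const char *name);
#define GA_ALLOC(type, nelem, name) ga_malloc((type), (nelem), (name))
#define GA_FREE(ptr)                ga_free(ptr)
#else
/* plain-malloc path: nelem elements of elem_size bytes each */
static void *ga_alloc_plain(size_t nelem, size_t elem_size) {
    return malloc(nelem * elem_size);
}
#define GA_FREE(ptr) free(ptr)
#endif
```

The advantage of a single #ifdef choke point is that a performance comparison like the one requested above only needs a one-line change and a rebuild.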
I have just run the following tests on deception with OpenMPI 4.1.4 on the feature/malloc branch without any issue using MPI-PR.
srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/mir_perf2.x
srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/perfmod.x
srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/perf.x
srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/perform.x
@edoapra, what compiler are you using? I tried running the mir_perf2.x test standalone and it still looks like it is hanging.
gcc
Which version? I'm using 8.1.0
The default, 4.8.5. Let me try 8.
The OpenMPI installation for the combination with gcc 8.1 is broken on deception
srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/mir_perf2.x
[dmi01.local:247150] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dmi01.local:247152] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dmi01.local:247151] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
I thought you were running on constance.
I didn't know constance was still around
Yep, still is. I tried running on deception with gcc/9.4.0 and openmpi/4.1.4 and got the same hangs. I'll try gcc/4.8.5 and see what happens.
[edo@constance01 bp3.cons]$ module load gcc/8.1.0
[edo@constance01 bp3.cons]$ module load openmpi
The combination of compiler and MPI does not appear to be built yet
Try gcc/9.4.0
gcc/4.4.7 is busted on constance
Try gcc/9.4.0
On constance or deception?
deception
Works for me. Just a warning in the initialization phase
module purge
module load gcc/9.4.0
module load openmpi/4.1.4
../ga_feature_malloc/configure --with-mpi-pr
make
make check-ga
salloc -N 2
srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/mir_perf2.x
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: dmi03
Local device: mlx5_0
--------------------------------------------------------------------------
[the same OpenFabrics warning was printed by the remaining ranks on dmi03 and dmi01]
Performance of GA get, put & acc for 1-dimensional sections of array[1048576]
Local 1-D Array Section
section get put accumulate
bytes dim sec MB/s sec MB/s sec MB/s
8 1 .430D-06 .186D+02 .352D-06 .227D+02 .527D-06 .152D+02
72 9 .421D-06 .171D+03 .359D-06 .200D+03 .523D-06 .138D+03
128 16 .421D-06 .304D+03 .371D-06 .345D+03 .523D-06 .245D+03
648 81 .434D-06 .149D+04 .385D-06 .168D+04 .552D-06 .117D+04
2048 256 .460D-06 .446D+04 .436D-06 .470D+04 .608D-06 .337D+04
4608 576 .525D-06 .878D+04 .508D-06 .906D+04 .736D-06 .626D+04
7200 900 .580D-06 .124D+05 .585D-06 .123D+05 .862D-06 .835D+04
18432 2304 .944D-06 .195D+05 .995D-06 .185D+05 .140D-05 .132D+05
32768 4096 .141D-05 .233D+05 .144D-05 .228D+05 .205D-05 .160D+05
66248 8281 .259D-05 .256D+05 .253D-05 .261D+05 .363D-05 .183D+05
131072 16384 .456D-05 .287D+05 .456D-05 .288D+05 .673D-05 .195D+05
233928 29241 .796D-05 .294D+05 .871D-05 .269D+05 .119D-04 .197D+05
524288 65536 .190D-04 .276D+05 .207D-04 .253D+05 .252D-04 .208D+05
996872 124609 .343D-04 .291D+05 .337D-04 .295D+05 .498D-04 .200D+05
1548800 193600 .529D-04 .293D+05 .523D-04 .296D+05 .714D-04 .217D+05
2097152 262144 .707D-04 .296D+05 .697D-04 .301D+05 .963D-04 .218D+05
Remote 1-D Array Section
section get put accumulate
bytes dim sec MB/s sec MB/s sec MB/s
8 1 .315D-05 .254D+01 .845D-06 .947D+01 .974D-06 .821D+01
72 9 .295D-05 .244D+02 .849D-06 .848D+02 .982D-06 .733D+02
128 16 .303D-05 .422D+02 .853D-06 .150D+03 .989D-06 .129D+03
648 81 .386D-05 .168D+03 .101D-05 .642D+03 .111D-05 .584D+03
2048 256 .426D-05 .480D+03 .104D-05 .198D+04 .110D-05 .186D+04
4608 576 .506D-05 .911D+03 .115D-05 .402D+04 .118D-05 .390D+04
7200 900 .562D-05 .128D+04 .123D-05 .586D+04 .128D-05 .561D+04
18432 2304 .786D-05 .234D+04 .177D-05 .104D+05 .182D-05 .101D+05
32768 4096 .914D-05 .359D+04 .212D-05 .154D+05 .221D-05 .148D+05
66248 8281 .126D-04 .526D+04 .349D-05 .190D+05 .367D-05 .180D+05
131072 16384 .186D-04 .706D+04 .566D-05 .232D+05 .595D-05 .220D+05
233928 29241 .289D-04 .810D+04 .980D-05 .239D+05 .989D-05 .236D+05
524288 65536 .247D-03 .212D+04 .473D-04 .111D+05 .283D-04 .185D+05
996872 124609 .236D-03 .423D+04 .657D-04 .152D+05 .554D-04 .180D+05
1548800 193600 .195D-03 .796D+04 .485D-03 .319D+04 .248D-03 .625D+04
2097152 262144 .288D-03 .729D+04 .180D-03 .116D+05 .308D-03 .680D+04
All tests successful
I am using the incline queue on deception.
srun -p incline -N 2
setenv MCA_btl_openib_allow_ib true
gets rid of the OpenFabrics warning
I just built the develop branch and am running the test suite. That looks like it works. There appear to have been a lot of changes since I spun off the ga_malloc branch.
Okay, you must be setting something that I am not. I've loaded gcc/4.8.5 (I've also tried gcc/9.4.0) and openmpi/4.1.4, configured using your configure line, and run using the submit script
#!/bin/csh
#SBATCH -t 01:30:00
#SBATCH -A panyaya
#SBATCH -p incline
#SBATCH -N 2
#SBATCH -n 6
#SBATCH -o ./test.out
#SBATCH -e ./test.err
source /etc/profile.d/modules.csh
source ~/set_deception2
module purge
module load gcc/4.8.5
module load openmpi/4.1.4
env | grep PATH
module list
setenv MCA_btl_openib_allow_ib true
srun -N 2 --tasks-per-node=3 mir_perf2.x
and the code still hangs.
This script works for me and I am using the feature/malloc branch.
Do you happen to have some extra OpenMPI configuration under $HOME/.openmpi ?
Could you try to run the binary I have copied as /tmp/edo_mir_perf2.x on deception03 ?
Okay, I'm an idiot. I didn't pick up changes to feature/malloc that were made recently. It looks like everything is passing for me on deception. I'll see if it works on constance.
The full test suite passes for me on deception. It looks like the two-sided runtime passes on constance.
I have also been testing this branch with the full QA test suite in NWChem. The results are the same as with 5.8.2.
The RMA runtime mostly passes on deception. The only failure is nbtestc.x, which has been consistently failing even with ga_malloc. I'm not sure if that is a problem with the MPI implementation or a problem on our side with how we handle non-blocking handles. At this point, @edoapra, @ajaypanyala, @jeffhammond, do you have an issue with merging this into develop?
I have no issue against the merge. nbtestc.x works for me on deception with MPI3
There is already a pull request for this (#348). @jeffhammond, if you get a chance can you approve this and we will merge it into develop?
I had nontrivial feedback comments on that PR. I'm supportive but I think it can be improved. I'll make the changes if Edo is too busy.
Could you describe the feedback you are talking about? I am not able to find any comment under your name in https://github.com/GlobalArrays/ga/pull/348
I don't know why they aren't visible. When I scroll down on https://github.com/GlobalArrays/ga/pull/348, I see
I thought about using MA_Sizeof but didn't so that we wouldn't be adding any additional MA dependencies in case we want to completely divorce GA from MA at some point in the future. I don't know if that is a priority for anyone. It's not a make or break item for me, so if anyone else feels strongly about it, let me know.
In that case, perhaps we can hide the case-switch inside of a GA_SIZEOF macro. I'm not wedded to MA, merely against code duplication.
I modified the code and defined a new function pnga_malloc that has the same signature as ga_malloc but can be switched under the hood to use either ga_malloc or malloc. This function is used everywhere that ga_malloc used to be used and that gets rid of the USE_GA_MALLOC symbol (except to convert between the two implementations of pnga_malloc). Somebody already created a GAsizeof function inside GA, so I just used that to handle GA types.
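As a rough illustration of the pnga_malloc idea: the wrapper keeps the old signature but picks the allocator at compile time. The signature and names below are guesses for illustration only; the real code lives in GA's base files and uses the existing GAsizeof helper, which this self-contained sketch stubs out.

```c
#include <stdlib.h>
#include <stddef.h>

/* Stub standing in for GA's existing GAsizeof helper (illustrative
 * type tags: 0 means int, anything else means double here). */
static size_t GAsizeof_stub(int type) {
    return (type == 0) ? sizeof(int) : sizeof(double);
}

/* Hedged sketch of a pnga_malloc-style wrapper. */
static void *pnga_malloc_sketch(int type, int nelem, const char *name) {
    (void)name;  /* the name argument only matters to the MA-backed path */
#ifdef USE_GA_MALLOC
    /* old path: would call ga_malloc(type, nelem, name) here */
    return NULL;
#else
    return malloc((size_t)nelem * GAsizeof_stub(type));
#endif
}
```

Callers never see the switch; only the wrapper body changes between the two implementations.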
@ajaypanyala, @edoapra, @jeffhammond, any final comments on this branch? If not, then we can approve the pull request and merge this into develop.
Where is this commit?
Oh duh, I forgot to check it in. Try it now.
Works for me
Looks good to me
Should we get rid of ga_malloc? It is used in a few places inside the GA code but not extensively. The main reason to drop it would be to avoid obscure errors that occur if you have not allocated enough memory using MA_Init. The original justification for using it was to improve performance by using pinned memory, but on the few occasions when I've swapped regular malloc for ga_malloc, I haven't seen any difference. @edoapra, @jeffhammond, what do you think?