charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

Ibverbs hangs with leanmd #290

Closed pplimport closed 11 years ago

pplimport commented 11 years ago

Original author: Abhishek Gupta Original issue: https://charm.cs.illinois.edu/redmine/issues/290


Ibverbs hangs with leanmd:

leanmd hangs when run with an ibverbs build of charm, while the same leanmd run with a net build of charm works fine.

The charm build command is:

    ./build charm++ net-linux-x86_64 ibverbs --with-production -g

The leanmd I use is the latest from git, configured and run as follows:

1) In def.h, change lines 21 and 22 to:

    #define PARTICLES_PER_CELL_START 990
    #define PARTICLES_PER_CELL_END 990

2) The command to run leanmd is:

    ./charmrun ++nodelist nodelist +p16 ./leanmd 8 8 8 200 2000 2000

3) I run leanmd on 4 nodes of Stampede using 4 cores per node. The submission script looks like:

    #SBATCH -t 00:30:00
    #
    #SBATCH -p development
    #
    #SBATCH -N 4
    #
    #SBATCH -n 16
    #
    #SBATCH -J leanmd

pplimport commented 5 years ago

Original author: Abhishek Gupta Original date: 2013-09-12 20:17:35


Does not hang on the net or mpi layer.
Does not hang with Randomized Queue.

Hangs on ibverbs even with RDMA disabled.
Hangs on ibverbs even with charm from around a year ago.

PhilMiller commented 5 years ago

Original date: 2013-09-16 19:54:43


Does this hang occur with ibverbs on the last stable release, v6.5.1?

pplimport commented 5 years ago

Original author: Abhishek Gupta Original date: 2013-09-17 18:35:38


Yes, it does. Xiang reported that earlier.

Actually today Nikhil and I tested it again, and it did not hang with 6.5.1. It hangs with the latest charm though.

nikhil-jain commented 5 years ago

Original date: 2013-09-20 16:11:43


Tracked the issue back to a Makefile change I made. The compilation of sockRoutines.c needs to be passed a compile-time macro, which I lost in my changes to the Makefile. However, why that causes hangs is still unclear. I have sent an email to Orion/Gengbin to get their opinion on why a CmiTmp* buffer scheme was implemented in sockRoutines.

PhilMiller commented 5 years ago

Original date: 2013-09-24 17:02:04


Which commit to Makefile are you referring to? I'd like to understand what happened a bit better, and possibly see if other bugs are related to this.

nikhil-jain commented 5 years ago

Original date: 2013-09-24 17:10:03


commit id: 3795ee9b4d279ddd1d9c95ea3e270fd341df646c

I had modified the Makefile to generate the compilation commands for converse-related files using Make.depends. What I missed was that, for the compilation of sockRoutines, a macro was being defined at compile time, which enables the use of different routines for memory allocation. I am still not sure why this affects correctness, but per Orion it has to do with the handling of stacks. I am looking into it.
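For readers unfamiliar with this pattern, here is a minimal sketch of how a compile-time macro can switch a source file between two sets of allocation routines. It is not the sockRoutines code: the thread does not name the macro, so `USE_RUNTIME_TMP_ALLOC`, `tmp_alloc`, `tmp_free`, `tmp_backend`, `copy_via_tmp`, and the `runtime_*` helpers are all hypothetical names, and the small stack-style pool only stands in for a runtime-provided temporary allocator.

```c
/* Sketch only: every identifier here is hypothetical, not a Charm++ name. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#if defined(USE_RUNTIME_TMP_ALLOC) && USE_RUNTIME_TMP_ALLOC
/* Stand-in for a runtime-provided temporary allocator: a small LIFO pool. */
static unsigned char pool[4096];
static size_t pool_top = 0;

static void *runtime_tmp_alloc(size_t n) {
    if (n > sizeof pool - pool_top) return NULL;  /* pool exhausted */
    void *p = pool + pool_top;
    pool_top += n;
    return p;
}

static void runtime_tmp_free(void *p) {
    /* LIFO release: roll the top of the pool back to this block. */
    pool_top = (size_t)((unsigned char *)p - pool);
}

#define tmp_alloc(n) runtime_tmp_alloc(n)
#define tmp_free(p)  runtime_tmp_free(p)
#define TMP_BACKEND  "runtime"
#else
/* Macro not defined: silently fall back to the C library allocator. */
#define tmp_alloc(n) malloc(n)
#define tmp_free(p)  free(p)
#define TMP_BACKEND  "malloc"
#endif

/* Reports which allocation backend this file was compiled with. */
const char *tmp_backend(void) { return TMP_BACKEND; }

/* Copies n bytes from src to dst through a temporary buffer.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int copy_via_tmp(const char *src, char *dst, size_t n) {
    assert(n > 0);
    char *tmp = tmp_alloc(n);
    if (tmp == NULL) return -1;
    memcpy(tmp, src, n);
    memcpy(dst, tmp, n);
    tmp_free(tmp);
    return 0;
}
```

Compiling this file with `-DUSE_RUNTIME_TMP_ALLOC=1` selects the pool, and omitting the flag selects `malloc`/`free`. Both variants build without warnings, which is why a flag dropped from a Makefile rule can go unnoticed until behavior changes at run time.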

nikhil-jain commented 5 years ago

Original date: 2013-10-04 03:22:11


No further issue reported. Closing this issue.