Open dmclark17 opened 3 years ago
@evan-charmworks Any idea what could be going on here?
@dmclark17 I don't think we've seen this on any other machines. I'm not sure whether we have access to a DGX A100, but I can try to reproduce this on a DGX-2. In the meantime, if you have time, isolating the offending commit with `git bisect` may be helpful here. Also, do you know if this happens when not using a container?
Separately, we generally recommend using a `multicore` build when running on a single node, as I assume you are doing here, so it might be worth trying that if this is inhibiting production runs.
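For the `git bisect` suggestion, a hang is awkward to bisect by hand because a bad revision never returns. The following is a minimal, self-contained sketch on a throwaway repository, not the Charm++ tree: it simulates a hang with `sleep` and wraps the test in `timeout` so that `git bisect run` sees a non-zero exit status. The repository layout, the `run_test.sh` script, and the 2-second limit are all hypothetical, and `timeout` is assumed to be the GNU coreutils utility. For the real case, the good and bad endpoints would be the revisions corresponding to Charm++ 6.10.2 and 7.0.0, and the test step would be a rebuild followed by a short run under a longer timeout.

```shell
#!/usr/bin/env bash
# Toy demonstration: bisect a simulated hang in a throwaway repository.
set -euo pipefail

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "bisect-demo@example.com"
git config user.name "Bisect Demo"
git config commit.gpgsign false

# Six commits. From commit 4 onward the (hypothetical) test script sleeps far
# longer than the timeout used below, standing in for a startup hang.
for i in 1 2 3 4 5 6; do
  if [ "$i" -ge 4 ]; then
    printf '#!/bin/sh\nsleep 30\n' > run_test.sh
  else
    printf '#!/bin/sh\nexit 0\n' > run_test.sh
  fi
  chmod +x run_test.sh
  echo "$i" > version.txt
  git add run_test.sh version.txt
  git commit -q -m "commit $i"
  if [ "$i" -eq 1 ]; then good=$(git rev-parse HEAD); fi
  if [ "$i" -eq 4 ]; then expected=$(git rev-parse HEAD); fi
done

# Mark HEAD as bad and the first commit as good, then let git drive the search.
# timeout exits with status 124 when the limit is hit, which bisect treats as "bad".
git bisect start HEAD "$good" >/dev/null
git bisect run timeout 2 ./run_test.sh >/dev/null

# After a completed bisect, refs/bisect/bad points at the first bad commit.
first_bad=$(git rev-parse refs/bisect/bad)
echo "first bad commit: $(git log -1 --format=%s "$first_bad")"
git bisect reset >/dev/null 2>&1
```

If a revision in the real tree fails to build, having the test command exit with status 125 tells `git bisect run` to skip that revision instead of marking it bad.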
@dmclark17 How many logical nodes does your job invocation run on?
I do not know if the issue is reproduced outside of a container, but I can experiment with that as well as `git bisect`. As another data point, the `simplearrayhello` test also seems to hang in the `_initCharm` function when running in a container. I am using a single node right now but plan to scale to multiple nodes.
I am using 8 logical nodes for these tests.
I am observing hangs with NAMD 2.15alpha2 and Charm++ 7.0.0 on a DGX A100 node (dual AMD Rome 7742). The issue also happens with Charm++ v7.1.0-devel-89-ge24d2d3ad. I am using the following command to build Charm++:
And running NAMD with the following command:
The stall seems to happen in `_initCharm`; here is the backtrace when I attach to one of the processes:
This seems like the same issue as #2850, but it is not resolved by using the development branch or by passing `+no_isomalloc_sync`. The issue is not reproduced with Charm++ 6.10.2. Please let me know if I can provide additional information. Thanks!