Hi @tylerjereddy,
I suspect that the segfault comes from somewhere in https://github.com/ofiwg/libfabric/blob/main/src/hmem_cuda_gdrcopy.c#L346-L380 or https://github.com/NVIDIA/gdrcopy/blob/master/src/gdrapi.c#L387-L411. Can you use gdb to tell the exact line where this segfault is triggered?
For GDRCopy, you may want to change https://github.com/NVIDIA/gdrcopy/blob/master/src/Makefile#L29 to -O0 -g so that it is friendlier to gdb. I guess that libfabric also has a similar compile option somewhere. Alternatively, you can manually instrument the code by adding printf calls and narrow down where the segfault comes from.
I'm working on it; the exact nature of the failure is not deterministic, even between trials with the same builds, it seems. I'll keep working on narrowing things down for at least one of the failure scenarios. I'll also paste a few more example outputs that looked a little different (they didn't actually segfault, just errored).
I see 65536 (2^16) as the rounded size in one of the errors in there. Anyway, I'll try to dig deeper. My prints aren't showing up yet, so there's obviously something I'm not understanding.
Ah, I completely purged my custom install of gdrcopy and the segfault/backtrace persisted, so it looks like some component in the dependency chain is ignoring the gdrcopy that I ask NVSHMEM to use via GDRCOPY_HOME when I build NVSHMEM from source. I did confirm that I can see prints from my custom gdrcopy install in fi_info commands with my custom libfabric build, but something in this backtrace isn't respecting the gdrcopy version that I want to use for debugging, since it doesn't care if I build NVSHMEM with a different gdrcopy:
Hi @tylerjereddy,
I reviewed the NVSHMEM libfabric transport code. It does not use GDRCopy with Slingshot -- at least in NVSHMEM 2.10.1. However, libfabric itself (not NVSHMEM libfabric transport) uses GDRCopy. Based on the backtrace logs you posted, I think NVSHMEM calls into libfabric, which in turn triggers this issue. I think we can ignore NVSHMEM for now.
Guessing from your first comment, you originally ran with GDRCopy v2.3 and then moved to the master branch, right? Do you have root access on your system? Have you reloaded the gdrdrv driver from the master branch? If you have root access, can you enable debugging in the gdrdrv driver? After compiling GDRCopy, you can simply modify https://github.com/NVIDIA/gdrcopy/blob/master/insmod.sh#L28 to set dbg_enabled=1 info_enabled=1 and call sudo ./insmod.sh. Please run sudo dmesg -w in a separate shell. When you run your application and hit a GDRCopy error, you will see more lines in dmesg. Please show me those lines.
ERR: error Cannot allocate memory(12) while mapping handle 92f4280, rounded_size=65536 offset=1fe380000
This line does not make sense to me. In most cases, the error code should be propagated from the gdrdrv driver. However, the driver never returns -ENOMEM (12) in the mmap path. And that line, with that phrase, can only be printed from mmap inside libgdrapi. One possibility is that ENOMEM is a stale error number from some other code path. Before this line, can you add printf("ERRNO before calling mmap %d\n", errno);? You can also reset errno = 0 before calling mmap.
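For concreteness, that instrumentation would look roughly like the sketch below; map_with_errno_check and its parameters are just stand-ins for the real mmap call site inside gdr_map in gdrapi.c, not the actual code.

#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical helper standing in for the mmap() call site in gdr_map(). */
static void *map_with_errno_check(int fd, size_t rounded_size, off_t magic_offset)
{
    void *va;

    /* Print whatever stale errno is left over from earlier code paths. */
    printf("ERRNO before calling mmap %d\n", errno);

    /* Reset it so a failure below reports its own error code. */
    errno = 0;

    va = mmap(NULL, rounded_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, magic_offset);
    if (va == MAP_FAILED)
        fprintf(stderr, "mmap failed, errno=%d\n", errno);

    return va;
}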
I don't have root access; it is a supercomputer at LANL. I could perhaps pass a link to your suggestions along to HPC support to see if there's anything they can check.
Have you reloaded the gdrdrv driver from the master branch?
I think the HPC admins are looking into your comment a bit, but I wanted to check on a few things:
1. any risk that some problems arise because I'm building a newer gdrcopy than the driver version available on the HPC machine?
2. any risk that CXI proper (closed source HPE thing I think?) is somehow associated with an older gdrcopy version and/or driver? Or should LD_LIBRARY_PATH allow me to easily swap gdrcopy versions at runtime irrespective of how CXI was installed and the specific gdrcopy driver version available?

any risk that some problems arise because I'm building a newer gdrcopy than the driver version available on the HPC machine?
libgdrapi.so and gdrdrv (driver) are forward and backward compatible. Still, there might be some bugs we have fixed in a newer version of gdrdrv. It would be good to use the latest release version.
Your application talks to libgdrapi.so (not directly to gdrdrv). For this one, it is backward compatible only. For example, if you compile with GDRCopy v2.4, we cannot guarantee that your application will work with libgdrapi.so v2.3.
any risk that CXI proper (closed source HPE thing I think?) is somehow associated with an older gdrcopy version and/or driver? Or should LD_LIBRARY_PATH allow me to easily swap gdrcopy versions at runtime irrespective of how CXI was installed and the specific gdrcopy driver version available?
I don't know the answer. Is this a user-space library or a driver? If it is a user-space library, you can probably ldd <lib.so> and see if it links with libgdrapi.so. It is possible that they use dlopen; that will be more challenging to detect. If it is a driver, the answer is no: gdrdrv does not export any symbols, so no other drivers can call into gdrdrv.
By the way, you may want to try setting use_persistent_mapping=1 on some systems. This is a gdrdrv module parameter; you set it when you load gdrdrv. I did not suggest this earlier because the issues you encountered were during gdr_map. Without use_persistent_mapping=1, you may run out of GPU BAR1, but an error should then show up during gdr_pin_buffer or when you call ibv_reg_mr (from the IB stack). So, this parameter might be irrelevant, but you can try setting it if you plan to reload gdrdrv to enable the debug mode.
So, my debug prints were not showing up because prepending my custom gdrcopy builds to LD_LIBRARY_PATH was insufficient to override the gdrcopy that was linked to ucx, which in turn was linked to OpenMPI. That's pretty confusing, but for now, swapping at runtime to a ucx that does not have gdrcopy linked to it allows me to see my prints from another gdrcopy loaded via LD_LIBRARY_PATH.
Anyway, now I should be able to report some better debug prints.
More detailed debug analysis below, now that I can use a custom gdrcopy build with a lower optimization level and interwoven prints. Keep in mind that the errors are not fully deterministic, which still makes it a little tricky to drill down, but these analyses should be deeper than before at least. Here are three failure scenarios I've captured:
1. The err block of gdr_unmap is hit, and the backtrace has more info now; it ultimately seems to lead to gdr_unpin_buffer() at https://github.com/NVIDIA/gdrcopy/blob/bb139287bfe4dd2566bc2d422af1a5082e51f353/src/gdrapi.c#L270 (since my debug prints change the line numbering a bit).
2. The err block of gdr_unmap is hit, but this time it seems to hit a line of code in the closed-source HPE CXI library code:
3. The err block of gdr_unmap is hit, and this time the final line reported in the backtrace is at https://github.com/NVIDIA/gdrcopy/blob/bb139287bfe4dd2566bc2d422af1a5082e51f353/src/gdrapi.c#L396 (because of the debug prints changing line numbers).

Does this give you any more traction to diagnose the problem? While I wait to hear back about the debug driver stuff, is there anything else you want me to try here? It also seems to me like there's a misunderstanding somewhere with UCX + gdrcopy + OpenMPI if my provider is actually CXI? I was originally asked to build OpenMPI linked to UCX + gdrcopy + CUDA.
Thank you @tylerjereddy. I suspect that you may be running into a race condition from multithreading. GDRCopy, especially libgdrapi.so, is not thread safe. Anyway, I added a global lock to some functions in this branch: https://github.com/NVIDIA/gdrcopy/tree/dev-issue-296-exp. Please try it and see if it helps. You just need to recompile libgdrapi.so and use that; there is no need to install a new gdrdrv driver.
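For reference, the serialization described above amounts to roughly the following pattern (a minimal sketch, not the actual dev-issue-296-exp code; the function name and empty body are placeholders for the existing gdr_unmap logic).

#include <pthread.h>

/* One process-wide lock shared by the instrumented API entry points. */
static pthread_mutex_t gdr_api_lock = PTHREAD_MUTEX_INITIALIZER;

/* Placeholder showing the locking pattern only; the real body would be the
 * existing gdr_unmap() work (handle lookup, munmap, bookkeeping). */
int gdr_unmap_locked_sketch(void)
{
    int ret = 0;

    pthread_mutex_lock(&gdr_api_lock);
    /* ... original gdr_unmap() work happens here ... */
    pthread_mutex_unlock(&gdr_api_lock);

    return ret;
}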
I still see errors that are not deterministic on that branch (I reduced the optimization level again as well).
Note that ERR: Error in pthread_mutex_init with errno=5296 now occurs even on some simple fi_info commands, like fi_info -p cxi.
Hanging seems more common on this branch now as well, and fi_info commands seem slower. Perhaps not surprising if something isn't quite right with lock acquisition, I suppose?
Sorry, there was a left-over code block. I just removed it. Please try again.
Note that this is not our final solution. It is just an ad hoc implementation to see if it helps. It might not work if the caller calls a GDRCopy API with a stale memory handle. For example, if they call gdr_close and then gdr_unmap or gdr_unpin_buffer, libgdrapi.so will access a memory handle object that has already been freed.
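In caller terms, the misuse described above is roughly the sequence below (a sketch only; error checking is omitted, and gpu_addr/size stand in for the application's buffer).

#include <stddef.h>
#include <gdrapi.h>

/* Sketch of the stale-handle misuse described above; not real application code. */
static void stale_handle_misuse(unsigned long gpu_addr, size_t size)
{
    gdr_t g = gdr_open();
    gdr_mh_t mh;
    void *va = NULL;

    gdr_pin_buffer(g, gpu_addr, size, 0, 0, &mh);
    gdr_map(g, mh, &va, size);

    gdr_close(g);                 /* tears down libgdrapi's handle bookkeeping  */
    gdr_unmap(g, mh, va, size);   /* use-after-free: mh state was already freed */
    gdr_unpin_buffer(g, mh);      /* likewise operates on a dangling handle     */
}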
Here's the backtrace for the 2-node cuFFTMp reproducer with your updated branch (with optimization level reduced):
So, the crash seems to be near here in your new branch, in _gdr_unpin_buffer, while attempting to remove an element from a list:
https://github.com/NVIDIA/gdrcopy/blob/d2299254aff052ec1d29646f90d715589f5e0994/src/gdrapi.c#L281
Now, if we look at the special branch of libfabric I'm using, in the function cuda_gdrcopy_dev_unregister(), which is called just before control flow returns to gdrcopy proper, I see two calls that may be worth asking you about. First, there is a call to gdr_unmap(), then right after that, there is a call to gdr_unpin_buffer(). Both operate on the same handle/structure member, it seems. Here is the permalink to that particular libfabric branch/code block, which I think I needed for CXI support: https://github.com/thomasgillis/libfabric/blob/10caf878ccacedd2ce907e8e714a9d90d74d63ca/src/hmem_cuda_gdrcopy.c#L359-L368
The situation looks the same in the main branch of libfabric, for that particular block of code: https://github.com/ofiwg/libfabric/blob/f41cea52738da193fd312ce9cf0a1adf23acaa8f/src/hmem_cuda_gdrcopy.c#L359-L368
All of this code is in a libfabric code block with a #if ENABLE_GDRCOPY_DLOPEN preprocessor guard (or just after it). I decided to mess around a little with that code block on the cxi-enabled branch of libfabric using the diff below the fold.
Although deleting gdr_unpin_buffer doesn't really protect me from backtraces/problems, print checkpoints 5 and 6 are hit regularly, suggesting that non-zero exit codes are returned regularly from gdr_unmap on the special gdrcopy branch. Is this more helpful? Is there anything in that libfabric code block that could be safer/better?
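For illustration, one "safer" shape for that unregister path might look like the sketch below: skip entries that were never mapped or are already unregistered, only unpin when the unmap succeeded, and clear the cached pointer so a repeat call becomes a no-op. The struct and field names here are made up, not libfabric's actual ones.

#include <stddef.h>
#include <stdio.h>
#include <gdrapi.h>

/* Illustrative registration record; not libfabric's real struct. */
struct reg_entry {
    gdr_mh_t mh;
    void    *user_ptr;   /* host mapping returned by gdr_map() */
    size_t   length;
};

static int safer_unregister(gdr_t gdr, struct reg_entry *e)
{
    int err;

    if (!e || !e->user_ptr)        /* never mapped, or already unregistered */
        return 0;

    err = gdr_unmap(gdr, e->mh, e->user_ptr, e->length);
    if (err) {
        fprintf(stderr, "gdr_unmap failed: %d\n", err);
        return err;                /* do not unpin a handle we failed to unmap */
    }

    err = gdr_unpin_buffer(gdr, e->mh);
    e->user_ptr = NULL;            /* make accidental repeat calls harmless */
    return err;
}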
@tylerjereddy Thank you for the additional info. We also call gdr_unpin_buffer inside gdr_close, but I don't expect to see a segfault in LIST_REMOVE if it comes from there. A few requests:
1. Can you run the 2-node reproducer again with the latest version of the dev-issue-296-exp branch and post the raw output, separating the output from the two nodes?
2. Do gdrcopy_copybw and gdrcopy_sanity work?

Starting with your second point, the GDRCopy test applications: I used the latest master branch of GDRCopy without modification and the dev-cxi branch of libfabric without modification (I'm guessing it didn't use libfabric here, but just to be safe...).
The output is below; the sanity check seems to "pass" but spits out errors?
The modified interactive script for the 2-node test:
For the first point, using the latest version of the dev-issue-296-exp gdrcopy branch with the original cuFFTMp 2-node reproducer, this is the raw output:
After that, I tried to do a bit more work. First, I added another print in _gdr_unpin_buffer after the free operation, because your print after the LIST_REMOVE did show up in the failure scenario I pasted above.
On top of that, per the request to separate the output by node, I made a few more changes to the source to prefix the hostname in each of the prints. These changes are available on my fork of gdrcopy (https://github.com/tylerjereddy/gdrcopy) on the feature branch treddy_gh_296 (it just adds a few commits on top of your branch).
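The hostname prefixing is nothing fancy; it is roughly the pattern below (a sketch with a made-up macro name; the actual prints in that branch are open-coded).

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Hostname/pid/tid-prefixed checkpoint print, roughly what treddy_gh_296 does.
 * Usage: GDR_CHECKPOINT("gdr_unmap: 1: mh=%p", some_pointer); */
#define GDR_CHECKPOINT(fmt, ...)                                          \
    do {                                                                  \
        char host_[256] = "unknown";                                      \
        gethostname(host_, sizeof(host_));                                \
        printf("===> [%s, %d, %ld] " fmt "\n", host_, (int)getpid(),      \
               (long)syscall(SYS_gettid), ##__VA_ARGS__);                 \
    } while (0)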
Now, when I run the 2-node cuFFTMp reproducer with that version of gdrcopy, I see a double free or corruption (out) error, apparently at the free() call in _gdr_unpin_buffer(). Full log:
out_improved_prints.txt
That would be consistent with your original instrumented code as well, with the list removal "succeeding" but the free failing inside of _gdr_unpin_buffer. I think you were already worried about a double free somewhere above.
I ran the reproducer two more times, and this was not always the case, however; sometimes we get the printf after the free operation in _gdr_unpin_buffer() and then the backtrace happens after that:
Of course, things are not fully deterministic, and I saw the double free error happening in what appears to be other parts of the control flow as well:
I'm guessing your team has already run the code through an address sanitizer at some point, though? This is confusing! What can I do next to help get to the bottom of it?
There are multiple things that went wrong here. Let's start with the raw output from my instrumented code without your patch.
The output from the instrumented code is in [pid, tid] format. I think the caller uses multiple threads here. I didn't see locking when I reviewed the libfabric code. We probably see some racing. But the experimental branch you are using should not have this problem because I added a global lock. So, I will remove racing inside GDRCopy from the discussion for now.
Let's put some lines from the same process that reported the segfault together. They should be in chronological order.
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1aa20, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1aa20, ret=0
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1aa20, ret=0
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 2: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1aa20, ret=22
==== backtrace (tid: 119877) ====
0 0x00000000000168c0 __funlockfile() ???:0
1 0x000000000000c59e cxil_unmap() /workspace/src/github.hpe.com/hpe/hpc-shs-libcxi/WORKSPACE/BUILD/libcxi-0.9/src/libcxi.c:945
2 0x00000000000a47cb cxip_unmap() :0
As shown, they were dealing with the same mh object, based on the address. The caller called gdr_unmap two times on the same object! What made it worse is that they called gdr_unmap after _gdr_unpin_buffer; the mh object had already been destroyed before the last gdr_unmap was called. Note that mh is directly translated from the handle that the caller passes to the GDRCopy API. Basically, this is a use-after-free problem inside the caller.
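For contrast, the expected per-handle teardown is unmap first, then unpin, each exactly once, as in this sketch:

#include <stddef.h>
#include <gdrapi.h>

/* Expected lifetime of a single mapping: unmap first, then unpin, once each. */
static void expected_teardown(gdr_t g, gdr_mh_t mh, void *va, size_t size)
{
    gdr_unmap(g, mh, va, size);   /* 1. drop the CPU mapping                     */
    gdr_unpin_buffer(g, mh);      /* 2. release the pin; mh must not be reused   */
    /* Any further gdr_unmap()/gdr_unpin_buffer() on this mh is a use-after-free. */
}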
===> [nid001196, 34247, 34247] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1bf60
===> [nid001196, 34247, 34247] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1bf60
===> [nid001196, 34247, 34247] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1bf60
===> [nid001196, 34247, 34247] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1bf60, ret=0
...
===> [nid001196, 34247, 34247] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x14c509661a60
===> [nid001196, 34247, 34247] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x14c509661a60
double free or corruption (out)
[nid001196:34247] *** Process received signal ***
[nid001196:34247] Signal: Aborted (6)
[nid001196:34247] Signal code: (-6)
The mh address that the caller passed to _gdr_unpin_buffer was 0x14c509661a60. That is in a completely different memory region from the other mh objects. In fact, I cannot find a single line printed from gdr_pin_buffer showing that mh=0x14c509661a60 was created by GDRCopy. The caller probably passed in an unrelated object here.
So, suspicion would be on the libfabric side, but not as far up the control flow as NVSHMEM?
IIUC, NVSHMEM does not use GDRCopy directly in that environment. I don't know the libfabric programming model. Is it thread safe? Does it require special handling from the libfabric caller (NVSHMEM in this case)? My suggestion is to move up one step at a time. Items 2 and 3 are clearly a mistake from GDRCopy's caller. Even if we make GDRCopy thread safe, you will still run into this segfault issue.
I think I've found evidence of a spin lock in libfabric not having the right guards before unmapping and unpinning, per the issue cross-listed above. I had to paste a bunch of internal gdrcopy stuff into my sample patch there to check whether the unmapping had already happened, and I still got crashes, but that particular pathology did seem to disappear when I hacked that in...
Looking at the log you posted in the libfabric issue 10041, you have
[112670, 112670] cuda_gdrcopy_dev_unregister() checkpoint 2 after spin lock and before unmap gdrcopy->mh=(nil)
===> [nid001252, 112670, 112670] GDRCopy Checkpoint gdr_unmap: 1: mh=(nil)
...
==== backtrace (tid: 112670) ====
0 0x00000000000168c0 __funlockfile() ???:0
1 0x00000000000023fb gdr_unmap() /lustre/scratch5/treddy/march_april_2024_testing/github_projects/gdrcopy/src/gdrapi.c:459
2 0x0000000000032e33 cuda_gdrcopy_dev_unregister() :0
3 0x00000000000a4bed cxip_unmap() :0
...
So, libfabric passes NULL to gdr_unmap. That is likely the source of your segfault.
I think we agree on that, though I wasn't convinced that guarding against that was sufficient to fix all the problems, since I saw other backtraces after that was protected. I'm hoping to make another push at getting that working soon.
Working on Cray Slingshot 11, on 2 nodes with 4 x A100 each, with the test case from https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/samples/r2c_c2r_slabs_GROMACS, modified in this way to force multi-node NVSHMEM (2.10.1):

I'm seeing the output/backtrace below the fold:

My full interactive run script is this, which will tell you a bit more about various dependency versions/paths:

More gruesome details about libfabric, CXI, and CUDA support are described at https://github.com/ofiwg/libfabric/issues/10001, but since I'm apparently segfaulting in gdrcopy now, it may be helpful to determine what my next debugging steps should be here. I've already discussed things fairly extensively with the NVSHMEM team.

I built the latest gdrcopy master branch with the gcc 12.2.0 + cuda/12.0 "modules" loaded:

make -j 32 prefix=/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install CUDA=/usr/projects/hpcsoft/cos2/chicoma/cuda/12.0 all install

It would be awesome if I could get this working somehow. Note that I was originally getting different backtraces with gdrcopy 2.3.