chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

GPU support leaves some empty directories in `/tmp` #25631

Open ShreyasKhandekar opened 2 months ago

ShreyasKhandekar commented 2 months ago

There are two types of empty directories left over:

  1. Per runtime build: gpu-nvidia-cub-<pid>. This comes from a C++ file (gpu-nvidia-cub.cc).
  2. Per application build: rtmain-<pid>. This is our runtime, which happens to be a C++ compilation as well.

Both of these files are compiled using clang -x cuda or clang -x hip. This is a clang issue, not an issue with our GPU support, so we won't fix it.

I've submitted a bug report to the LLVM developers here: https://github.com/llvm/llvm-project/issues/100468

Here is a more detailed result of the investigation:

This happens with both NVIDIA and AMD, for any file (CUDA, HIP, or C++). For example, take a simple file with nothing CUDA- or HIP-specific:

// sample.cc
#include <iostream>

int main() {
    std::cout << "Hello, World!" << std::endl;
    return 0;
}

For CUDA, compile with:

clang++ -x cuda sample.cc

Or for HIP, compile with:

clang++ -x hip sample.cc

We can see new temp directories being created on every compile (listing the newest first):

$ find /tmp -maxdepth 1 -type d -name "*sample*" -printf '%T+ %p\n' | sort -r | head -n 20
2024-07-22+13:25:32.9609698390 /tmp/sample-646dae
2024-07-22+13:25:32.9609698390 /tmp/sample-56e594
2024-07-22+13:25:11.3652665900 /tmp/sample-ad5032
2024-07-22+13:25:11.3652665900 /tmp/sample-1e8723
2024-07-22+13:24:44.7776319390 /tmp/sample-f98e7b
2024-07-22+13:24:44.7776319390 /tmp/sample-f459d6

I was able to reproduce this with LLVM 14 and 15.

Since clang causes this, the per-runtime-build and per-application-build cases at the beginning of this post share the same root cause.

If we really wanted to work around this, we could specify a temp dir for clang++ and then clean that up manually post compilation.
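
The workaround described above could look roughly like the following sketch. It assumes that clang picks its temporary location from the `TMPDIR` environment variable, which is worth verifying for the clang version in use. `with_private_tmpdir` is a hypothetical helper name, not something that exists in our build scripts.

```shell
#!/bin/sh
# Run a command with a private temporary directory, then delete that directory.
# with_private_tmpdir is a hypothetical helper, named here for illustration only.
with_private_tmpdir() {
    # mktemp -d creates a fresh, uniquely named directory and prints its path.
    private_tmp=$(mktemp -d) || return 1

    # Assumption: the command (e.g. clang++) consults TMPDIR when deciding
    # where to create its temporary files and directories.
    status=0
    env TMPDIR="$private_tmp" "$@" || status=$?

    # Remove the private directory together with anything left inside it,
    # then report the exit status of the wrapped command.
    rm -rf "$private_tmp"
    return "$status"
}
```

It would be invoked as `with_private_tmpdir clang++ -x cuda sample.cc`. Any leftover `sample-*` directories would then land inside the private directory and be removed along with it, rather than accumulating in `/tmp`.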

ShreyasKhandekar commented 2 months ago

Based on the pointers that we got from https://github.com/llvm/llvm-project/issues/100468, I looked closer with different versions of clang and it looks like the issue doesn't exist in clang 17 and 18, but can be seen in clang 16 and earlier.

Running under strace with clang 16 or earlier, I can see the mkdirs with no corresponding rmdirs:

❯ clang++  --version
clang version 16.0.6 (https://github.com/llvm/llvm-project.git 7cbf1a2591520c2491aa35339f227775f4d3adf6)
❯ strace -ff -o sample.log clang++  -x cuda sample.cc
clang-16: warning: CUDA version 11.8 is only partially supported [-Wunknown-cuda-version]

Looking for the tmp dirs that were left over:

❯ find /tmp -maxdepth 1 -type d -name "*sample*" -printf '%T+ %p\n' | sort -r | head -n 20
2024-07-25+16:40:23.6198345070 /tmp/sample-b7aeef
2024-07-25+16:40:23.6198345070 /tmp/sample-2350f5

And then looking into the strace log:

❯ grep "sample-b7aeef" sample.log*
sample.log.16658:mkdir("/tmp/sample-b7aeef", 0770)       = 0
sample.log.16658:access("/tmp/sample-b7aeef/sample-sm_35.o", W_OK) = -1 ENOENT (No such file or directory)
sample.log.16658:access("/tmp/sample-b7aeef/sample-sm_35.cubin", W_OK) = 0
sample.log.16658:stat("/tmp/sample-b7aeef/sample-sm_35.cubin", {st_mode=S_IFREG|0644, st_size=904, ...}) = 0
sample.log.16658:lstat("/tmp/sample-b7aeef/sample-sm_35.cubin", {st_mode=S_IFREG|0644, st_size=904, ...}) = 0
sample.log.16658:unlink("/tmp/sample-b7aeef/sample-sm_35.cubin") = 0
sample.log.16660:execve("/usr/local/cuda-11.8/bin/ptxas", ["/usr/local/cuda-11.8/bin/ptxas", "-m64", "-O0", "--gpu-name", "sm_35", "--output-file", "/tmp/sample-b7aeef/sample-sm_35."..., "/tmp/sample-2350f5/sample-sm_35."...], 0x7ffc9799c608 /* 102 vars */) = 0
sample.log.16660:openat(AT_FDCWD, "/tmp/sample-b7aeef/sample-sm_35.cubin", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
sample.log.16661:openat(AT_FDCWD, "/tmp/sample-b7aeef/sample-sm_35.cubin", O_RDONLY) = 3

❯ grep "sample-2350f5" sample.log*
sample.log.16658:mkdir("/tmp/sample-2350f5", 0770)       = 0
sample.log.16658:access("/tmp/sample-2350f5/sample-sm_35.s", W_OK) = 0
sample.log.16658:stat("/tmp/sample-2350f5/sample-sm_35.s", {st_mode=S_IFREG|0644, st_size=89, ...}) = 0
sample.log.16658:lstat("/tmp/sample-2350f5/sample-sm_35.s", {st_mode=S_IFREG|0644, st_size=89, ...}) = 0
sample.log.16658:unlink("/tmp/sample-2350f5/sample-sm_35.s") = 0
sample.log.16659:stat("/tmp/sample-2350f5/sample-sm_35.s", 0x7ffdb034ad40) = -1 ENOENT (No such file or directory)
sample.log.16659:openat(AT_FDCWD, "/tmp/sample-2350f5/sample-sm_35-c1dbc0da.s.tmp", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 4
sample.log.16659:rename("/tmp/sample-2350f5/sample-sm_35-c1dbc0da.s.tmp", "/tmp/sample-2350f5/sample-sm_35.s") = 0
sample.log.16660:execve("/usr/local/cuda-11.8/bin/ptxas", ["/usr/local/cuda-11.8/bin/ptxas", "-m64", "-O0", "--gpu-name", "sm_35", "--output-file", "/tmp/sample-b7aeef/sample-sm_35."..., "/tmp/sample-2350f5/sample-sm_35."...], 0x7ffc9799c608 /* 102 vars */) = 0
sample.log.16660:openat(AT_FDCWD, "/tmp/sample-2350f5/sample-sm_35.s", O_RDONLY) = 3
sample.log.16661:openat(AT_FDCWD, "/tmp/sample-2350f5/sample-sm_35.s", O_RDONLY) = 3
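
The manual check above (a mkdir with no matching rmdir) can be automated across all of the per-process strace logs. The following is a minimal sketch. `unmatched_mkdirs` is a hypothetical helper name, and the script assumes that successful calls appear in the `mkdir("path", mode) = 0` shape shown above, with an analogous `rmdir("path") = 0` shape for removals.

```shell
#!/bin/sh
# Print every directory that was created (mkdir) but never removed (rmdir)
# according to the given strace log files.
# unmatched_mkdirs is a hypothetical helper, named here for illustration only.
unmatched_mkdirs() {
    awk '
        # Only successful calls count; strace ends those lines with = 0.
        # The path is the first double-quoted string on the line.
        /mkdir\("/ && / = 0$/ { split($0, parts, "\""); made[parts[2]] = 1 }
        /rmdir\("/ && / = 0$/ { split($0, parts, "\""); delete made[parts[2]] }
        # Whatever is still recorded at the end was never removed.
        END { for (dir in made) print dir }
    ' "$@" | sort
}
```

With the logs produced by `strace -ff -o sample.log`, it would be called as `unmatched_mkdirs sample.log.*`. It only looks at plain `mkdir` and `rmdir` calls, so directories created or removed through other syscalls (such as `mkdirat`) are not tracked.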

As I mentioned above, this issue seems to have already been resolved in clang 17 and later.

I will leave this issue open for a while, since it already has the won't fix/ain't broke label, in case another user runs into it, although I doubt that will happen; even we stumbled upon this accidentally. The empty directories don't take up space either, so this really is a benign issue. In our case, once we update our system LLVM/clang to 17 or later, this should go away on its own.