StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Cray compiler static linking is broken #1717

Closed elliottslaughter closed 1 month ago

elliottslaughter commented 1 month ago

This is not a Legion bug. I am tracking it here so everyone is aware, but there are no action items for anyone here associated with this.

Recent Cray compilers I have tested seem to be incapable of static linking. I've observed this on both Perlmutter and Frontier. I have demonstrated this with a trivial hello world program:

Perlmutter (PrgEnv-gnu is default, so does not need to be loaded):

```
$ cat test_static.c
int main(int argc, char **argv) {
  return 0;
}
$ export CRAYPE_LINK_TYPE=static
$ cc test_static.c -o test_static
/usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: cannot find -lcupti: No such file or directory
/usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: cannot find -lcudart: No such file or directory
/usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: cannot find -lcuda: No such file or directory
/usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: cannot find -lpmi: No such file or directory
/usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: cannot find -lpmi2: No such file or directory
collect2: error: ld returned 1 exit status
```

Frontier:

```
$ cat test_static.c
int main(int argc, char **argv) {
  return 0;
}
$ module load PrgEnv-gnu
$ export CRAYPE_LINK_TYPE=static
$ cc test_static.c -o test_static
/usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: cannot find -lz: No such file or directory
/usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: cannot find -lz: No such file or directory
collect2: error: ld returned 1 exit status
```
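For what it's worth, those `-l` flags are injected by the Cray compiler wrapper based on the loaded modules, not by anything in the source file. If I remember the wrapper option right, you can see exactly what `cc` forwards to the underlying GCC with:

```
$ cc -craype-verbose test_static.c -o test_static
```

which prints the full underlying compiler command line, including the `-lcupti`/`-lcudart`/`-lcuda`/`-lpmi` (Perlmutter) or `-lz` (Frontier) flags added by the modules.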

I have filed the following issues with the respective facilities:

Practically speaking, I am not sure static linking is a serious use case at these facilities anymore. Maybe that is just as well, since it causes us some trouble anyway.

Modules below for posterity.

Modules on Perlmutter:

```
$ module list
Currently Loaded Modules:
  1) craype-x86-milan
  2) libfabric/1.15.2.0
  3) craype-network-ofi
  4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta
  5) gcc-native/12.3
  6) perftools-base/23.12.0
  7) cpe/23.12
  8) cudatoolkit/12.2
  9) craype-accel-nvidia80
 10) gpu/1.0
 11) craype/2.7.30        (c)
 12) cray-dsmml/0.2.2
 13) cray-mpich/8.1.28    (mpi)
 14) cray-libsci/23.12.5  (math)
 15) PrgEnv-gnu/8.5.0     (cpe)

Where:
  mpi:  MPI Providers
  cpe:  Cray Programming Environment Modules
  math: Mathematical libraries
  c:    Compiler
```
Modules on Frontier:

```
$ module list
Currently Loaded Modules:
  1) craype-x86-trento
  2) libfabric/1.15.2.0
  3) craype-network-ofi
  4) perftools-base/23.12.0
  5) xpmem/2.6.2-2.5_2.40__gd067c3f.shasta
  6) cray-pmi/6.1.13
  7) Core/24.00
  8) tmux/3.2a
  9) hsi/default
 10) lfs-wrapper/0.0.1
 11) DefApps
 12) emacs/28.1
 13) gcc-native/12.3
 14) craype/2.7.31
 15) cray-dsmml/0.2.2
 16) cray-mpich/8.1.28
 17) cray-libsci/23.12.5
 18) PrgEnv-gnu/8.5.0
 19) darshan-runtime/3.4.0-mpi
```
lightsighter commented 1 month ago

I was going to ask this in the meeting but didn't want to cause a digression: is it even possible to statically link libcuda? I'm pretty sure the answer is 'no' because there is no way to statically link against the CUDA driver. There is only libcuda.so; there's no such thing as libcuda.a. @muraj can say for sure.

There also seems to be a bigger issue with the Cray compilers, since a bunch of those other libraries are not CUDA-related, but I figured I would point this out in case the inability to statically link CUDA was the underlying problem.

muraj commented 1 month ago

> is it even possible to statically link libcuda

Correct, it is not. There is no libcuda.a, only libcuda.so, as it is effectively version-locked with the kernel-mode driver (some compatibility features notwithstanding). The same goes for CUPTI and many other libraries provided by the CUDA toolkit (though not cudart).

The likely reason you cannot find these libraries is that they don't have static versions available; I don't think libz distributes a static version either. I don't think this is a compiler issue but a setup issue. Using CMake would make this apparent immediately at the configure step.
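A quick filesystem check makes that concrete (paths are illustrative and assume the cudatoolkit module sets `CUDA_HOME`; exact locations vary by system):

```
$ ls $CUDA_HOME/lib64/libcudart_static.a   # cudart does ship a static archive
$ ls $CUDA_HOME/lib64/libcuda.a            # no such file: the driver API is .so-only
$ ls /usr/lib64/libcuda.so*                # the driver library, installed alongside the kernel-mode driver
```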

I am not sure I understand the use case for static linking at this level. Realm is actually trying to move in a direction where we always dynamically inspect the capabilities of the system, based on what features it was compiled with, and run with whatever it finds. Statically linking everything runs counter to this, so I'd like to understand this use case more.

elliottslaughter commented 1 month ago

The use case is running on systems our users need to run on.

Historically, there were systems that either required or strongly encouraged static linking. Cray systems were a typical example; CRAYPE_LINK_TYPE=static used to be the default on these systems rather than an opt-in. Because it was the default, dynamic linking was often under-tested and broken in various subtle ways, so even though it might work "in theory", you could easily end up in situations where things broke merely because you were doing something differently from everyone else on the system. I remember needing to explicitly invoke ld-linux.so on Piz Daint because their loader was broken; that bug was outstanding for literally years before they finally fixed it.
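(For anyone who hasn't run into that workaround: invoking the dynamic loader explicitly looks roughly like the line below. The loader path varies by distribution, so this is illustrative rather than the exact Piz Daint incantation.)

```
$ /lib64/ld-linux-x86-64.so.2 ./my_app arg1 arg2
```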

At the moment, things have swung in the other direction. Cray made CRAYPE_LINK_TYPE=dynamic the default in 2019. Since then apparently the static option has bitrotted. However, much of the official documentation that I can find still recommends setting CRAYPE_LINK_TYPE=static when possible, presumably for the improvement in startup times. That documentation may be obsolete given my findings, but it still indicates some level of desire for static binaries.

I suppose if NVIDIA refuses to provide a static library option for CUDA, then by definition any system with CUDA will be forced to use dynamic linking. Therefore, we probably do not need to be overly concerned with static binaries in this case.

However, before we go and rip out the ability to have static binaries across the board, I would want to check with our users, especially those close to the machine procurement process, and make sure we're not painting ourselves into a corner.

lightsighter commented 1 month ago

Cray used to require static linking because they used to force you to build on the login nodes, which had a different architecture and OS than the compute nodes of their machines. They had a super minimalist, stripped-down Unix-based OS on the compute nodes, the thinking being that they would maximize performance by stripping away all the "cruft" of the OS and ensure there were no file system dependencies other than the binary you were running and the data you were loading/storing. The OS on the compute nodes was so minimalist that you couldn't even ssh to them normally. Eventually they realized this was dumb because it made it harder for users to use the machine and was also incompatible with GPUs, and they've since transitioned away from requiring static linking. I at least don't know of anyone else that requires static linking, but I agree it would be good to check.

muraj commented 1 month ago

Yeah, that sounds like an extremely limiting system, and it does run counter to the one-build-package idea that we're trying to go for with Realm if we can't use dlopen on these machines. They would need to build and link in Realm themselves, and yeah, CUDA support would not be an option.
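To make the dlopen point concrete, here is a minimal sketch (not Realm's actual code) of the kind of runtime probing we are aiming for: the binary has no link-time dependence on libcuda at all, but it does need a working dynamic loader, which is exactly what a fully static build gives up.

```
/* probe_cuda.c: hypothetical sketch of dlopen-based CUDA detection.
 * Build with: cc probe_cuda.c -ldl -o probe_cuda */
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
  /* Open the driver library at runtime instead of linking -lcuda. */
  void *h = dlopen("libcuda.so.1", RTLD_NOW | RTLD_LOCAL);
  if (!h) {
    printf("no CUDA driver found; running without GPU support\n");
    return 0;
  }
  /* Resolve cuInit by name: CUresult cuInit(unsigned int). */
  int (*cu_init)(unsigned int) = (int (*)(unsigned int))dlsym(h, "cuInit");
  if (cu_init && cu_init(0) == 0 /* CUDA_SUCCESS */) {
    printf("CUDA driver initialized via dlopen\n");
  } else {
    printf("CUDA driver present but could not be initialized\n");
  }
  dlclose(h);
  return 0;
}
```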

elliottslaughter commented 1 month ago

Here's the response from NERSC:

> static linking is not officially supported on NERSC systems anymore. The base OS doesn't support all edge cases, and many packages don't ship .a files anymore -- including CUDA. You can still build static apps in a limited sense. E.g. if you set:
>
> ```
> module unload cudatoolkit
> module load cray-pmi
> ```
>
> then your example compiles.

(It actually doesn't work for me; maybe I need to unload more modules to really get out of CUDA mode.)
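Pure speculation on my part, but looking at the Perlmutter module list above, more than one loaded module is CUDA-related, so perhaps all of them need to go before the wrapper stops injecting the CUDA link flags:

```
$ module unload cudatoolkit craype-accel-nvidia80 gpu
$ module load cray-pmi
$ cc test_static.c -o test_static
```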

muraj commented 1 month ago

So, is there any need to support full application static linking then?

elliottslaughter commented 1 month ago

The only reason, in my view, to support full application static linking is if the system requires it. At the moment, it appears that HPE/Cray have effectively deprecated static linking by making it non-default (and allowing the setting to bitrot, even without CUDA). HPE/Cray was historically the biggest champion of static linking, so if they're not doing it, we likely won't see another machine that does. However, I'm trying to make sure we don't miss anything, because reversing this decision will be harder once we've ripped things out.

Unless you hear from me again, I think it's safe to assume that full application static linking is NOT something we need to support, and to proceed with that assumption.

muraj commented 1 month ago

Understood, I'll close this issue for now then.