LLNL / zfp

Compressed numerical arrays that support high-speed random access
http://zfp.llnl.gov
BSD 3-Clause "New" or "Revised" License

`zfp` binary CUDA support broken? #178

Open tpwrules opened 2 years ago

tpwrules commented 2 years ago

I can't figure out how to test CUDA support on my system.

I generated a simple test file: `python3 -c 'import struct; import math; open("test.bin", "wb").write(struct.pack("<256f", *list(math.sin(v/128*6.28)*1000 for v in range(256))))'`

Then I tried to compress it: `zfp -x cuda -i test.bin -z testz.bin -f -1 256 -r 10 -s -h`

Which reports the following info: type=float nx=256 ny=1 nz=1 nw=1 raw=1024 zfp=16 ratio=64 rate=0.5 rmse=707.3 nrmse=0.3536 maxe=1000 psnr=3.01

Sometimes it reports nan values or random large floats despite being given the same input data, but the statistics suggest it's not actually compressing and is just storing zeros (related to #105?). If I remove the -h flag, it just reports "compression failed". Everything works fine with the other execution modes (serial and omp).

I'm trying to build and package the 1.0.0 release, which doesn't come with tests, so I'm not sure if CUDA is broken in general or this is specific to the command line tool. I am building against CUDA 11.6.

lindstro commented 2 years ago

I'm surprised that zfp -x cuda -h even works for you as the CUDA backend does not yet handle headers (-h) correctly (see #77). Using your command line, I'm getting this message:

% zfp -x cuda -i test.bin -z testz.bin -f -1 256 -r 10 -s -h
incorrect or missing header

If you drop -h, you should be seeing this:

% zfp -x cuda -i test.bin -z testz.bin -f -1 256 -r 10 -s 
type=float nx=256 ny=1 nz=1 nw=1 raw=1024 zfp=320 ratio=3.2 rate=10 rmse=0.9085 nrmse=0.0004542 maxe=2.733 psnr=60.83

which matches what you get with the serial backend:

% zfp -i test.bin -z testz.bin -f -1 256 -r 10 -s
type=float nx=256 ny=1 nz=1 nw=1 raw=1024 zfp=320 ratio=3.2 rate=10 rmse=0.9085 nrmse=0.0004542 maxe=2.733 psnr=60.83

I'm assuming that you have built with -DZFP_WITH_CUDA=ON; you should be getting another error message if you haven't.

Let me also add that it may be worthwhile running ctest -R Cuda on the develop branch, which runs a few dozen CUDA unit tests. Those tests are not included on the release branches.

The next zfp release will focus on improving CUDA and HIP support, including new/missing capabilities, performance improvements, as well as bug fixes.

tpwrules commented 2 years ago

Thank you for the information on running the tests. They revealed the root cause of my observed failures: `Encode : 222 the provided PTX was compiled with an unsupported toolchain`. It turns out I had built against CUDA 11.6, but the CUDA version my system's driver was designed for was 11.4. I rebuilt against 11.4; now the tests pass, and the output of the zfp commands I posted matches yours.

This is usually not a problem thanks to CUDA's binary compatibility guarantees. However, those guarantees require that the CUDA kernels be compiled to binary form (cubins) for each extant compute capability, plus PTX for future capabilities. zfp appears to compile only PTX in the default case. I imagine a curated HPC software environment is much less likely to encounter this type of mismatch, but it's pretty common for ordinary PC applications and scenarios, especially when binaries are published.

Perhaps you can address this (along with improved error handling and propagation) as part of your planned GPU improvements.

lindstro commented 2 years ago

I agree that improved error reporting is a high priority. We've thought about something akin to errno, but for deep call stacks such single-error codes may not reveal much about the root cause of failure. Some kind of hierarchical error reporting would be preferable, but I'm not sure how to best achieve that in C. I'd welcome any suggestions.

Other than that, I'm not sure what zfp could/should do about issues external to the library itself, such as CUDA incompatibilities. Are you suggesting embedding information on what CUDA version zfp was built against and then checking for compatibility? That could be quite onerous in practice given dependencies on OpenMP, CUDA, HIP, SYCL, Cython, NumPy, etc., many of which are evolving faster than zfp itself. But I'm open to suggestions.

One thing that might help, which we have partially implemented on a feature branch, is GPU warmup: a simple kernel is launched when the execution policy is set, and if it fails, so does the zfp_stream_set_execution() call. The primary purpose is to avoid lazy initialization by the CUDA runtime and to perform other expensive initialization calls only once (see #135, for instance), but it could perhaps also be used for additional sanity checks.

tpwrules commented 2 years ago

The big thing on zfp's side, in my book, is to build multiple binary versions plus a PTX version of each CUDA kernel, as described on that NVIDIA page I linked. That way, distributed zfp packages will be compatible with many graphics cards and driver versions. I'm not sure how to do that in your build system, but it's how the big CUDA-dependent packages (e.g., ML frameworks) are set up, and it works well. There shouldn't be any need for runtime detection or selection; the driver and CUDA runtime handle that if appropriate versions are available.

lindstro commented 2 years ago

I must admit to not being familiar with the issues, and it would take some time to go through the NVIDIA documentation to figure out what needs to be done. Do you know of any projects where this is handled that we could use as a template?

tpwrules commented 2 years ago

I did some research, and very recent versions of CMake know how to handle this automatically: https://cmake.org/cmake/help/v3.24/prop_tgt/CUDA_ARCHITECTURES.html#prop_tgt:CUDA_ARCHITECTURES . I understand that requiring such recent versions is not an attractive proposition, but the implementation is a reference for the proper logic in CMake and is license-compatible. I would encourage defining CUDA_ARCHITECTURES=all by default. Defining more architectures allows more target-specific optimization and better performance; defining fewer reduces binary size and compilation time (though neither is at all significant now compared to the big ML frameworks!). If not enough are defined, users simply won't be able to use zfp (and current versions will fail quite mysteriously).
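As a sketch of what opting in might look like, the snippet below sets CUDA_ARCHITECTURES=all only when the running CMake is new enough to understand the keyword, and otherwise leaves the compiler's default in place. The project name and the ZFP_WITH_CUDA option are taken from zfp's existing build convention; the version-gating logic itself is only an illustration, not zfp's actual CMakeLists.txt.

```cmake
# Illustrative only: gate the "all" keyword on the CMake version,
# since CUDA_ARCHITECTURES=all is understood starting with CMake 3.23.
if(ZFP_WITH_CUDA)
  enable_language(CUDA)
  if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.23"
     AND NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
    set(CMAKE_CUDA_ARCHITECTURES all)
  endif()
endif()
```

A user who wants a lean development build can still override this on the command line with `-DCMAKE_CUDA_ARCHITECTURES=native`.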

For now, it is sufficient to use something like -DCMAKE_CUDA_FLAGS= to pass the required flags to the CUDA compiler. There are various online references such as here (though I think its description of the -arch flag is wrong); I confirmed that adding all the -gencode options listed under "Sample flags for generation on CUDA 11.0" makes my scenario of building against CUDA 11.6 but running on driver version 470 (which shipped with CUDA 11.4) work correctly.
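For reference, an invocation along these lines is what I mean; the exact -gencode pairs depend on your CUDA toolkit and target hardware, so treat this list as an example rather than the canonical set. Each `code=sm_XX` entry embeds a cubin for that compute capability, and the final `code=compute_80` entry embeds PTX so future GPUs can JIT-compile.

```shell
# Example only: cubins for several compute capabilities plus PTX for
# forward compatibility. Adjust the list to your toolkit and GPUs.
cmake .. -DZFP_WITH_CUDA=ON \
  -DCMAKE_CUDA_FLAGS="-gencode arch=compute_52,code=sm_52 \
    -gencode arch=compute_60,code=sm_60 \
    -gencode arch=compute_61,code=sm_61 \
    -gencode arch=compute_70,code=sm_70 \
    -gencode arch=compute_75,code=sm_75 \
    -gencode arch=compute_80,code=sm_80 \
    -gencode arch=compute_80,code=compute_80"
```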

The defaults for these flags and the allowed and suggested combinations depend on the CUDA runtime built against and what capabilities your code needs. CMake's new options know how to handle these factors best. Since you support a wide variety of CUDA versions, I'm not sure it would be wise to explicitly list these flags. Maybe it would be good for now to simply reference some information in the documentation to allow people to select for best performance in their particular environment. I imagine an HPC administrator for instance would want to select all the architectures for hardware known to be installed in their machines to get maximum performance, instead of letting the CUDA compiler pick arbitrarily and constantly paying JIT costs and potentially being unable to take full advantage of their system.

Once you are comfortable mandating CMake 3.23, the default experience can be nicer: a good set of architectures will be selected by default, and an interested administrator can simply list compute capabilities instead of cross-referencing the compiler documentation and managing supported flags directly.

lindstro commented 2 years ago

Thanks for sharing. We will be taking a closer look over the coming weeks. I don't think we're ready to mandate CMake 3.23 yet, but we could perhaps support this via CMake conditionals.

Just to be sure I understand the use case: compiling for multiple CUDA architectures makes sense when distributing binaries (e.g., RPMs, Spack packages) and when building and installing zfp on file systems shared by multiple architectures. If you're building zfp from source for a particular architecture, then you'd pay a penalty in inflated binaries and startup time for no benefit, so the default ought to be to build only for the current architecture.

When does the JIT compilation occur: at load time or when the first CUDA kernel is launched? It all gets baked into the same binary, right? I imagine something similar would be needed for HIP and SYCL.

tpwrules commented 2 years ago

NVIDIA's documentation here answers these questions. Note that what NVIDIA calls "cubin" the CMake docs call "real", and what NVIDIA calls "PTX" CMake calls "virtual". All kernel versions are baked into the same binary, and object loading (with possible JIT compilation) occurs when a particular CUDA kernel is first launched. I'm not sure how different kernels are treated. I've never used HIP or SYCL, so I can't offer advice there, sorry; I presume those would always have to JIT.

What seems to be the case, but they don't say there (and what was the root cause of this bug), is that PTX is specific to the CUDA library version: if PTX compiled by a newer CUDA library must be loaded by a driver shipped with an older one, the JIT may fail (the other way around always works). CMake's native setting compiles real-architecture code for the current GPU(s), so this is not a problem there, but it is worth mentioning.

But I do want to say that depending on host details like that upsets reproducibility and makes me unhappy as a packager. I understand the motivations for making native the default, though if you do so, this fact should be called out loudly in the documentation.

I did some testing: compared to whatever the CUDA compiler defines as the default architecture, passing the 9 architecture flags I linked earlier inflates compilation time by 2.5x (21 s to 54 s) and the size of libzfp.so by 4.4x (1.2 MB to 5.3 MB). There is no runtime cost for adding more architectures. Compared to the big ML frameworks, where building only the native architecture saves hours and hundreds of megabytes, and weighed against the risk of binaries breaking unexpectedly when moved across machines and the benefit of reproducibility, making native the default yields very small absolute savings and is IMHO a false economy. A developer who rebuilds the package constantly can easily set the CMake flag to native to save time if necessary.

robertu94 commented 4 months ago

I think the fixes in this patch will also address this issue: https://github.com/LLNL/zfp/pull/232