Thank you very much for your extensive tests.
I have two comments.
git cherry-pick 980c2a656
might fix the problem. See https://github.com/3dem/relion/commit/980c2a656d5ca2a3bb80cc1152e542f2b189589d.
Thanks -- after the current jobs finish, I'll do a "clean" 11.6 install and try your suggestions. Hopefully your second suggestion works because it sounds more future-proof for when CUDA 12 / Hopper / Lovelace get released.
Remember newer hardware is backward-compatible with earlier binaries as long as the binary contains PTX (RELION does). You don't have to use the latest SDK for compilation. You need the latest SDK only when you want to use the latest hardware features (which RELION does not use anyway).
You cannot use older drivers for new cards, but you can use older SDKs for compilation.
Thanks for the additional advice on backwards compatibility. I have tested both of your suggestions and they both work. The test environments are:
1. The machine has CUDA 11.6
2. RELION 3.0.8 & 3.1.2 built with CUDA 11.6 with modified 980c2a6
This patch works but the hash before the comment should be a double forward slash:
Build fails:
#define CUB_NS_QUALIFIER ::cub # for compatibility with CUDA 11.5
Build succeeds:
#define CUB_NS_QUALIFIER ::cub // for compatibility with CUDA 11.5
This line is in the RELION 3.1.3 and 4.0 beta code, but not in previous versions, which perfectly explains all the build failures in my original test.
Out of curiosity, I tested whether there is forward compatibility. The job failed, but I'll still leave the key error message here in case someone in the future accidentally runs code that is built on a newer CUDA SDK than they have on their machine:
3. The machine has CUDA 10.2
ERROR: CUDA driver version is insufficient for CUDA runtime version in /home/dbsganl/linux/relion-3.0.8_cub/src/ml_optimiser_mpi.cpp at line 128 (error-code 35)
This patch works but the hash before the comment should be a double forward slash:
Correct. We had to cherry-pick https://github.com/3dem/relion/commit/554e0ed993e5ac8a3fee4be7c5cf64a62216a8c7 as well (this fixes the "#" sign).
I installed CUDA 11 to use the newer A40 (Ampere) GPUs. To make a long story short, it seems that some versions of RELION cannot be built with CUDA 11.6, but are okay with 11.2. The following summarizes which combinations work and which don't:
Environment (Overall):
Environment (Build = SUCCESS):
Environment (Build = SUCCESS):
Environment (Build = FAILURE):
P.S. I know that RELION 3.0.8 is no longer supported, but I need to keep it functional for my lab's current workflows; it also serves as a "control" for any future workflows. Hopefully, this bug report is useful to anyone else who needs to maintain legacy software, especially when the older pre-Ampere GPUs are no longer available.
Error message: Based on my understanding of previous bug reports, the key line contains "CUB":
The full log is: