Missing JSON objects for some components with CUDA major split archives

NVIDIA / build-system-archive-import-examples

Examples for importing precompiled binary tarball and zip archives into various build and packaging systems

MIT License

Missing JSON objects for some components with CUDA major split archives #6

Closed kmittman closed 4 months ago

kmittman commented 1 year ago

For example: cuDNN redistrib_8.9.1.23 has both CUDA 11.x and CUDA 12.x tarballs.

The JSON manifest includes references to

"relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda12-archive.tar.xz",

but is missing

"relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz",

Related to the bin-archive v3 format item:

Special cases for archive split
- NVIDIA driver sub-components
- by CUDA Major version (CUDA Minor version compatibility)

kmittman commented 1 year ago

I think the cleanest fix is to update the schema ...

For example option A with an array

    "version": "8.9.1.23",
    "linux-x86_64": [
      {
        "relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz",
        "sha256": "a6d9887267e28590c9db95ce65cbe96a668df0352338b7d337e0532ded33485c",
        "md5": "56a15f6a9b85b0be2f005a1e3715d506",
        "size": "903887852"
      },
      {
        "relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda12-archive.tar.xz",
        "sha256": "35163c5c542be0c511738b27e25235193cbeedc5e0e006e44b1cdeaf1922e83e",
        "md5": "fe41922f07a13da7b1593639adb0e32c",
        "size": "903519652"
      }
    ],

Or option B with a key

    "version": "8.9.1.23",
    "linux-x86_64": {
      "cuda11": {
        "relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz",
        "sha256": "a6d9887267e28590c9db95ce65cbe96a668df0352338b7d337e0532ded33485c",
        "md5": "56a15f6a9b85b0be2f005a1e3715d506",
        "size": "903887852"
      },
      "cuda12": {
        "relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda12-archive.tar.xz",
        "sha256": "35163c5c542be0c511738b27e25235193cbeedc5e0e006e44b1cdeaf1922e83e",
        "md5": "fe41922f07a13da7b1593639adb0e32c",
        "size": "903519652"
      }
    },

Unfortunately updating the schema will break existing scripts, such as the parse_redistrib.py one included in this repo.
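
To make the breakage concrete, here is a minimal sketch, assuming an existing script reads each platform entry as a flat object with a direct "relative_path" key (the actual parse_redistrib.py may differ). The helper `archives_for_platform` is hypothetical and only illustrates one way a parser could tolerate the current schema plus options A and B.

```python
def archives_for_platform(platform_entry):
    """Return (variant, archive) pairs for one platform entry.

    Hypothetical helper, not taken from parse_redistrib.py.
    """
    if isinstance(platform_entry, list):
        # Option A: a list of archive objects, no variant key available.
        return [(None, archive) for archive in platform_entry]
    if "relative_path" in platform_entry:
        # Current flat schema: the platform entry is the archive itself.
        return [(None, platform_entry)]
    # Option B: an object keyed by variant name such as "cuda11".
    return list(platform_entry.items())


# Trimmed archive objects (hashes and sizes omitted for brevity).
CUDA11 = {"relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz"}
CUDA12 = {"relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda12-archive.tar.xz"}

flat = CUDA12
option_a = [CUDA11, CUDA12]
option_b = {"cuda11": CUDA11, "cuda12": CUDA12}

# A flat lookup, as an existing script might do, fails on both new shapes.
for shape in (option_a, option_b):
    try:
        shape["relative_path"]
    except (KeyError, TypeError) as error:
        print(type(error).__name__)

# The tolerant helper handles all three shapes.
for shape in (flat, option_a, option_b):
    print([variant for variant, _ in archives_for_platform(shape)])
```

Note that with option A the helper cannot report a variant name at all, since the list carries no key.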

Another open question is whether it makes sense to apply this retroactively or only going forward.

SomeoneSerge commented 1 year ago

Option A, if chosen, definitely needs a discriminator field, e.g.

    "version": "8.9.1.23",
    "linux-x86_64": [
      {
        "relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz",
        "cuda": "11",
        ...
      },
      {
        "relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda12-archive.tar.xz",
        "cuda": "12",
        ...
      }
    ],

Otherwise the official interface becomes "parse the URL to infer CUDA version". This applies to strings like "cuda12" to some extent as well.
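
A rough sketch of the difference, assuming the "cuda" discriminator field from the snippet above. Both function names are hypothetical, and the regular expression simply mirrors the file naming seen in this thread, which is exactly the kind of implicit contract a consumer would otherwise depend on.

```python
import re

# Trimmed option A entries with the proposed "cuda" discriminator field.
archives = [
    {
        "relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz",
        "cuda": "11",
    },
    {
        "relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda12-archive.tar.xz",
        "cuda": "12",
    },
]


def select_by_field(archives, cuda_major):
    """Pick the archive whose explicit "cuda" field matches."""
    for archive in archives:
        if archive.get("cuda") == cuda_major:
            return archive
    return None


def select_by_filename(archives, cuda_major):
    """Infer the CUDA major version from the file name (fragile)."""
    pattern = re.compile(r"_cuda(\d+)-archive\.")
    for archive in archives:
        match = pattern.search(archive["relative_path"])
        if match and match.group(1) == cuda_major:
            return archive
    return None


print(select_by_field(archives, "12")["relative_path"])
print(select_by_filename(archives, "11")["relative_path"])
```

The second function keeps working only as long as the file naming convention never changes, whereas the first relies on a documented field.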

Is there any possibility that cudnn may later impose even more complex constraints?

leofang commented 1 year ago

I have a slight preference for the following approach: adding an additional top-level field cuda_ver (final name TBD) to indicate how many variants one should expect below:

    "version": "8.9.1.23",
    "cuda_ver": ["11", "12"],  # not sure if minor versions should be allowed here, TBD
    "linux-x86_64": {
      "cuda11": {
        "relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz",
        ...
      },
      "cuda12": {
        "relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda12-archive.tar.xz",
        ...
      }
    }

then perhaps it is less important to pick between Option A & B. It's easier to parse this IMHO.
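
One thing such a field would allow is a simple consistency check. The following is only a sketch: it assumes the field is called "cuda_ver" (the name is explicitly TBD above) and that the variant keys are formed as "cuda" plus each listed major version. `variants_match` is a hypothetical helper.

```python
def variants_match(component, platform):
    """Return True if the platform's keys match the declared "cuda_ver" list."""
    declared = {f"cuda{major}" for major in component.get("cuda_ver", [])}
    return declared == set(component[platform])


# Trimmed component entry; the elided fields are left out.
cudnn = {
    "version": "8.9.1.23",
    "cuda_ver": ["11", "12"],
    "linux-x86_64": {
        "cuda11": {
            "relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz"
        },
        "cuda12": {
            "relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda12-archive.tar.xz"
        },
    },
}

print(variants_match(cudnn, "linux-x86_64"))
```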

SomeoneSerge commented 1 year ago

I haven't any objections to @leofang's suggestion. As far as nixpkgs is concerned, I think it's equivalent to option B: we have several package set instances, e.g. cudaPackages_11 and cudaPackages_12, which choose a JSON manifest based on the cudatoolkit semver. We already select cudnn releases based on the cudatoolkit minor version, just using some pretty clumsy logic and compatibility tables that are maintained manually (copied over from the cudnn release notes).

With the proposed change I think we'd try to just pick a manifest attribute by name, be that "${cudaMajorVersion}" == "12" or "cuda${cudaMajorVersion}" == "cuda12", likely without looking at "cuda_ver"

One question I do have about "cuda_ver" is whether the order of the list is significant and, more generally, what contract the field implies.

On a related note, @kmittman what range of compatibility guarantees are "cuda11"/"cuda12" keys meant to suggest? The cudnn manual/release notes seem to only make promises about minor versions, not entire major versions. Should the manifests maybe also include explicit compatibility metadata? I'd be happy to delete https://github.com/NixOS/nixpkgs/blob/8f7c43426a2dc5dac9d8aaa4f616c6002ded891d/pkgs/development/libraries/science/math/cudnn/releases.nix if this was an option

kmittman commented 1 year ago

Both of you have made really good points, thank you @SomeoneSerge and @leofang very much! Need some time to ponder about the best option in general for: CMake, Conda, Nixpkgs, Debian, RPM, etc.

Regarding #2, I think I could inject min/max into the template, though TBH an accurate range would be difficult to maintain: each tarball file is tagged with some key-value metadata at creation time, which is later parsed to generate the JSON manifests.

As for what that would look like, here are some proposals:

i. "cuda": { "min": "11.2.0", "max": "11.8.0" },
ii. "cuda": { "min": "12.0", "max": "12.9999" },
iii. "cuda": { "ge": "12.0.0", "lt": "13" },
iv. "minCudaVersion": "11.0", "maxCudaVersion": "11.8"
v. "cuda_min": "11", "cuda_max": "11"
vi. "depends": { "cuda": "11" },

Combining some of the current suggestions, wondering about something like

{
  "release_date": "2023-05-05",
  "release_label": "8.9.1",
  "cudnn": {
    "name": "NVIDIA CUDA Deep Neural Network library",
    "license": "cudnn",
    "license_path": "cudnn/LICENSE.txt",
    "version": "8.9.1.23",
    "cuda_ver": [
      "11",
      "12"
    ],
    "linux-x86_64": {
      "11": {
        "cuda_min": "11.2",
        "cuda_max": "11.8",
        "relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz",
        "sha256": "a6d9887267e28590c9db95ce65cbe96a668df0352338b7d337e0532ded33485c",
        "md5": "56a15f6a9b85b0be2f005a1e3715d506",
        "size": "903887852"
      },
      "12": {
        "cuda_min": "12",
        "cuda_max": "12",
        "relative_path": "cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.1.23_cuda12-archive.tar.xz",
        "sha256": "35163c5c542be0c511738b27e25235193cbeedc5e0e006e44b1cdeaf1922e83e",
        "md5": "fe41922f07a13da7b1593639adb0e32c",
        "size": "903519652"
      }
    }
  }
}
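
For consumers, a range check against "cuda_min"/"cuda_max" could look roughly like the sketch below. The comparison semantics are an assumption, since the thread does not define them: here each bound is compared only on the components it specifies, so a "cuda_max" of "12" admits any 12.x release. All function names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version such as "11.8" into the tuple (11, 8)."""
    return tuple(int(part) for part in text.split("."))


def in_range(cuda_version, cuda_min, cuda_max):
    """Check a CUDA version against inclusive min/max bounds.

    Assumption: a bound constrains only the components it spells out.
    """
    version = parse_version(cuda_version)
    low = parse_version(cuda_min)
    high = parse_version(cuda_max)
    return version[:len(low)] >= low and version[:len(high)] <= high


def select_variant(platform_entry, cuda_version):
    """Return the first variant key whose range contains cuda_version."""
    for variant, archive in platform_entry.items():
        if in_range(cuda_version, archive["cuda_min"], archive["cuda_max"]):
            return variant
    return None


# Trimmed platform entry using the bounds from the proposal above.
linux_x86_64 = {
    "11": {"cuda_min": "11.2", "cuda_max": "11.8"},
    "12": {"cuda_min": "12", "cuda_max": "12"},
}

for cuda_version in ("11.4", "12.1", "11.1"):
    print(cuda_version, select_variant(linux_x86_64, cuda_version))
```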

ConnorBaker commented 1 year ago

Just chiming in, I would absolutely love if the manifest included compatible CUDA ranges — saves me from needing to maintain them elsewhere as Serge pointed out.

Is there anything I can do to assist?

kmittman commented 1 year ago

I'm planning to start implementation work on this. I have a few open questions, @ConnorBaker, @SomeoneSerge, @leofang:

1. The v3 schema is a breaking change. For existing v2 manifests, should we:
   a. Leave them alone
   b. Update them in-place
   c. Update them with another filename
2. Any other feedback about the min/max CUDA version? For RPM/Debian, we use deps based on libcudart.so.$cudaMajor
3. Once I have something working, I'll post a generated sample JSON manifest and work on updating the Python example in this repo

kmittman commented 1 year ago

Okay, here's what I've got: redistrib_1.2.3.json

{
  "release_date": "2023-06-20",
  "release_label": "1.2.3",
  "release_product": "placeholder",
  "libplaceholder": {
    "name": "NVIDIA Placeholder",
    "license": "custom",
    "license_path": "libplaceholder/LICENSE.txt",
    "version": "1.2.3.4",
    "linux-x86_64": {
      "cuda12": {
        "relative_path": "libplaceholder/linux-x86_64/libplaceholder-linux-x86_64-1.2.3.4_cuda12-archive.tar.xz",
        "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        "md5": "d41d8cd98f00b204e9800998ecf8427e",
        "size": "1156992"
      },
      "cuda11": {
        "relative_path": "libplaceholder/linux-x86_64/libplaceholder-linux-x86_64-1.2.3.4_cuda11-archive.tar.xz",
        "sha256": "01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b",
        "md5": "68b329da9893e34099c7d8ad5cb9c940",
        "size": "1126204"
      }
    },
    "cuda_variant": [
      "12",
      "11"
    ]
  }
}

and the non-variant manifest remains mostly intact: redistrib_0.1.0.json

{
  "release_date": "2023-06-20",
  "release_label": "0.1.0",
  "release_product": "foobar",
  "libfoobar": {
    "name": "NVIDIA Foo Bar",
    "license": "custom",
    "license_path": "libfoobar/LICENSE.txt",
    "version": "0.1.0.9",
    "linux-x86_64": {
      "relative_path": "libfoobar/linux-x86_64/libfoobar-linux-x86_64-0.1.0.9-archive.tar.xz",
      "sha256": "36a9e7f1c95b82ffb99743e0c5c4ce95d83c9a430aac59f84ef3cbfab6145068",
      "md5": "7215ee9c7d9dc229d2921a40e899ec5f",
      "size": "1743028"
    }
  }
}
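
A consumer of the two sample manifests above might walk them along these lines. This is a sketch under assumptions drawn only from the samples: that "cuda_variant" is present exactly when a component is split, and that the variant keys are "cuda" plus each listed value. `iter_archives` is a hypothetical name, not the updated parse_redistrib.py.

```python
def iter_archives(manifest, platform):
    """Yield (component, variant, relative_path) for one platform.

    variant is None for components without a "cuda_variant" list.
    """
    for name, component in manifest.items():
        # Skip scalar metadata such as release_date and release_label.
        if not isinstance(component, dict) or platform not in component:
            continue
        entry = component[platform]
        variants = component.get("cuda_variant")
        if variants:
            for major in variants:
                key = f"cuda{major}"
                yield name, key, entry[key]["relative_path"]
        else:
            yield name, None, entry["relative_path"]


# Trimmed copies of the samples above (hashes and sizes omitted).
variant_manifest = {
    "release_label": "1.2.3",
    "libplaceholder": {
        "version": "1.2.3.4",
        "linux-x86_64": {
            "cuda12": {"relative_path": "libplaceholder/linux-x86_64/libplaceholder-linux-x86_64-1.2.3.4_cuda12-archive.tar.xz"},
            "cuda11": {"relative_path": "libplaceholder/linux-x86_64/libplaceholder-linux-x86_64-1.2.3.4_cuda11-archive.tar.xz"},
        },
        "cuda_variant": ["12", "11"],
    },
}
plain_manifest = {
    "release_label": "0.1.0",
    "libfoobar": {
        "version": "0.1.0.9",
        "linux-x86_64": {"relative_path": "libfoobar/linux-x86_64/libfoobar-linux-x86_64-0.1.0.9-archive.tar.xz"},
    },
}

for manifest in (variant_manifest, plain_manifest):
    for row in iter_archives(manifest, "linux-x86_64"):
        print(row)
```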

kmittman commented 4 months ago

This has been implemented for a while now, across a few releases of cuDNN and cuQuantum. I also went back to the old releases and uploaded fixed manifests (under new filenames). Closing.