Update buildkite, manifests, github action workflows

sriharshakandala commented 7 months ago

GPU tests on the CI seem to be taking much longer (https://buildkite.com/clima/rrtmgp-ci/builds/573#018d9080-12e0-43b2-bdfb-2e03fe406ff7) compared to the latest main (https://buildkite.com/clima/rrtmgp-ci/builds/570#018d8f6a-d60b-4422-81e4-f1c938044cef)

Sbozzolo commented 7 months ago

The buildkite pipeline had several problems. I fixed them and now most jobs are twice as fast.

The GPU unit test seems to be the only one adversely affected. @sriharshakandala, do you want to have a look at this?

https://buildkite.com/clima/rrtmgp-ci/builds/582#018d9acf-9053-433b-8a76-a0593b20f8d9

charleskawczynski commented 7 months ago

Changes overall look good to me, except for a couple of items in the Project.toml

Sbozzolo commented 7 months ago

I consolidated the environments to only have perf (because that's the only one that is being run on buildkite)

Sbozzolo commented 7 months ago

@charleskawczynski do you have any idea what could be the reason behind this increase in time https://buildkite.com/clima/rrtmgp-ci/builds/592#018d9ed1-b8ac-4121-8118-2d3930baa764 compared to main?

It happens only on buildkite: @sriharshakandala ran the code on the cluster and found the same speed as main

sriharshakandala commented 7 months ago

@Sbozzolo : Please plan on including https://github.com/CliMA/RRTMGP.jl/pull/448 in this release.

Sbozzolo commented 7 months ago

I spent 3 more hours on this and narrowed the problem down to the CUDA.jl update. I can reproduce the slowdown on the cluster on the P100 when I use CUDA.jl 5.2, but it is still fast when using CUDA.jl 5.1.

Fast:

julia> CUDA.versioninfo()
CUDA runtime 12.2, local installation
CUDA driver 12.3
NVIDIA driver 535.54.3, originally for CUDA 12.2

CUDA libraries: 
- CUBLAS: 12.2.1
- CURAND: 10.3.3
- CUFFT: 11.0.8
- CUSOLVER: 11.5.0
- CUSPARSE: 12.1.1
- CUPTI: 20.0.0
- NVML: 12.0.0+535.54.3

Julia packages: 
- CUDA: 5.1.2
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.10.1+0
- CUDA_Runtime_Discovery: 0.2.3

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

Preferences:
- CUDA_Runtime_jll.version: 12.2
- CUDA_Runtime_jll.local: true

1 device:
  0: Tesla P100-PCIE-16GB (sm_60, 15.893 GiB / 16.000 GiB available)

Slow:

CUDA runtime 12.2, local installation
CUDA driver 12.3
NVIDIA driver 535.54.3, originally for CUDA 12.2

CUDA libraries: 
- CUBLAS: 12.2.1
- CURAND: 10.3.3
- CUFFT: 11.0.8
- CUSOLVER: 11.5.0
- CUSPARSE: 12.1.1
- CUPTI: 20.0.0
- NVML: 12.0.0+535.54.3

Julia packages: 
- CUDA: 5.2.0
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.11.1+0
- CUDA_Runtime_Discovery: 0.2.3

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

Preferences:
- CUDA_Runtime_jll.version: 12.2
- CUDA_Runtime_jll.local: true

1 device:
  0: Tesla P100-PCIE-16GB (sm_60, 15.893 GiB / 16.000 GiB available)

Only changes:

  [79e6a3ab] ↑ Adapt v3.7.2 ⇒ v4.0.1
  [052768ef] ↑ CUDA v5.1.2 ⇒ v5.2.0
  [0c68f7d7] ↑ GPUArrays v9.1.0 ⇒ v10.0.2
  [46192b85] ↑ GPUArraysCore v0.1.5 ⇒ v0.1.6
  [76a88914] ↑ CUDA_Runtime_jll v0.10.1+0 ⇒ v0.11.1+0

I also checked that the system runtime and the artifact runtime produce the same results.
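
For anyone who wants to repeat the comparison, here is a minimal sketch of how the two configurations could be switched. It is not the exact set of commands used above. It assumes the active Julia environment is the one the GPU tests run in, and it takes the version numbers from the listings in this comment.

```julia
# Hypothetical reproduction sketch, not the commands used in this PR.
# Assumption: the active environment is the one used to run the GPU tests.
using Pkg

# Pick the CUDA.jl release to test: "5.1.2" was fast, "5.2.0" was slow.
# The resolver adjusts Adapt, GPUArrays, and the runtime JLL as needed.
Pkg.add(name = "CUDA", version = "5.1.2")

# Pin it so later resolves do not move it back.
Pkg.pin("CUDA")

# Show the full manifest to confirm which versions were resolved.
Pkg.status(; mode = Pkg.PKGMODE_MANIFEST)

using CUDA

# Print the runtime, driver, and package versions for comparison
# with the "Fast" and "Slow" listings above.
CUDA.versioninfo()
```

After that, run the GPU unit test in the same session and compare timings; `Pkg.free("CUDA")` undoes the pin.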

@sriharshakandala do you want to take this on and investigate further?

charleskawczynski commented 6 months ago

I'm going to rebase this PR, cc @Sbozzolo

CliMA / RRTMGP.jl

Update buildkite, manifests, github action workflows #444