[CI] upgrade torch to 2.3.0 and cuda to 12.1

Rhett-Ying commented 2 months ago

Description

Checklist

Please feel free to remove inapplicable items for your PR.

[ ] The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
[ ] I've leveraged the tools to beautify the Python and C++ code.
[ ] The PR is complete and small. Read the Google eng practice (CL equals PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
[ ] All changes have test coverage
[ ] Code is well-documented
[ ] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
[ ] Related issue is referenced in this PR
[ ] If the PR is for a new model/paper, I've updated the example index here.

Changes

dgl-bot commented 2 months ago

To trigger regression tests:

@dgl-bot run [instance-type] [which tests] [compare-with-branch]; for example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot commented 2 months ago

Commit ID: e409269a7d4b9102a3b3d31c30eff5ea5b7499fb

Build ID: 1

Status: ❌ CI test failed in Stage [CPU Build].

Report path: link

Full logs path: link

mfbalin commented 2 months ago

@dgl-bot

mfbalin commented 2 months ago

We can add -DCUDA_ARCH_NAME=Auto in this file to reduce compilation time and memory use: https://github.com/dmlc/dgl/blob/9fde953d4bdb2a2d5ba4e878f31b032d46162920/tests/scripts/build_dgl.sh#L21

Hopefully, it will then compile only for the GPU architecture present in the CI. If Auto somehow does not work, we can consider using Turing instead of Auto, as the CI has an NVIDIA T4 GPU.
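
To illustrate the Auto-with-fallback idea, here is a minimal Python sketch. The helper names gpu_visible and cuda_arch_flag are hypothetical and not part of DGL; only the -DCUDA_ARCH_NAME flag and the Auto and Turing values come from this thread. It assumes that `nvidia-smi -L` is an acceptable way to check whether the build host can see a GPU.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` lists at least one GPU on this host."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No driver tools installed, so autodetection cannot succeed here.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return False
    # Each detected device is printed on a line starting with "GPU <index>:".
    return result.returncode == 0 and "GPU " in result.stdout


def cuda_arch_flag(visible: bool, fallback: str = "Turing") -> str:
    """Pick Auto when a GPU is visible, else a fixed architecture name.

    The Turing default is an assumption based on the CI using a T4.
    """
    name = "Auto" if visible else fallback
    return f"-DCUDA_ARCH_NAME={name}"


if __name__ == "__main__":
    print(cuda_arch_flag(gpu_visible()))
```

The point of the fallback is that a build container without GPU access would otherwise get no benefit from Auto, so pinning the name keeps the build limited to one architecture either way.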

dgl-bot commented 2 months ago

Commit ID: c8b48e953303192ecba58d585988328708ef2d26

Build ID: 2

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

mfbalin commented 2 months ago

I guess Auto is already the default value. However, autodetection seems to be failing in the CI, so the build falls back to all known architectures:

-- Running GPU architecture autodetection
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
CMake Warning at cmake/modules/CUDA.cmake:84 (message):
  Running GPU detection script with nvcc failed:
Call Stack (most recent call first):
  cmake/modules/CUDA.cmake:161 (dgl_detect_installed_gpus)
  cmake/modules/CUDA.cmake:235 (dgl_select_nvcc_arch_flags)
  CMakeLists.txt:276 (dgl_config_cuda)

CMake Warning at cmake/modules/CUDA.cmake:89 (message):
  Automatic GPU detection failed.  Building for all known architectures
  (50;60;70;75;80;86;89;90).
Call Stack (most recent call first):
  cmake/modules/CUDA.cmake:161 (dgl_detect_installed_gpus)
  cmake/modules/CUDA.cmake:235 (dgl_select_nvcc_arch_flags)
  CMakeLists.txt:276 (dgl_config_cuda)
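
To make the cost of that fallback concrete, here is a small Python sketch that mirrors the behaviour visible in the log. It is not the actual logic in cmake/modules/CUDA.cmake; the function names are hypothetical, and the architecture list is copied from the warning message above. The `-gencode arch=compute_XX,code=sm_XX` form is standard nvcc syntax.

```python
# Architectures listed in the CMake warning above.
KNOWN_ARCHS = ["50", "60", "70", "75", "80", "86", "89", "90"]


def select_archs(detected):
    """Return (architectures, used_fallback).

    `detected` is the list of architectures reported by the detection
    step, or an empty list / None when detection failed.
    """
    if detected:
        return sorted(set(detected)), False
    # Detection failed: build for every known architecture.
    return list(KNOWN_ARCHS), True


def gencode_flags(archs):
    """Build one nvcc -gencode flag per target architecture."""
    return [f"-gencode arch=compute_{a},code=sm_{a}" for a in archs]


if __name__ == "__main__":
    for detected in (["75"], []):
        archs, fallback = select_archs(detected)
        flags = gencode_flags(archs)
        print(f"fallback={fallback}: {len(flags)} device target(s)")
```

With a working detection on a T4 the build has a single device target, while the fallback compiles every kernel for eight targets, which is consistent with the longer build times and higher memory use discussed here.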

dgl-bot commented 2 months ago

Commit ID: 6d25ee7952b0666f8f41362ca98d62d29e04ec72

Build ID: 3

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

dgl-bot commented 2 months ago

Commit ID: 789a23c697b497151ab28078367ef5917a59a79b

Build ID: 4

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

dgl-bot commented 2 months ago

Commit ID: ff86d00a20408a1b0e69c43efd48f7c783a01977

Build ID: 5

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

dgl-bot commented 2 months ago

Commit ID: 543ae47f0f0be0f79bd83a47e061b7b782150cba

Build ID: 6

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

dgl-bot commented 2 months ago

Commit ID: 000cd893821540372930cfffe29e886600e06ab7

Build ID: 7

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

mfbalin commented 2 months ago

Can we pass TORCH_CUDA_ARCH_LIST to dgl_sparse and tensoradapter the same way we do for graphbolt? The most recent errors may be due to dgl_sparse. We could refactor the logic that sets TORCH_CUDA_ARCH_LIST out of graphbolt so that it can be reused in dgl_sparse and tensoradapter.
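
A rough Python sketch of what a shared helper could look like follows. It is only an illustration of the idea, not how graphbolt sets the variable today: the function names are hypothetical, and it assumes the selected architectures are available as a CMake-style list such as 75 or 50;60;70 and that TORCH_CUDA_ARCH_LIST accepts space-separated major.minor values.

```python
import os


def to_torch_arch_list(cmake_archs: str) -> str:
    """Convert a CMake-style list like "70;75" to "7.0 7.5".

    The last digit is treated as the minor version, the rest as the major.
    """
    parts = [a.strip() for a in cmake_archs.split(";") if a.strip()]
    return " ".join(f"{a[:-1]}.{a[-1]}" for a in parts)


def build_env(cmake_archs: str, base_env=None) -> dict:
    """Return a copy of the environment with TORCH_CUDA_ARCH_LIST set.

    The same dict could be passed to the build of each sub-project
    (graphbolt, dgl_sparse, tensoradapter) so they agree on the targets.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_CUDA_ARCH_LIST"] = to_torch_arch_list(cmake_archs)
    return env


if __name__ == "__main__":
    env = build_env("75")
    for component in ("graphbolt", "dgl_sparse", "tensoradapter"):
        print(component, "->", env["TORCH_CUDA_ARCH_LIST"])
```

Computing the value once and handing it to every sub-project avoids the situation where one component builds for a single architecture while another still builds for all of them.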

dgl-bot commented 2 months ago

Commit ID: a283843f9b40d9d37560036f0d17f5eaedea2404

Build ID: 8

Status: ❌ CI test failed in Stage [CPU Build].

Report path: link

Full logs path: link

dgl-bot commented 2 months ago

Commit ID: b5748c11e4783fda867d0a5087892ac99a10d3fb

Build ID: 9

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

dgl-bot commented 2 months ago

Commit ID: 452b03765078e62b2c09c4b62974d1921563ece3

Build ID: 10

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 2 months ago

Commit ID: 5a5cf6e3cd042719d39f5aa24b67f85a188588b4

Build ID: 11

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 2 months ago

Commit ID: 5d16a859f7016d91be49a5c90473ce57c6231128

Build ID: 12

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 2 months ago

Commit ID: 9be3651d3cf42319b9d4011cc54e3acddb788458

Build ID: 13

Status: ❌ CI test failed in Stage [Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot commented 2 months ago

Commit ID: 891e0fd7794c7a6afebad64ae202c32f574f2dee

Build ID: 14

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot commented 2 months ago

Commit ID: 2003514dc45d32d205d506e65a310df041ba57a1

Build ID: 15

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link
