CHIP-SPV / chipStar

chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs.

cucc is a fork bomb if a symlink called nvcc pointing to cucc exists #877

Closed zboszor closed 1 week ago

zboszor commented 2 weeks ago

I have a PyTorch build recipe in https://github.com/zboszor/meta-python-ai which is very Intel-oriented at the moment, using MKL and oneDNN.

I also found that at least a few dependents of PyTorch (like yolov5 and ultralytics) use a very simple check for GPU acceleration that only decides between CUDA and CPU (a.k.a. "our way, or the highway"). It doesn't matter to them whether PyTorch itself was built with Vulkan, OpenCL or ROCm support.

So, I wanted to try to use chipStar to fake CUDA support in PyTorch, using the Yocto cross-compiler framework. To ease the process, I created a symlink called nvcc that points to cucc, among other things. The WIP changes are at https://github.com/zboszor/meta-python-ai/pull/3

PyTorch tries to deduce the details of CUDA using nvcc --version, which stalls and eventually makes the system go OOM.

I looked at what it was doing and found that the number of hipcc and python subprocesses grows ad infinitum, eventually depleting the RAM in the machine.
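As an illustration of how a caller could protect itself against a version probe that never returns, here is a minimal sketch, not PyTorch's actual detection code. The function name `probe_version` and the timeout value are hypothetical. It assumes a POSIX system, because it starts the probe in its own session and kills the whole process group on timeout, so runaway descendants are stopped too.

```python
import os
import signal
import subprocess


def probe_version(cmd, timeout_s=10.0):
    """Run cmd and return its combined output, or None if it cannot start or times out.

    Hypothetical helper: the exit status is not checked, only the output is returned.
    """
    try:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            start_new_session=True,  # child becomes leader of a new process group
        )
    except OSError:
        return None  # e.g. the executable does not exist
    try:
        out, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            # Kill the child and everything it spawned in the same group.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group already exited
        proc.communicate()  # reap the child
        return None
    return out
```

A call such as `probe_version(["nvcc", "--version"], timeout_s=10.0)` would then return `None` instead of hanging while subprocesses pile up.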

The process tree is cucc -> hipcc -> nvcc (which is actually cucc) -> hipcc -> ...

I think the chipStar fork of hipcc should check whether nvcc is the same file as cucc to avoid this infinite loop.
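The check suggested above could look roughly like the following. This is a sketch only, not the code in hipcc or cucc: the function names are hypothetical, and it assumes the wrapper knows its own path. It walks a PATH-style string and skips any `nvcc` that resolves to the same file as the wrapper after following symlinks.

```python
import os


def resolves_to_same_file(candidate, wrapper):
    """Return True if both paths exist and name the same file after following symlinks."""
    try:
        return os.path.samefile(candidate, wrapper)
    except OSError:
        return False


def find_real_nvcc(wrapper_path, search_path=None):
    """Return the first executable nvcc on search_path that is not the wrapper, else None.

    Hypothetical helper: wrapper_path is the path of the running wrapper (e.g. cucc).
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, "nvcc")
        if not (os.path.isfile(candidate) and os.access(candidate, os.X_OK)):
            continue
        if resolves_to_same_file(candidate, wrapper_path):
            continue  # this nvcc is a link back to the wrapper: using it would recurse
        return candidate
    return None
```

With a symlink `nvcc -> cucc` as the only `nvcc` on the search path, `find_real_nvcc` returns `None`, so the caller can fall back to its non-NVIDIA code path instead of re-invoking itself.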

pvelesko commented 2 weeks ago

Try this branch:

https://github.com/CHIP-SPV/chipStar/pull/875

zboszor commented 2 weeks ago

> Try this branch: #875

Does the C++ rewrite of cucc still allow printing the directory for a target build of libCHIP.so while using the native build of hipcc that's a cross-compiler for any target?

Currently I have this patch against the Python version of cucc to allow this: https://github.com/zboszor/meta-python-ai/blob/nanbield-chipstar-experiment/recipes-support/chipstar/chipstar_1.1.900.bb#L123 which is then exploited in the PyTorch recipe here: https://github.com/zboszor/meta-python-ai/blob/nanbield-chipstar-experiment/recipes-python/pytorch/python3-pytorch_2.3.1.bb#L293

zboszor commented 2 weeks ago

Also:

$ git show 04ffdd331dad08cb877ff364c95a556c69cb7dfb
commit 04ffdd331dad08cb877ff364c95a556c69cb7dfb
Author: Paulius Velesko <pvelesko@pglc.io>
Date:   Fri Jun 21 11:45:48 2024 +0300

    hipcc - disable nvcc find

diff --git a/HIPCC b/HIPCC
index 480795b5..2ca39d64 160000
--- a/HIPCC
+++ b/HIPCC
@@ -1 +1 @@
-Subproject commit 480795b507ded4df00aec81a10bd553dd14b0611
+Subproject commit 2ca39d6490743d249d4b1e50ad6c133b94e517b4
$ cd HIPCC/
$ git branch -a --contains 2ca39d6490743d249d4b1e50ad6c133b94e517b4
error: no such commit 2ca39d6490743d249d4b1e50ad6c133b94e517b4

Or in other words:

$ git checkout -b cucc-cpp origin/cucc-cpp
M   HIPCC
branch 'cucc-cpp' set up to track 'origin/cucc-cpp'.
Switched to a new branch 'cucc-cpp'
$ git submodule update --init --recursive 
fatal: remote error: upload-pack: not our ref 2ca39d6490743d249d4b1e50ad6c133b94e517b4
fatal: Fetched in submodule path 'HIPCC', but it did not contain 2ca39d6490743d249d4b1e50ad6c133b94e517b4. Direct fetching of that commit failed.

Did you forget to merge your branch into HIPCC with the relevant commit?

pvelesko commented 2 weeks ago

Forgot to push the branch. Try now

zboszor commented 2 weeks ago

Thanks, I have tested it now. The call chain nvcc (symlinked to cucc) -> hipcc -> nvcc (symlinked to cucc) does not occur anymore.
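Independently of how the wrappers locate each other, a chain like this can also be cut off with a re-entrancy counter passed through the environment. The following is a sketch under stated assumptions, not something chipStar is known to do: the variable name `WRAPPER_DEPTH`, the limit, and the function `child_env` are all hypothetical.

```python
DEPTH_VAR = "WRAPPER_DEPTH"  # hypothetical environment variable name
MAX_DEPTH = 4                # hypothetical limit on nested wrapper invocations


def child_env(environ, max_depth=MAX_DEPTH):
    """Return a copy of environ with the nesting counter incremented.

    Raises RuntimeError once the counter has reached max_depth, so a wrapper
    that keeps re-invoking itself fails fast instead of forking forever.
    """
    try:
        depth = int(environ.get(DEPTH_VAR, "0"))
    except ValueError:
        depth = 0  # treat a malformed value as "not nested"
    if depth >= max_depth:
        raise RuntimeError(
            f"wrapper nesting depth {depth} reached limit {max_depth}; "
            "is nvcc a link back to this wrapper?"
        )
    env = dict(environ)
    env[DEPTH_VAR] = str(depth + 1)
    return env
```

A wrapper would pass `env=child_env(os.environ)` when spawning the next tool, so each level of nesting sees a larger counter and the chain stops at the limit.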

PyTorch looks for more things and failing to find them, its CMake build system disables using CUDA:

warning: cucc is a work-in-progress. It is incomplete and may behave incorrectly.
Please, report issues at https://github.com/CHIP-SPV/chipStar/issues.
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY) (found version "clang version 15.0.7 (https://github.com/llvm/llvm-project 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/zozo/test-yocto-4.3-gh/tmp-sicom-glibc/work/corei7-64-oe-linux/python3-pytorch/2.3.1/recipe-sysroot-native/usr/bin

.clang version 15.0.7 (https://github.com/llvm/llvm-project 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/zozo/test-yocto-4.3-gh/tmp-sicom-glibc/work/corei7-64-oe-linux/python3-pytorch/2.3.1/recipe-sysroot-native/usr/bin

")
CMake Warning at cmake/public/cuda.cmake:31 (message):
  Caffe2: CUDA cannot be found.  Depending on whether you are building Caffe2
  or a Caffe2 dependent library, the next warning / error will give you more
  info.
Call Stack (most recent call first):
  cmake/Dependencies.cmake:44 (include)
  CMakeLists.txt:754 (include)

CMake Warning at cmake/Dependencies.cmake:80 (message):
  Not compiling with CUDA.  Suppress this warning with -DUSE_CUDA=OFF.
Call Stack (most recent call first):
  CMakeLists.txt:754 (include)

So chipStar is not a complete CUDA replacement yet. Maybe chipStar and ZLUDA can join the effort.

ZLUDA seems to already have a CUDA runtime library, at least in one of the forks. But it's ROCm-only now after switching away from Intel-only.

I like the idea of chipStar, i.e. using any conformant OpenCL implementation to replace CUDA, that may also be Mesa Rusticl on any supported GPU in the long run.

pvelesko commented 2 weeks ago

> PyTorch looks for more things and failing to find them, its CMake build system disables using CUDA:

Compiling CUDA CMake projects is not well tested yet, but in terms of getting PyTorch compiled, a bigger issue would be supporting non-core HIP/CUDA APIs such as cuDNN/MIOpen.

> Maybe chipStar and ZLUDA can join the effort

ZLUDA takes a different approach from chipStar and is only looking to support AMD hardware, whereas we support Intel, AMD, Nvidia, and ARM, all to different extents of course.