beehive-lab / TornadoVM

TornadoVM: A practical and efficient heterogeneous programming framework for managed languages
https://www.tornadovm.org
Apache License 2.0

Total Number of Block Threads = 2147483647 (BACKEND=PTX, UBUNTU 22.04) #222

Closed zealbell closed 1 year ago

zealbell commented 1 year ago

Is it normal for the Total Number of Block Threads to be 2147483647 whenever my backend is PTX?

$ tornado --devices
WARNING: Using incubator modules: jdk.incubator.vector, jdk.incubator.foreign

Number of Tornado drivers: 1
Driver: PTX
  Total number of PTX devices  : 1
  Tornado device=1:0
        PTX -- PTX -- NVIDIA GeForce RTX 3080
                Global Memory Size: 11.8 GB
                Local Memory Size: 48.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: 2147483647 ##same value as  Integer.MAX_VALUE? 
                Max WorkGroup Configuration: [1024, 1024, 64]
                Device OpenCL C version: N/A

Meanwhile, whenever my backend is OpenCL, the Total Number of Block Threads is 1024:

$ tornado --devices
WARNING: Using incubator modules: jdk.incubator.vector, jdk.incubator.foreign

Number of Tornado drivers: 1
Driver: OpenCL
  Total number of OpenCL devices  : 1
  Tornado device=0:0
        OPENCL --  [NVIDIA CUDA] -- NVIDIA GeForce RTX 3080
                Global Memory Size: 11.8 GB
                Local Memory Size: 48.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: 1024
                Max WorkGroup Configuration: [1024, 1024, 64]
                Device OpenCL C version: OpenCL C 1.2

NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8

Originally posted by @54LiNKeR in https://github.com/beehive-lab/TornadoVM/discussions/221

jjfumero commented 1 year ago

It seems this value is correct.

On my 2060:

 tornado --devices
WARNING: Using incubator modules: jdk.incubator.foreign, jdk.incubator.vector

Number of Tornado drivers: 1
Driver: PTX
  Total number of PTX devices  : 1
  Tornado device=0:0
    PTX -- PTX -- NVIDIA GeForce RTX 2060 with Max-Q Design
        Global Memory Size: 5.8 GB
        Local Memory Size: 48.0 KB
        Workgroup Dimensions: 3
        Total Number of Block Threads: [2147483647, 65535, 65535]
        Max WorkGroup Configuration: [1024, 1024, 64]
        Device OpenCL C version: N/A

When running the deviceQuery sample from the NVIDIA CUDA Samples, we get the following (the field we are looking for is "Max dimension size of a grid size"):

./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 2060 with Max-Q Design"
  CUDA Driver Version / Runtime Version          11.6 / 11.4
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 5935 MBytes (6222970880 bytes)
  (030) Multiprocessors, (064) CUDA Cores/MP:    1920 CUDA Cores
  GPU Max Clock rate:                            1185 MHz (1.18 GHz)
  Memory Clock rate:                             5501 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)          <<< This is the equivalent value we are looking for
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.6, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS
zealbell commented 1 year ago

Alright then. I just wanted to be sure this isn't a bug, because, as I hinted earlier, with OpenCL as the backend the value reported differs from what's reported with PTX for the same device.

jjfumero commented 1 year ago

Yes, in CUDA, the block-of-threads limit is calculated differently compared to OpenCL and Level Zero.
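
To make the comparison concrete, here is a small sketch (not TornadoVM code; the values are simply the ones reported in the device listings above for the RTX 3080). It illustrates that the PTX figure corresponds to CUDA's maximum grid dimension in X, which happens to equal `Integer.MAX_VALUE`, while the OpenCL figure is the per-work-group thread limit, so the two backends are reporting different quantities:

```java
public class BlockThreadLimits {
    public static void main(String[] args) {
        // Values taken from the device listings above (RTX 3080).
        // PTX backend: maximum grid dimension in X (CUDA maxGridSize[0]).
        int ptxBlockThreads = 2147483647;
        // OpenCL backend: CL_DEVICE_MAX_WORK_GROUP_SIZE (threads per work-group).
        int openclBlockThreads = 1024;

        // The PTX value is exactly Integer.MAX_VALUE (2^31 - 1), which is why
        // it looks suspicious, but it is the real grid-dimension limit.
        System.out.println(ptxBlockThreads == Integer.MAX_VALUE);

        // The two numbers measure different things (blocks per grid vs
        // threads per block), so a mismatch between backends is expected.
        System.out.println(ptxBlockThreads != openclBlockThreads);
    }
}
```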

jjfumero commented 1 year ago

It seems this issue can be closed. Feel free to open new issues if you have more questions or find new bugs.