karthik86248 opened this issue 1 year ago (Open)
Is this consistent/reproducible? We have seen this from time to time, but typically it's been very intermittent. Will have a look. Thanks.
Yes, it's 100% reproducible for us. Thanks for looking into it.
Was able to isolate the issue a bit more.
The issue seems to be in allocating device memory via the cuBLAS library with specific data types, e.g. float.
Our cuBLAS version is 11.11.03.
```cpp
// cublas::device_memory::allocate and the cublas::* exception types come
// from the GPUStressTest helper headers, not from the public cuBLAS API.
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <iostream>

using std::cout;
using std::endl;

int main(int argc, char *argv[])
{
    size_t matrixSizeA = 18964675584, matrixSizeB = 6192059904, matrixSizeC = 14359400064;
    printf("#### args: matrixSizeA %zu matrixSizeB %zu matrixSizeC %zu \n",
           matrixSizeA, matrixSizeB, matrixSizeC);
    try {
#if 1
        float* d_A = cublas::device_memory::allocate<float>(matrixSizeA);
        float* d_B = cublas::device_memory::allocate<float>(matrixSizeB); // fails here
        float* d_C = cublas::device_memory::allocate<float>(matrixSizeC);
#endif
        /* the below works
        int8_t* d_A = cublas::device_memory::allocate<int8_t>(matrixSizeA);
        int8_t* d_B = cublas::device_memory::allocate<int8_t>(matrixSizeB);
        int8_t* d_C = cublas::device_memory::allocate<int8_t>(matrixSizeC);
        */
        printf("DEBUG: After cublas::device_memory::allocate\n");
    } catch (cublas::cuda_exception &e) {
        cout << e << endl;
        printf("testing cublasLt fail1 \n");
        exit(-1);
    } catch (cublas::cublas_exception &e) {
        cout << e << endl;
        printf("testing cublasLt fail2 \n");
        exit(-1);
    } catch (const std::exception &e) {
        cout << e.what() << endl;
        printf("testing cublasLt fail3 \n");
        exit(-1);
    }
    printf("Success\n");
    return 0;
}
```
Output:

```
#### args: matrixSizeA 18964675584 matrixSizeB 6192059904 matrixSizeC 14359400064
std::exception: out of memory
testing cublasLt fail1
```
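For reference, a quick back-of-the-envelope footprint check (my own arithmetic, assuming `allocate<T>(n)` requests `n * sizeof(T)` bytes of device memory) shows why `float` overflows an 80 GiB card while `int8_t` fits:

```cpp
// Footprint estimate for the three matrix sizes above. Assumes each
// allocate<T>(n) call requests n * sizeof(T) bytes on the device.
#include <cstdio>

int main() {
    const size_t a = 18964675584, b = 6192059904, c = 14359400064;
    const double GiB = 1024.0 * 1024.0 * 1024.0;
    // As float (4 bytes/element): A alone is ~70.6 GiB, so it fits on an
    // 80 GiB A100, but A+B (~93.7 GiB) does not -- matching d_B throwing.
    printf("float:  A=%.1f GiB, A+B=%.1f GiB, A+B+C=%.1f GiB\n",
           a * 4 / GiB, (a + b) * 4 / GiB, (a + b + c) * 4 / GiB);
    // As int8_t (1 byte/element): all three total ~36.8 GiB and fit easily.
    printf("int8_t: A+B+C=%.1f GiB\n", (a + b + c) * 1.0 / GiB);
    return 0;
}
```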
Thanks for this additional information. I'm just back from the holiday break, so I'm taking a look this week. I see you posted the cuBLAS version, great. Please also post the CUDA version and driver version (or just the output of nvidia-smi). Thanks.
Thank you. The nvidia-smi output is below:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:17:00.0 Off |                  Off |
| N/A   36C    P0    65W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:CA:00.0 Off |                  Off |
| N/A   37C    P0    65W / 300W |      0MiB / 81920MiB |     24%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
```
Thanks - I am in the process of trying to reproduce this issue with the software version you're using. Will provide an update here ASAP.
Please try the 2.4 branch. Dave made some tweaks to the matrix sizes to address the OOM. Please let us know how it goes.
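For readers wondering what such matrix-size tweaks can look like, here is a minimal sketch (my own illustration, not the actual 2.4 change) that shrinks a square GEMM until its three matrices fit in the free device memory reported by `cudaMemGetInfo`:

```cpp
// Hypothetical sizing sketch, NOT the actual 2.4 change: pick m = n = k so
// that A, B, and C (each n*n floats) fit in the currently free device memory.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeB = 0, totalB = 0;
    if (cudaMemGetInfo(&freeB, &totalB) != cudaSuccess) return 1;

    const size_t elem = sizeof(float);  // FP32/TF32 element size
    size_t n = 200000;                  // deliberately oversized starting guess
    // Leave ~5% headroom for cuBLAS workspace and other allocations.
    while (n > 1024 && 3 * n * n * elem > freeB / 100 * 95)
        n -= 1024;

    printf("free %zu MiB -> using m=n=k=%zu\n", freeB >> 20, n);
    return 0;
}
```

Treat this only as the shape of such a fix, not its contents; the actual change in the 2.4 branch may size each test's matrices differently per data type.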
Thank you. There is a long-running task on our A100 cards. I will run the modified code after the task finishes and get back to you.
The TF32 and FP32 tests which used to fail earlier are passing now with the 2.4 branch. But the INT8 test has now started to fail. I'm also noticing that the TF32 test consumes only about 40 GB of GPU memory, while earlier it used to be around 80 GB.
The results from the tool are attached.
GPUStressTest_output.txt
Thanks for the update. Interesting - the INT8 test passed on device 0, but failed on device 1 with the same parameter values. Dave and I need to reproduce this to debug it. Concerning the memory footprint, this is an area we're looking to improve: the initial change was made to address the out-of-memory issue, and we now need to ramp the matrix sizes back up to maximize the memory footprint. Will post an update here once we have a repro in house. Thanks so much for your input.
If you have a chance, try setting `CUDA_VISIBLE_DEVICES=1` and running gst (i.e., `CUDA_VISIBLE_DEVICES=1 ./gst`). We want to ensure device 1 runs correctly when running one GPU at a time. Thanks.
If I run the tests only on device 1 (where INT8 used to fail earlier), all the tests pass now. Output attached. So it's something to do with running the tests on both devices.
Thanks for running that test. We have a fix for the multi-gpu issue that should get pushed out in the next day or two.
Ok. Will look forward to the fix. Thank you!
Was that fix pushed? I'm trying to run these tests on a node with 2 cards (A100) and still encountering this problem (perhaps I'm using the wrong branch?)
Running the GPUStress Tool on an A100 card reports the error below. However, the card seems to be healthy and working correctly per the hardware tests performed by our hardware vendor.
Command executed: `./gst -T=1`

Output:

```
./gst
capturing GPU information...
WATCHDOG starting, TIMEOUT: 600 seconds
Detected 2 CUDA Capable device(s)
Device 0: "NVIDIA A100 80GB PCIe"
Device 1: "NVIDIA A100 80GB PCIe"
Initilizing A100 80 GB based test suite TYPE=2 GPU Memory: 79, memgb: 80
Device 0: "NVIDIA A100 80GB PCIe", PCIe: 17
***** STARTING TEST 0: INT8 On Device 0 NVIDIA A100 80GB PCIe
math_type 10
#### args: matrixSizeA 34878833064 matrixSizeB 16672535724 matrixSizeC 28662757344
#### args: ta=N tb=T m=244872 n=117052 k=142437 lda=7835904 ldb=3745792 ldc=7835904
loop=1 TEST INT8 On Device 0 NVIDIA A100 80GB PCIe TEST PASSED
** TEST TIME: 24 seconds
*** STARTING TEST 1: FP16 On Device 0 NVIDIA A100 80GB PCIe
math_type 0
#### args: matrixSizeA 13000629632 matrixSizeB 13114567936 matrixSizeC 13557084032
#### args: ta=N tb=N m=115928 n=116944 k=112144 lda=115928 ldb=112144 ldc=115928
loop=1 TEST FP16 On Device 0 NVIDIA A100 80GB PCIe TEST PASSED
** TEST TIME: 17 seconds
*** STARTING TEST 2: TF32 On Device 0 NVIDIA A100 80GB PCIe
math_type 0
#### args: matrixSizeA 18964675584 matrixSizeB 6192059904 matrixSizeC 14359400064
std::exception: out of memory
testing cublasLt fail
```