karthik86248 opened this issue 1 year ago (Open)
Is this consistent/reproducible? We have seen this from time to time, but typically it's been very intermittent. Will have a look. Thanks.
Yes, it's 100% reproducible for us. Thanks for looking into it.
Was able to isolate the issue a bit more.
The issue seems to be in allocating device memory via the cuBLAS library with specific data types, e.g. float.
Our cuBLAS version is 11.11.03.
```cpp
// cublas::device_memory::allocate and the cublas::* exception types come
// from the GPUStressTest helper headers, not from the public cuBLAS API.
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <iostream>

using std::cout;
using std::endl;

int main(int argc, char *argv[])
{
    size_t matrixSizeA = 18964675584, matrixSizeB = 6192059904, matrixSizeC = 14359400064;
    printf("#### args: matrixSizeA %zu matrixSizeB %zu matrixSizeC %zu \n",
           matrixSizeA, matrixSizeB, matrixSizeC);
    try {
#if 1
        float* d_A = cublas::device_memory::allocate<float>(matrixSizeA);
        float* d_B = cublas::device_memory::allocate<float>(matrixSizeB); // fails here
        float* d_C = cublas::device_memory::allocate<float>(matrixSizeC);
#endif
        /* the below works
        int8_t* d_A = cublas::device_memory::allocate<int8_t>(matrixSizeA);
        int8_t* d_B = cublas::device_memory::allocate<int8_t>(matrixSizeB);
        int8_t* d_C = cublas::device_memory::allocate<int8_t>(matrixSizeC);
        */
        printf("DEBUG: After cublas::device_memory::allocate\n");
    } catch (cublas::cuda_exception &e) {
        cout << e << endl;
        printf("testing cublasLt fail1 \n");
        exit(-1);
    } catch (cublas::cublas_exception &e) {
        cout << e << endl;
        printf("testing cublasLt fail2 \n");
        exit(-1);
    } catch (const std::exception &e) {
        cout << e.what() << endl;
        printf("testing cublasLt fail3 \n");
        exit(-1);
    }
    printf("Success\n");
    return 0;
}
```
Output:

```
#### args: matrixSizeA 18964675584 matrixSizeB 6192059904 matrixSizeC 14359400064
std::exception: out of memory
testing cublasLt fail1
```
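For reference, a quick back-of-the-envelope footprint check (my own arithmetic, assuming `allocate<T>(n)` requests `n * sizeof(T)` bytes of device memory) shows why `float` overflows an 80 GiB card while `int8_t` fits:

```cpp
// Footprint estimate for the three matrix sizes above. Assumes each
// allocate<T>(n) call requests n * sizeof(T) bytes on the device.
#include <cstdio>

int main() {
    const size_t a = 18964675584, b = 6192059904, c = 14359400064;
    const double GiB = 1024.0 * 1024.0 * 1024.0;
    // As float (4 bytes/element): A alone is ~70.6 GiB, so it fits on an
    // 80 GiB A100, but A+B (~93.7 GiB) does not -- matching d_B throwing.
    printf("float:  A=%.1f GiB, A+B=%.1f GiB, A+B+C=%.1f GiB\n",
           a * 4 / GiB, (a + b) * 4 / GiB, (a + b + c) * 4 / GiB);
    // As int8_t (1 byte/element): all three total ~36.8 GiB and fit easily.
    printf("int8_t: A+B+C=%.1f GiB\n", (a + b + c) * 1.0 / GiB);
    return 0;
}
```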
Thanks for this additional information. I'm just back from the holiday break, so I'm taking a look this week. I see you posted the cuBLAS version, great. Please also post the CUDA version and driver version (or just the output of nvidia-smi). Thanks.
Thank you. The nvidia-smi output is below:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:17:00.0 Off |                  Off |
| N/A   36C    P0    65W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:CA:00.0 Off |                  Off |
| N/A   37C    P0    65W / 300W |      0MiB / 81920MiB |     24%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
```
Thanks - I am in the process of trying to reproduce this issue with the software version you're using. Will provide an update here ASAP.
Please try the 2.4 branch. Dave made some tweaks to the matrix sizes to address the OOM. Please let us know how it goes.
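For readers wondering what such matrix-size tweaks can look like, here is a minimal sketch (my own illustration, not the actual 2.4 change) that shrinks a square GEMM until its three matrices fit in the free device memory reported by `cudaMemGetInfo`:

```cpp
// Hypothetical sizing sketch, NOT the actual 2.4 change: pick m = n = k so
// that A, B, and C (each n*n floats) fit in the currently free device memory.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeB = 0, totalB = 0;
    if (cudaMemGetInfo(&freeB, &totalB) != cudaSuccess) return 1;

    const size_t elem = sizeof(float);  // FP32/TF32 element size
    size_t n = 200000;                  // deliberately oversized starting guess
    // Leave ~5% headroom for cuBLAS workspace and other allocations.
    while (n > 1024 && 3 * n * n * elem > freeB / 100 * 95)
        n -= 1024;

    printf("free %zu MiB -> using m=n=k=%zu\n", freeB >> 20, n);
    return 0;
}
```

Treat this only as the shape of such a fix, not its contents; the actual change in the 2.4 branch may size each test's matrices differently per data type.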
Thank you. There is a long-running task on our A100 cards. I will run the modified code after the task finishes and get back to you.
The TF32 and FP32 tests which used to fail earlier are passing now with the 2.4 branch. But the INT8 test has now started to fail. I'm also noticing that the TF32 test consumes only about 40 GB of GPU memory, while earlier it used to be around 80 GB.
The results from the tool are attached.
GPUStressTest_output.txt
Thanks for the update. Interesting - the INT8 test passed on device 0, but failed on device 1 with the same parameter values. Dave and I need to reproduce this to debug it. Concerning the memory footprint, this is an area we're looking to improve: the initial change was made to address the out-of-memory issue, and we now need to ramp the matrix sizes back up to maximize the memory footprint. Will post an update here once we have a repro in house. Thanks so much for your input.
If you have a chance, try setting `CUDA_VISIBLE_DEVICES=1` and running gst (i.e., `CUDA_VISIBLE_DEVICES=1 ./gst`). We want to ensure device 1 runs correctly when running one GPU at a time. Thanks.
If I run the tests only on device 1 (where INT8 used to fail earlier), all the tests pass now. Output attached. So it's something to do with running the tests on both devices.
Thanks for running that test. We have a fix for the multi-gpu issue that should get pushed out in the next day or two.
Ok. Will look forward to the fix. Thank you!
Was that fix pushed? I'm trying to run these tests on a node with 2 cards (A100) and still encountering this problem (perhaps I'm using the wrong branch?)
Running the GPUStress Tool on an A100 card reports the error below. However, the card seems to be healthy and working correctly per the hardware tests performed by our hardware vendor.
Command executed: `./gst -T=1`

Output:

```
./gst
capturing GPU information...
WATCHDOG starting, TIMEOUT: 600 seconds
Detected 2 CUDA Capable device(s)
Device 0: "NVIDIA A100 80GB PCIe"
Device 1: "NVIDIA A100 80GB PCIe"
Initilizing A100 80 GB based test suite TYPE=2 GPU Memory: 79, memgb: 80
Device 0: "NVIDIA A100 80GB PCIe", PCIe: 17
***** STARTING TEST 0: INT8 On Device 0 NVIDIA A100 80GB PCIe
math_type 10
#### args: matrixSizeA 34878833064 matrixSizeB 16672535724 matrixSizeC 28662757344
#### args: ta=N tb=T m=244872 n=117052 k=142437 lda=7835904 ldb=3745792 ldc=7835904
loop=1 TEST INT8 On Device 0 NVIDIA A100 80GB PCIe TEST PASSED
** TEST TIME: 24 seconds
*** STARTING TEST 1: FP16 On Device 0 NVIDIA A100 80GB PCIe
math_type 0
#### args: matrixSizeA 13000629632 matrixSizeB 13114567936 matrixSizeC 13557084032
#### args: ta=N tb=N m=115928 n=116944 k=112144 lda=115928 ldb=112144 ldc=115928
loop=1 TEST FP16 On Device 0 NVIDIA A100 80GB PCIe TEST PASSED
** TEST TIME: 17 seconds
*** STARTING TEST 2: TF32 On Device 0 NVIDIA A100 80GB PCIe
math_type 0
#### args: matrixSizeA 18964675584 matrixSizeB 6192059904 matrixSizeC 14359400064
std::exception: out of memory
testing cublasLt fail
```