Closed — deroholic closed this issue 9 months ago
I'm not really equipped to debug this issue, but a quick look turns up something a little strange:
The H100 is an sm_90 device. Does this mean that sm_90 uses the col_turing format again?
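To make the question concrete, here is a minimal sketch (not the actual bitsandbytes code) of how a compute-capability-to-format mapping might look; the function name `get_format_for_capability` and the exact thresholds are assumptions for illustration only:

```python
# Hypothetical sketch: map a CUDA compute capability (major, minor) to the
# tile-format string used for kernel dispatch. The real bitsandbytes logic
# may differ; this only illustrates the question about where sm_90 lands.
def get_format_for_capability(major: int, minor: int) -> str:
    if major >= 8:
        # Ampere (sm_80/sm_86) and anything newer, including Hopper (sm_90),
        # would fall through to the Ampere format in this sketch.
        return "col_ampere"
    # Turing (sm_75) and older devices use the Turing format here.
    return "col_turing"

print(get_format_for_capability(9, 0))  # H100 is sm_90
```

Under this assumed mapping an sm_90 device would get "col_ampere", not "col_turing"; the observed error suggests the real dispatch behaves differently for Hopper.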
@deroholic I'm also running into this issue on an H100. Have you fixed this yet, and if so, what was your solution?
No. I am awaiting an official response. This is somewhat critical, as it impacts all of our plans to scale with the H100.
Hi, I was in touch with the Lambda Labs team about the same error, and this is what they found: I tried to run the notebook on an A10 and got past the error. Looking closely at the code (`~/.local/lib/python3.8/site-packages/bitsandbytes/functional.py`) that throws the `cublasLt ran into an error!` message, it has conditions that check the GPU architecture (Turing or Ampere):
```python
has_error = 0
ptrRowScale = get_ptr(None)
is_on_gpu([A, B, out])
if formatB == 'col_turing':
    if dtype == torch.int32:
        has_error = lib.cigemmlt_turing_32(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )
    else:
        has_error = lib.cigemmlt_turing_8(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )
elif formatB == "col_ampere":
    if dtype == torch.int32:
        has_error = lib.cigemmlt_ampere_32(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )
    else:
        has_error = lib.cigemmlt_ampere_8(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )
if has_error == 1:
    print(f'A: {shapeA}, B: {shapeB}, C: {Sout[0]}; (lda, ldb, ldc): {(lda, ldb, ldc)}; (m, n, k): {(m, n, k)}')
    raise Exception('cublasLt ran into an error!')
```
An A10 is built on the Ampere architecture. An H100 uses the Hopper architecture, which matches neither of the code's conditions. I think the program you are trying to run is not yet compatible with the H100. You could reach out to the developer to confirm this.
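One way to make such a gap visible is to dispatch through an explicit table and fail fast on unknown combinations instead of falling through the if/elif chain. This is a minimal sketch, not a proposed patch to bitsandbytes; the `dispatch` function and string keys are hypothetical stand-ins for the library calls above:

```python
# Hypothetical defensive dispatch: return the kernel name for a known
# (format, dtype) pair, or raise a descriptive error for anything the
# table does not cover (e.g. a future Hopper-specific format).
def dispatch(formatB: str, dtype: str) -> str:
    table = {
        ("col_turing", "int32"): "cigemmlt_turing_32",
        ("col_turing", "int8"):  "cigemmlt_turing_8",
        ("col_ampere", "int32"): "cigemmlt_ampere_32",
        ("col_ampere", "int8"):  "cigemmlt_ampere_8",
    }
    try:
        return table[(formatB, dtype)]
    except KeyError:
        raise NotImplementedError(
            f"No igemmlt kernel for format {formatB!r} / dtype {dtype!r}; "
            "this GPU architecture may be unsupported."
        )
```

With a fall-through chain, an unrecognized format silently leaves `has_error = 0` and the failure surfaces later as the opaque cuBLASLt error; a table lookup like this would name the unsupported architecture at the point of dispatch.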
If this is actually the problem causing the error, do you have any plans to address it in the near future? Thanks.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
During the first step of training, I get a crash (trace below). I then tried to run test_modules.py, and it also fails (report below).