deep-floyd / IF


t5 = T5Embedder(device='cpu') works, very slowly. t5 = T5Embedder(device='cuda:0') causes a runtime error. #79

Closed. phalexo closed this issue 1 year ago.

phalexo commented 1 year ago

I have tried all kinds of combinations: torch 1.13.1 and 2.0.0, CUDA 11.3 and CUDA 11.8.

torch.matmul fails on the GPU.

kanttouchthis commented 1 year ago

What is the error? Edit: it runs fine on my setup with `T5Embedder(device='cuda:0')`. The model defaults to bfloat16, so maybe try specifying a different dtype: `t5 = T5Embedder(device='cuda:0', torch_dtype=torch.float16)`. Can you post the result of `python -m torch.utils.collect_env` to see if there are any issues with your install?
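One way to make that dtype choice automatic, sketched as a plain helper. The `pick_t5_dtype` name and the fallback mapping are mine, not part of this repo; `torch.cuda.is_bf16_supported()` is a real torch API, and bfloat16 GEMMs generally need an Ampere-or-newer GPU:

```python
def pick_t5_dtype(device: str, bf16_ok: bool) -> str:
    """Pick a safe dtype name for the T5 encoder.

    bfloat16 matmuls need Ampere-or-newer GPUs; on older cards cuBLAS
    raises CUBLAS_STATUS_NOT_SUPPORTED, so fall back to float16. On CPU,
    plain float32 avoids very slow half-precision emulation.
    """
    if device.startswith("cpu"):
        return "float32"
    return "bfloat16" if bf16_ok else "float16"

# With torch installed, the real call would look roughly like:
#   import torch
#   dtype = getattr(torch, pick_t5_dtype("cuda:0", torch.cuda.is_bf16_supported()))
#   t5 = T5Embedder(device="cuda:0", torch_dtype=dtype)
```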

phalexo commented 1 year ago

> What is the error? Edit: it runs fine on my setup with `T5Embedder(device='cuda:0')`. The model defaults to bfloat16, so maybe try specifying a different dtype: `t5 = T5Embedder(device='cuda:0', torch_dtype=torch.float16)`. Can you post the result of `python -m torch.utils.collect_env` to see if there are any issues with your install?

```
  166 return module._hf_hook.post_forward(module, output)

File ~/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py:530, in T5Attention.forward(self, hidden_states, mask, key_value_states, position_bias, past_key_value, layer_head_mask, query_length, use_cache, output_attentions)
    525 value_states = project(
    526     hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
    527 )
    529 # compute scores
--> 530 scores = torch.matmul(
    531     query_states, key_states.transpose(3, 2)
    532 )  # equivalent of torch.einsum("bnqd,bnkd->bnqk", query_states, key_states), compatible with onnx op>9
    534 if position_bias is None:
    535     if not self.has_relative_attention_bias:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedExFix(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
```

kanttouchthis commented 1 year ago

> RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedExFix(...)`

Have you tried manually installing libcublas? How much RAM/VRAM do you have? According to this thread for a different model, this error can be a disguised OOM error.

kanttouchthis commented 1 year ago

If you are trying to run the encoder and all 3 stages on GPU, you would need ~23 GB for the models alone, plus the memory required to actually run them, so a GPU with 24 GB probably isn't enough. With the diffusers implementation you should be able to run it just fine, though, if you enable CPU offload, assuming you have enough (~32 GB) system memory.
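A sketch of that memory arithmetic, plus the diffusers offload route mentioned above. The `fits_in_vram` helper and its 2 GB activation headroom are my own illustration; the loader uses the standard diffusers API (`DiffusionPipeline.from_pretrained` and `enable_model_cpu_offload`) with the `DeepFloyd/IF-I-XL-v1.0` Hub checkpoint, and needs the `diffusers` package, a CUDA GPU, and an accepted model license:

```python
def fits_in_vram(model_gb, vram_gb, headroom_gb=2.0):
    """Crude check: do the summed fp16 weights plus activation headroom fit?"""
    return sum(model_gb) + headroom_gb <= vram_gb

# Figures from the thread: the T5 encoder plus all three stages are ~23 GB of
# weights, so a single 24 GB card leaves no room to actually run them.
assert not fits_in_vram([23.0], 24.0)

def load_stage1_with_offload():
    # the diffusers route: keep only the active submodule on the GPU and
    # park the rest in system RAM (hence the ~32 GB system-memory estimate)
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
    )
    pipe.enable_model_cpu_offload()
    return pipe
```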

phalexo commented 1 year ago

> If you are trying to run the encoder and all 3 stages on GPU, you would need ~23 GB for the models alone, plus the memory required to actually run them, so a GPU with 24 GB probably isn't enough. With the diffusers implementation you should be able to run it just fine, though, if you enable CPU offload, assuming you have enough (~32 GB) system memory.

I am running the pipeline across 3 or 4 GPUs; each has 12 GB, and looking at VRAM utilization, it seems OK.

Changing the data type to float16 fixed it.

Thanks for the suggestion.