alibaba / rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Apache License 2.0
521 stars 48 forks

fix(src): fix bazel build special type cast and template match for cuda118 #77

Closed · khan-yin closed this 2 months ago

khan-yin commented 3 months ago

fix(src): fix bazel build special type cast and template match for cuda118

Local Environment

Build Result

bazel build

(screenshot: bazel build output)

bazel build test

(screenshots: bazel test output)

whl

(screenshot: built whl package)

CLAassistant commented 3 months ago

CLA assistant check
All committers have signed the CLA.

samaritan1998 commented 3 months ago

Can it run on CUDA 11.4?

netaddi commented 3 months ago

Hi there, thank you for your contribution! Have you tested whether the code works under CUDA 12?

khan-yin commented 3 months ago

> Can it run on CUDA 11.4?

I think it will be OK; I will test it further.

khan-yin commented 3 months ago

> Hi there, thank you for your contribution! Have you tested whether the code works under CUDA 12?

I will test it later.

netaddi commented 3 months ago

Please note that //example:test is deprecated and may not reflect the correctness of the code. If possible, please consider running

bazel test ... --build_tests_only=1 --test_tag_filters=-manual,-rocm --config=cuda12

for the full test suite under CUDA 12.
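When iterating on a single failure, the same flags can be scoped to one test package instead of the whole tree (a sketch; the target path is taken from the test list later in this thread):

```shell
# Run only the CUDA device tests, keeping the same tag filters and config
bazel test //src/fastertransformer/devices/cuda_impl/tests:all \
  --build_tests_only=1 --test_tag_filters=-manual,-rocm --config=cuda12
```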

khan-yin commented 3 months ago

@samaritan1998 @netaddi Hello, I have now tested the code with CUDA 12.2 + cuDNN 9 and with CUDA 11.4 + cuDNN 8: both build successfully and the example test passes. For the full test suite under CUDA 12, I keep failing with an error that looks memory-related; maybe my environment cannot run the whole suite? (screenshot)

netaddi commented 3 months ago

> For the full test suite under CUDA 12, I keep failing with what looks like a memory-related error; maybe my environment cannot run the whole suite?

Hi khan, thank you for your patience. The screenshot does not include the critical part of the error, so I cannot see what actually failed. However, if you think this error is caused by a lack of system memory, please try adding the option --jobs=8, which limits the number of concurrent compilation processes.
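Combined with the full invocation given earlier, that would look like the following (a sketch; 8 is just a starting point, tune it to the available RAM):

```shell
# Limit Bazel to 8 concurrent jobs to reduce peak memory use during the build
bazel test ... --build_tests_only=1 --test_tag_filters=-manual,-rocm \
  --config=cuda12 --jobs=8
```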

khan-yin commented 3 months ago

> If you think this error is caused by a lack of system memory, please try adding the option --jobs=8, which limits the number of concurrent compilation processes.

Thanks a lot! That really helps. By the way, do all the LLM model weights need to be downloaded for the tests? I am not sure I have enough storage, and git lfs pull would take a long time.

netaddi commented 3 months ago

> Do all the LLM model weights need to be downloaded for the tests? I am not sure I have enough storage.

No. I think it would be fine if all unit tests (...) pass. We have internal verification steps that run after the pull request is approved.

khan-yin commented 3 months ago

Hello, 60/74 tests now pass, and all of the src op tests pass. The remaining 14 cases are listed below:

FAILED:
//maga_transformer/async_decoder_engine/test:async_model_test
//maga_transformer/async_decoder_engine/test:decoder_engine_test
//maga_transformer/async_decoder_engine/test:rpc_model_test
//maga_transformer/cpp/test:gpt_model_test
//maga_transformer/models/test:llama_test
//maga_transformer/server/test:inference_worker_test
//maga_transformer/test:async_gather_batch_test
//maga_transformer/test:slice_stop_word_list_test 
//maga_transformer/test:template_test 
//maga_transformer/utils/test:ckpt_database_test 
//maga_transformer/utils/test:incremental_decode_test 
//maga_transformer/utils/test:model_weights_loader_test 

TIMEOUT:
//src/fastertransformer/devices/cuda_impl/tests:cuda_dist_test 

FAILED in src:
//src/fastertransformer/devices/cuda_impl/tests:cuda_attention_op_test

Details:
src/fastertransformer/devices/cuda_impl/tests/ops/CudaAttentionOpTest.cc:457: Failure
Value of: static_cast<CudaDevice*>(device_)->use_multi_block_mode
  Actual: false
Expected: true

[----------] Global test environment tear-down
[==========] 8 tests from 1 test suite ran. (4723 ms total)
[  PASSED  ] 6 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] CudaAttentionOpTest.MultiBlockSelfAttentionOpTest
[  FAILED  ] CudaAttentionOpTest.LongSeqMultiBlockSelfAttentionOpTest


Possible reasons:

  1. FAILED cases: I have trouble downloading the weights through Git LFS (the repository is over its data quota), so these failures mostly come from missing weights; maybe you can cover them in the internal tests? Loading tokenizer.model works fine. For example:
    $ git lfs pull --include="*.model"
    batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
  2. TIMEOUT case: I only have 1 GPU on this machine, but cuda_dist_test needs multiple devices.
  3. FAILED in src case: I don't think this failure is related to my changes? It seems my environment does not support multi-block attention.

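For the LFS quota issue, one possible workaround is to pull only the small files and explicitly skip the large weight shards (a sketch; the exclude patterns are assumptions about the checkpoint layout, not confirmed against this repository):

```shell
# Fetch only tokenizer files; skip large weight shards (patterns are guesses)
git lfs pull --include="*.model" --exclude="*.bin,*.safetensors"
```

This does not restore quota that is already exhausted, but it keeps future pulls small.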
netaddi commented 2 months ago

Thanks khan, we have verified the correctness of your code and merged your pull request.

khan-yin commented 2 months ago

> Thanks khan, we have verified the correctness of your code and merged your pull request.

Thank you! Maybe I could apply to become a collaborator and contribute more; I have recently become interested in MLSys and C++/CUDA. 🤣