bytedance / lightseq

LightSeq: A High Performance Library for Sequence Processing and Generation

QuantTransformer inference error #282

Closed dearchill closed 2 years ago

dearchill commented 2 years ago

Hi, I have built the newest version of LightSeq directly from source, exported the LightSeq model using `ls_fs_transformer_ptq_export.py`, and hit the following error during inference:

```
generator config
beam size: 4
max step: 512
extra decode length(max decode length - src input length): 50
length penalty: 1
diverse lambda: 0
sampling method: beam_search
topk: 1
topp: 0.75
encoder buffer init start
encoder buffer init succeed
decoder buffer init start
decoder buffer init succeed
Traceback (most recent call last):
  File "ls_fs_transformer_ptq_export.py", line 114, in
    pb_output = pb_model.infer(src)
RuntimeError: [CUDA][ERROR] /tmp/pip-req-build-dnod68xc/lightseq/inference/model/quant_decoder.cc.cu(673): CUBLAS_STATUS_INVALID_VALUE
```

I looked up the NVIDIA official documentation for `cublasGemmStridedBatchedEx` and found that this status is returned when "the parameters m,n,k<0", but I can't see why that condition would be hit on this line. Do you have any ideas? Thanks in advance.
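For reference, the documented condition can be mimicked with a short, purely illustrative Python check (the function name `check_gemm_dims` and the status strings are hypothetical, not the actual cuBLAS API; cuBLAS performs this validation in C inside the library):

```python
# Hypothetical mirror of the documented cuBLAS parameter check:
# cublasGemmStridedBatchedEx returns CUBLAS_STATUS_INVALID_VALUE when
# m, n, or k is negative (among other conditions).
def check_gemm_dims(m: int, n: int, k: int, batch_count: int) -> str:
    """Return the status string a GEMM call with these dimensions would get."""
    if m < 0 or n < 0 or k < 0 or batch_count < 0:
        return "CUBLAS_STATUS_INVALID_VALUE"
    return "CUBLAS_STATUS_SUCCESS"

# A single negative dimension -- e.g. a miscomputed sequence length upstream --
# is enough to trip the check.
print(check_gemm_dims(512, -1, 1024, 4))   # CUBLAS_STATUS_INVALID_VALUE
print(check_gemm_dims(512, 12, 1024, 4))   # CUBLAS_STATUS_SUCCESS
```

In practice this means the dimensions passed to the batched GEMM at that line were likely corrupted before the call, rather than the call itself being at fault.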

My environment:

- CUDA Version: 10.1.243
- nvcc: V10.1.243
- libcublas.so: 10.2.1.243
- PyTorch: 1.6.0+cu101

Taka152 commented 2 years ago

That's cool, you have already tried our latest feature. About your error: try building with CUDA toolkit 11.6; some of the quantized implementations don't support CUDA 10.
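Since the quantized kernels reportedly need a newer toolkit, a quick pre-flight version comparison can save a rebuild cycle. This is a minimal sketch; the helper name `cuda_at_least` is made up here, and the `11.0` threshold is an assumption based on the advice above:

```python
def cuda_at_least(version: str, required: str = "11.0") -> bool:
    """Compare dotted CUDA version strings numerically, e.g. '10.1.243' < '11.6'."""
    def to_tuple(v: str):
        return tuple(int(x) for x in v.split("."))

    a, b = to_tuple(version), to_tuple(required)
    # Pad the shorter tuple with zeros so '11.6' compares against '11.6.0'.
    n = max(len(a), len(b))
    return a + (0,) * (n - len(a)) >= b + (0,) * (n - len(b))

print(cuda_at_least("10.1.243"))  # False: below the assumed CUDA 11 floor
print(cuda_at_least("11.6"))      # True
```

The version string to feed in would come from `nvcc --version` (or `torch.version.cuda`), and a string comparison like `"10.1" < "11.6"` would happen to work here but breaks for versions like `"9.2"`, hence the numeric tuple comparison.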


dearchill commented 2 years ago

Hi @Taka152, I tried again and it failed at the last step:

```
generator config
beam size: 4
max step: 512
extra decode length(max decode length - src input length): 50
length penalty: 1
diverse lambda: 0
sampling method: beam_search
topk: 1
topp: 0.75
encoder buffer init start
encoder buffer init succeed
decoder buffer init start
decoder buffer init succeed
batch_size-1 batch_seq_len-12
batch_token_ids: 7093, 5, 274, 8, 1565, 4742, 1859, 62, 22, 5907, 6, 2,
emb out: token-0
emb out: -0.358773, -0.26908, 0.627853, -0.986626, 1.79387, -1.25571, 1.16601, 0.627853, -0.717546, 1.97325,
emb out: token-1
emb out: 0.751778, 0.821779, 0.801805, 0.691957, 0.761404, 0.741144, 0.720942, 0.700857, 0.591249, 0.750938,
emb out: token-2
emb out: 0.819604, 0.846817, 0.868601, 1.06475, 0.718123, 0.815741, 0.550753, 0.282299, 0.728279, 0.812705,
emb out: token-3
emb out: 0.14112, 0.245481, 0.253828, 0.344976, 0.518559, 0.505312, 0.753661, 0.725527, 0.510784, 1.00662,
emb out: token-4
emb out: -0.398029, -1.19492, -0.9065, -0.0740629, 0.402685, -1.36223, 0.459194, -0.592658, -1.55929, -0.0200408,
emb out: token-5
emb out: -1.40739, -0.72485, -1.53631, -1.33338, -0.657715, -0.679081, -0.324947, 1.56701, 0.777749, -0.543282,
emb out: token-6
emb out: -0.458802, -0.20686, -0.55554, -0.693224, -1.15586, -1.4049, -1.9796, -1.26769, -0.438151, -0.750191,
emb out: token-7
emb out: 0.4776, 0.451542, 0.0475988, -0.00203449, -0.222965, -0.515426, -0.513153, -0.480218, -1.2211, -1.29987,
emb out: token-8
emb out: 0.989358, 0.990524, 0.826826, 0.870073, 0.508, 0.384811, 0.0677459, -0.0703457, -0.286718, -0.751465,
emb out: token-9
emb out: -0.305428, 1.12574, 1.13749, 0.706427, 0.639135, 1.30119, 3.15308, 1.45618, 1.42901, 0.845204,
emb out: token-10
emb out: -0.633714, -0.218695, 0.121383, 0.346807, 0.69615, 0.970254, 1.07023, 1.08582, 0.844965, 0.6291,
emb out: token-11
emb out: -0.73091, -1.46466, -0.454259, -0.341308, 0.17377, 0.0605199, 0.358573, 0.323674, 0.926592, 0.816268,
token embedding weight:
Traceback (most recent call last):
  File "ls_fs_transformer_ptq_export.py", line 115, in
    pb_output = pb_model.infer(src)
RuntimeError: [CUDA][ERROR] /data/lightseq/lightseq/inference/tools/util.cc.cu(54): cudaErrorInvalidValue invalid argument
```
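When `infer` aborts inside a CUDA kernel like this, it can help to rule out a malformed input batch on the Python side first. Below is a minimal, hypothetical pre-flight check (the helper `validate_batch` is not part of LightSeq; it assumes `src` is a list of equal-length token-id lists, as the `batch_size`/`batch_seq_len` lines in the log suggest):

```python
def validate_batch(src):
    """Basic sanity checks on a token-id batch before handing it to infer()."""
    assert len(src) > 0, "empty batch"
    lengths = {len(seq) for seq in src}
    assert len(lengths) == 1, f"ragged batch: sequence lengths {sorted(lengths)}"
    assert all(isinstance(t, int) and t >= 0 for seq in src for t in seq), \
        "token ids must be non-negative integers"
    return len(src), lengths.pop()  # (batch_size, batch_seq_len)

# The batch from the log above: batch_size-1, batch_seq_len-12.
batch = [[7093, 5, 274, 8, 1565, 4742, 1859, 62, 22, 5907, 6, 2]]
print(validate_batch(batch))  # (1, 12)
```

If the batch passes checks like these (as it appears to here), the `cudaErrorInvalidValue` more likely points at the debug dump / device-to-host copy path in `util.cc.cu` itself rather than the input.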