IBM / text-generation-inference

IBM development fork of https://github.com/huggingface/text-generation-inference
Apache License 2.0

deepseek-coder-33b-instruct model on tgis fails with flash attention and generates wrong output without flash attention #92

Closed maxdebayser closed 4 months ago

maxdebayser commented 4 months ago

Describe the bug

When I run some of the lm-eval benchmarks with deepseek on tgis, it fails in two different ways, whereas on vLLM the same model runs to completion.

On tgis, with the tgis_native engine and flash attention, generation fails on the server side and prints this error:

2024-05-09T16:35:50.424467Z  INFO text_generation_router::batcher: src/batcher.rs:581: New or updated batch #1 of size 1 (7 total toks), max new toks = 1
2024-05-09T16:35:50.473639Z  INFO text_generation_router::batcher: src/batcher.rs:641: Prefill took 49.141722ms for 1 inputs, 7 total tokens
2024-05-09T16:35:50.473665Z  INFO generate{input=["The best ice cream flavor is:"] prefix_id=None correlation_id="<none>" input_bytes=[29] params=Some(Parameters { method: Greedy, sampling: None, stopping: Some(StoppingCriteria { max_new_tokens: 1, min_new_tokens: 1, time_limit_millis: 0, stop_sequences: [], include_stop_sequence: None }), response: Some(ResponseOptions { input_text: false, generated_tokens: true, input_tokens: true, token_logprobs: true, token_ranks: true, top_n_tokens: 0 }), decoding: Some(DecodingParameters { repetition_penalty: 0.0, length_penalty: None }), truncate_input_tokens: 0 }) request_id=0 validation_time="198.411µs" queue_time="6.793µs" inference_time="49.177872ms" time_per_token="49.177872ms" total_time="49.409502ms" input_toks=7}: text_generation_router::grpc_server: src/grpc_server.rs:513: Request generated 1 tokens before MaxTokens, output 1 bytes: " "
2024-05-09T16:35:50.476500Z  INFO text_generation_router::queue: src/queue.rs:410: Chose 16 out of 16 requests from buffer, total now 16
2024-05-09T16:35:50.476534Z  INFO text_generation_router::batcher: src/batcher.rs:581: New or updated batch #2 of size 16 (361 total toks), max new toks = 1
Shard 0: ERROR:root:Prefill failed
Shard 0: Traceback (most recent call last):
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 43, in func_with_log
Shard 0:     return await func(*args, **kwargs)
Shard 0:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 131, in Prefill
Shard 0:     batch, errors = self.model.batch_type.from_pb(
Shard 0:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 150, in from_pb
Shard 0:     all_input_ids_tensor[i, input_length - r.input_length:input_length] = tokenized_input
Shard 0:     ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: RuntimeError: The expanded size of the tensor (19) must match the existing size (20) at non-singleton dimension 0.  Target sizes: [19].  Tensor sizes: [20]
Shard 0: ERROR:grpc._cython.cygrpc:Unexpected [RuntimeError] raised by servicer method [/generate.v1.TextGenerationService/Prefill]
Shard 0: Traceback (most recent call last):
Shard 0:   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
Shard 0:   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 821, in _handle_rpc
Shard 0:   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
Shard 0:   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 43, in func_with_log
Shard 0:     return await func(*args, **kwargs)
Shard 0:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 131, in Prefill
Shard 0:     batch, errors = self.model.batch_type.from_pb(
Shard 0:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 150, in from_pb
Shard 0:     all_input_ids_tensor[i, input_length - r.input_length:input_length] = tokenized_input
Shard 0:     ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: RuntimeError: The expanded size of the tensor (19) must match the existing size (20) at non-singleton dimension 0.  Target sizes: [19].  Tensor sizes: [20]
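Reading the error, it looks like the server-side tokenizer returns one more token than the `input_length` that was validated for the request. A minimal sketch of the failing assignment (the sizes and the extra-token hypothesis are illustrative, not taken from the actual batch):

```python
# Hypothetical reproduction of the assignment in flash_causal_lm.py:from_pb.
# If the tokenizer produces one more token than the request's reported input_length
# (e.g. an unexpected BOS token), the target slice is one element too short.
import torch

input_length = 19                      # length the router/validation computed
r_input_length = 19                    # r.input_length from the protobuf request
tokenized_input = torch.arange(20)     # what the server-side tokenizer actually produced
all_input_ids_tensor = torch.zeros(16, 512, dtype=torch.long)

i = 0
# Raises: RuntimeError: The expanded size of the tensor (19) must match the
# existing size (20) at non-singleton dimension 0.
all_input_ids_tensor[i, input_length - r_input_length:input_length] = tokenized_input
```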

With the hf_transformers engine, generation doesn't fail on the server side, but the input tokens returned to the client are wrong. It seems that the first token is replaced by a begin_of_sentence token:

RuntimeError: There is an unexpected difference between tokenizer and model tokens:
context_tokens=['L', 'ind', 'sey', 'Ġlike', 'Ġto', 'Ġread', 'Ġgraphic', 'Ġnovels', 'Ġbut', 'ĠN', 'atal', 'ie', 'Ġliked', 'Ġclassic', 'Ġliterature', 'Ġto', 'Ġread', '.', 'ĠN', 'atal', 'ie']
response_tokens=['<|begin▁of▁sentence|>', 'ind', 'sey', 'Ġlike', 'Ġto', 'Ġread', 'Ġgraphic', 'Ġnovels', 'Ġbut', 'ĠN', 'atal', 'ie', 'Ġliked', 'Ġclassic', 'Ġliterature', 'Ġto', 'Ġread', '.', 'ĠN', 'atal', 'ie']
Perhaps this is required: export OUTPUT_SPECIAL_TOKENS="true" && export DEFAULT_INCLUDE_STOP_SEQS="false"
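A minimal sketch of how an extra BOS token from the tokenizer could produce this mismatch (the model id and the exact add_special_tokens behaviour here are assumptions, not verified against the server code):

```python
# Hypothetical illustration: the deepseek tokenizer prepends <|begin▁of▁sentence|>
# when add_special_tokens=True, so the server-side token list starts with BOS while
# the client-side list starts with the first text token.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-33b-instruct")  # assumed model id
text = "Lindsey like to read graphic novels but Natalie liked classic literature to read. Natalie"

client_side = tok.convert_ids_to_tokens(tok(text, add_special_tokens=False)["input_ids"])
server_side = tok.convert_ids_to_tokens(tok(text, add_special_tokens=True)["input_ids"])

print(client_side[0])  # 'L'
print(server_side[0])  # '<|begin▁of▁sentence|>'
```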
maxdebayser commented 4 months ago

I've tried the 1.3b and 6.7b deepseek models, but they run without problems :thinking:

maxdebayser commented 4 months ago

The problem happened with revision 6f09197224af9638c32c01a9060e78b0cf5a4479 of the model. With revision 61dc97b922b13995e7f83b7c8397701dbf9cfd4c it doesn't happen, so it doesn't look like a tgis issue.
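A hedged snippet to compare tokenization between the two revisions; the model id and the guess that the tokenizer/BOS configuration changed between revisions are assumptions:

```python
# Compare token counts and the first tokens produced by each model revision.
from transformers import AutoTokenizer

MODEL = "deepseek-ai/deepseek-coder-33b-instruct"  # assumed model id
BAD_REV = "6f09197224af9638c32c01a9060e78b0cf5a4479"
GOOD_REV = "61dc97b922b13995e7f83b7c8397701dbf9cfd4c"

for rev in (BAD_REV, GOOD_REV):
    tok = AutoTokenizer.from_pretrained(MODEL, revision=rev)
    ids = tok("The best ice cream flavor is:")["input_ids"]
    print(rev[:8], len(ids), tok.convert_ids_to_tokens(ids)[:2])
```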