Closed: didadida-r closed this issue 3 months ago.
Fix the max_tokens cache; a similar issue has already been solved, please check it carefully.
@PoTaTo-Mika Hi, I have re-pulled the latest code, but there is no updated code in llama and the recompilation issue still exists. Could you explain more clearly how to fix the max_tokens cache so that recompilation is avoided? Thanks.
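In case it is useful, the general way this kind of recompilation is avoided is to keep every shape that torch.compile sees constant across generate calls, which usually means allocating the KV cache once at a fixed max_tokens upper bound and reusing it, rather than resizing it per request. The sketch below is only an illustration of that idea; the class name, buffer sizes, and update signature are hypothetical and are not the repository's actual code.

```python
# Illustrative only: a statically sized KV cache. If max_tokens (and hence the
# cache shape) changes between generate() calls, torch.compile must re-trace;
# allocating once at a fixed MAX_TOKENS and writing in place keeps every shape
# stable, so the model compiles a single time.
import torch

MAX_TOKENS = 2048            # hypothetical fixed upper bound on sequence length
NUM_HEADS, HEAD_DIM = 8, 64  # hypothetical model dimensions

class StaticKVCache(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Buffers are allocated once at the maximum size and never re-shaped.
        self.register_buffer("k", torch.zeros(1, NUM_HEADS, MAX_TOKENS, HEAD_DIM))
        self.register_buffer("v", torch.zeros(1, NUM_HEADS, MAX_TOKENS, HEAD_DIM))

    def update(self, pos: torch.Tensor, k_new: torch.Tensor, v_new: torch.Tensor):
        # In-place writes at the given positions; output shapes stay constant,
        # so a compiled attention kernel can be reused across requests.
        self.k[:, :, pos] = k_new
        self.v[:, :, pos] = v_new
        return self.k, self.v
```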
Describe the bug
Hi, each time I run generate, the model is recompiled again. Inference does become faster after compilation (47.72 it/s), but once the recompilation time is added, the overall RTF is actually worse.
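To confirm that the slowdown really comes from torch.compile re-tracing, recent PyTorch versions (2.1+) can log every recompilation together with the reason, which is typically a changed tensor shape. This is a generic sketch, not tied to this repository; the tiny compiled function only stands in for the generate path.

```python
import torch

# PyTorch >= 2.1: print a message (with the failed guard) on every recompile.
torch._logging.set_logs(recompiles=True)

@torch.compile
def step(x):
    return torch.relu(x) * 2

step(torch.randn(1, 16))  # first call: traced and compiled
step(torch.randn(1, 32))  # new shape: a recompile is logged, analogous to what a
                          # changed max_tokens does to the compiled generate graph
```

Running with the environment variable TORCH_LOGS=recompiles gives the same output without modifying the code.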
python info:
GPU info:
To Reproduce
Expected behavior
Screenshots / log
Additional context