feifeibear / LLMSpeculativeSampling

Fast inference from large language models via speculative decoding

Output logits do not match: question about decoding when the draft model and target model are the same #27

Open 66RING opened 3 months ago

66RING commented 3 months ago

In my opinion, the generation should be identical when the draft model and target model are the same and the temperature is 0.

But in this case, the output logits of the draft model and the target model differ slightly, even though the argmax results are the same.

THE QUESTION: why do the output logits differ when the draft and target are the same model?

To reproduce:

  1. Change the code as shown below to compare the output logits directly: p[:, prefix_len + i - 1, j] == q[:, prefix_len + i - 1, j]
diff --git a/sampling/speculative_sampling.py b/sampling/speculative_sampling.py
index 48e1f8d..c2eed70 100644
--- a/sampling/speculative_sampling.py
+++ b/sampling/speculative_sampling.py
@@ -164,10 +164,12 @@ def speculative_sampling_v2(prefix : torch.Tensor, approx_model : torch.nn.Modul
                 r = torch.rand(1, device = p.device)
                 j = x[:, prefix_len + i]

-                if r < torch.min(torch.tensor([1], device=q.device), p[:, prefix_len + i - 1, j] / q[:, prefix_len + i - 1, j]):
+                if p[:, prefix_len + i - 1, j] == q[:, prefix_len + i - 1, j]:
                     # accept, and update n
                     n += 1
                 else:
+                    print(p[:, prefix_len + i - 1, j] - q[:, prefix_len + i - 1, j])
+                    print("unexpected reject!")
                     # reject
                     t = sample(max_fn(p[:, n, :] - q[:, n, :]))
                     is_all_accept = False
  2. Launch the script:
python main.py \
    --input "One day, Lily met a Shoggoth." \
    --max_tokens 128 \
    --benchmark \
    --target_model_name nickypro/tinyllama-110M \
    --approx_model_name nickypro/tinyllama-110M
  3. "unexpected reject!" gets printed:
speculative sampling:   0%|                                                                          | 0/141 [00:00<?, ?it/s]
tensor([[-5.2199e-05]], device='cuda:0')
unexpected reject!
tensor([[0.0019]], device='cuda:0')
unexpected reject!
speculative sampling:  19%|████████████▎                                                   | 27/141 [00:00<00:00, 205.92it/s]
tensor([[0.0006]], device='cuda:0')
unexpected reject!
tensor([[-0.0019]], device='cuda:0')
unexpected reject!
speculative sampling:  35%|██████████████████████▏                                         | 49/141 [00:00<00:00, 132.50it/s]
tensor([[0.0006]], device='cuda:0')
unexpected reject!
speculative sampling:  99%|███████████████████████████████████████████████████████████████▌| 140/141 [00:01<00:00, 98.67it/s]
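For reference, the acceptance rule that the patch above replaces is r < min(1, p/q). When the draft and target probabilities are exactly equal, the ratio is 1 and r (drawn from [0, 1)) is always below it, so identical models should never hit the reject branch. A minimal plain-Python sketch (accept_prob is a hypothetical helper, not from the repo):

```python
import random

def accept_prob(p_val, q_val):
    """Speculative sampling acceptance probability: min(1, p/q)."""
    return min(1.0, p_val / q_val)

# When the draft and target probabilities are exactly equal, the ratio
# is 1 and r < 1 always holds (random.random() is in [0, 1)), so the
# token is always accepted.
p_val = q_val = 0.37
r = random.random()
assert r < accept_prob(p_val, q_val)

# A tiny numerical mismatch, like the 1e-3-scale diffs in the logs
# above, pushes the ratio just below 1, making an occasional reject possible.
print(accept_prob(0.3691, 0.3700))  # slightly below 1.0
```

Any reject observed with identical models therefore points at a numerical mismatch between the two forward passes, not at the sampling rule itself.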
haiduo commented 3 months ago

After reading your comment, I found this phenomenon very interesting, so I tried it just now and found that the output logits of the draft model and target model are indeed inconsistent. However, as long as the seed is consistent, the results are the same every run. Therefore, I suspect this may be a hardware-level or implementation-level numerical error, though the error is very small. My implementation is as follows:

[screenshot of the implementation]
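One plausible mechanism behind tiny but seed-stable differences like these is that floating-point addition is not associative, and GPU kernels can choose different reduction orders for different tensor shapes (e.g. the draft model's incremental decode steps vs. the target model's batched verification pass over the same tokens). A small plain-Python illustration with hypothetical values:

```python
# Floating-point addition is not associative: reordering a reduction
# changes the last bits of the result. Kernels launched for different
# tensor shapes may reduce in different orders, yielding tiny logit
# diffs even for the same weights and inputs.
vals = [1e8, 0.1, -1e8, 1.0]  # hypothetical values mixing magnitudes

left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]
reordered = (vals[0] + vals[2]) + (vals[1] + vals[3])

print(left_to_right)               # close to 1.1, but not exactly
print(reordered)                   # exactly 1.1
print(left_to_right == reordered)  # False
```

This is consistent with what haiduo observed: the discrepancy is deterministic for a fixed seed (the kernels and shapes are the same each run) yet nonzero between the two models' passes.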

feifeibear commented 3 months ago

Thank you both for your careful observation; these details are very helpful.

I suggest changing the comparison of the two float values from '==' to checking whether their difference is below a small tolerance, for example 1e-6.

66RING commented 3 months ago

> Thank you both for your careful observation; these details are very helpful.
>
> I suggest changing the comparison of the two float values from '==' to checking whether their difference is below a small tolerance, for example 1e-6.

That was a straightforward solution, so I changed == to torch.allclose, which has default tolerances of rtol=1e-05 and atol=1e-08, but the unexpected reject still happens.

def allclose(input: Tensor, other: Tensor, rtol: _float = 1e-05, atol: _float = 1e-08, equal_nan: _bool = False) -> _bool: ...
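For concreteness, the elementwise condition torch.allclose documents is |input - other| <= atol + rtol * |other|. With the default tolerances, a diff on the order of the logged 1.9e-3 between probabilities of ordinary magnitude is far outside that band, which is why allclose still flags the mismatch. A plain-Python restatement (the probability values are hypothetical, chosen to mirror the logged diffs):

```python
def close(a, b, rtol=1e-05, atol=1e-08):
    # Elementwise condition documented for torch.allclose:
    # |a - b| <= atol + rtol * |b|
    return abs(a - b) <= atol + rtol * abs(b)

# A diff of ~0.0019 against a probability around 0.5 gets a tolerance
# band of only 1e-08 + 1e-05 * 0.5 ≈ 5e-06, so the check fails.
print(close(0.5019, 0.5000))  # False with default tolerances
```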
haiduo commented 3 months ago

> Thank you both for your careful observation; these details are very helpful. I suggest changing the comparison of the two float values from '==' to checking whether their difference is below a small tolerance, for example 1e-6.

> That was a straightforward solution, so I changed == to torch.allclose, which has default tolerances of rtol=1e-05 and atol=1e-08, but the unexpected reject still happens.
>
> def allclose(input: Tensor, other: Tensor, rtol: _float = 1e-05, atol: _float = 1e-08, equal_nan: _bool = False) -> _bool: ...

Try it with atol=1e-03? I think this is a hardware-level numerical error.
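Sketching the suggested atol=1e-03 against the diffs printed in the logs (the probability values here are hypothetical; only the diff magnitudes come from the output above): the widened band covers the ~6e-4 mismatches, though the largest logged diff of ~1.9e-3 would still fall just outside it.

```python
def close(a, b, rtol=1e-05, atol=1e-08):
    # Elementwise condition documented for torch.allclose:
    # |a - b| <= atol + rtol * |b|
    return abs(a - b) <= atol + rtol * abs(b)

# Widening atol to 1e-03 as suggested:
print(close(0.5006, 0.5000, atol=1e-03))  # True: a 6e-4 diff is now accepted
print(close(0.5019, 0.5000, atol=1e-03))  # False: 1.9e-3 still exceeds the band
```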