ExLlamaV2 integration: Add logit mask interface

I've been optimizing filters in ExLlama. Turns out the top-K sampler didn't like it when too many logits were -inf, but after addressing that, converting the passed token set from a Rust fixedbitset to a Python list to a C++ vector remains a significant bottleneck.

So I added an interface for masking the logits directly in the filter, and this PR adds the relevant functions to the ExLlama integration. It boils down to:

    def can_mask_logits(self) -> bool:
        return True

    def prepare_logit_mask(self):
        self._formatter.compute_allowed_tokens()
        return True

    def mask_logits(self, logits: torch.Tensor) -> torch.Tensor:
        if self._formatter.is_completed():
            if self.eos_logits is None:
                self.eos_logits = torch.full_like(logits, float("-inf"))
                self.eos_logits[self.tokenizer.eos_token_id] = 0
            return self.eos_logits
        return self._formatter.mask_logits(logits)

can_mask_logits() returns False in the base class, so this should be forwards and backwards compatible. Currently only used by the ExLlamaV2 dev branch.

The new interface doesn't allow for skipping sampling altogether when only a single token passes, which means there's no more negative overhead, but it's still significantly faster overall. Depending on the vocab size, added latency is down to 0.066 ms in my measurements, though I can't get your benchmark to report sensible results. Not sure if I'm doing something wrong or if it's miscounting tokens or what. :shrug:

At any rate, it could probably be reduced a little further if the logits tensor could be updated in-place in KBNF, but I couldn't find any exposed function for that.

though I can't get your benchmark to report sensible results. Not sure if I'm doing something wrong or if it's miscounting tokens or what. 🤷

Could you expand a bit on what is not sensible? I will take a look at your forked repo later as well.

At any rate, it could probably be reduced a little further if the logits tensor could be updated in-place in KBNF, but I couldn't find any exposed function for that.

@turboderp I think it is more complicated because moving indices tensor from CPU to GPU is not a fast operation either. On the other hand, torch have cuda memory cache so allocating a new tensor is not very slow. The current kbnf implementation updates logtis in-place if we only have a few disallowed tokens(so a small indices tensor). Otherwise, it allocates a new tensor filled with -inf and copy the allowed tokens logits to it to avoid copying a large tensor from CPU to GPU. Outlines always allocates a new tensor filled with -inf and copy the allowed token logits, which, I suspect, is why they run relatively slower on address_json which has a lot of allowed tokens due to the str type fields.

The PR itself is good and I will merge it.

Could you expand a bit on what is not sensible? I will take a look at your forked repo later as well.

I figured out what this was, after spending quite a while looking in all the wrong places. I was getting mysteriously high latencies for most of the tests, but it was due to the fact that the generator wasn't stopping on eos_token_id while running the constrained tests, while the benchmark still only recorded the number filtered by Formatron and calculated the wrong speed as a result.

The call should really look like this:

    generator.reset_page_table()
    output = generator.generate(
        prompt=prompt,
        max_new_tokens=max_new_tokens,
        gen_settings=settings,
        decode_special_tokens=True,
        add_bos=False,
        filters=context.filters,
        completion_only=True,
        stop_conditions=[generator.tokenizer.eos_token_id]
    )

Filters can't directly stop a generation when using logit masking, since there isn't a separate signal for that. With the control flow the way it is, really the only way is for Formatron to force sampling eos_token_id after it's reached an end state.

The call to generator.reset_page_table() prevents the unconstrained test from reusing keys/values computed during the constrained test. The prompts are short so it won't matter much, but this could still poison the results a bit since you're measuring the duration of generator.generate() which includes some prefill that would be skipped the second time around (the generator remembers as many past contexts as it can fit in the total cache).

Results are now (skipped the LMFE tests):

** Llama3-8B-Instruct 4.0bpw:

formatron_llama3_8b_6pw_exl2_address_json_exllamav2 generated 846 tokens with 142.64435695312466 tps (with warm up)

formatron_llama3_8b_6pw_exl2_address_json_exllamav2 unconstrained generated 1000 tokens with 153.04131192513768 tps

formatron_llama3_8b_6pw_exl2_address_json_exllamav2 overhead per token: 0.48 ms

formatron_llama3_8b_6pw_exl2_linkedlist_json_exllamav2 generated 1135 tokens with 148.7417758661944 tps (with warm up)

formatron_llama3_8b_6pw_exl2_linkedlist_json_exllamav2 unconstrained generated 2000 tokens with 157.67011257222438 tps

formatron_llama3_8b_6pw_exl2_linkedlist_json_exllamav2 overhead per token: 0.38 ms

formatron_llama3_8b_6pw_exl2_orders_json_exllamav2 generated 4126 tokens with 161.43395784009994 tps (with warm up)

formatron_llama3_8b_6pw_exl2_orders_json_exllamav2 unconstrained generated 5120 tokens with 163.0339325197231 tps

formatron_llama3_8b_6pw_exl2_orders_json_exllamav2 overhead per token: 0.06 ms

** Mistral-7B FP16

formatron_llama2_7b_4pw_exl2_address_json_exllamav2 generated 1593 tokens with 58.46335611520968 tps (with warm up)

formatron_llama2_7b_4pw_exl2_address_json_exllamav2 unconstrained generated 2000 tokens with 59.19809539059479 tps

formatron_llama2_7b_4pw_exl2_address_json_exllamav2 overhead per token: 0.21 ms

formatron_llama2_7b_4pw_exl2_linkedlist_json_exllamav2 generated 1488 tokens with 58.192735998633616 tps (with warm up)

formatron_llama2_7b_4pw_exl2_linkedlist_json_exllamav2 unconstrained generated 2000 tokens with 59.24330976987359 tps

formatron_llama2_7b_4pw_exl2_linkedlist_json_exllamav2 overhead per token: 0.30 ms

formatron_llama2_7b_4pw_exl2_orders_json_exllamav2 generated 5541 tokens with 58.98486332042575 tps (with warm up)

formatron_llama2_7b_4pw_exl2_orders_json_exllamav2 unconstrained generated 7000 tokens with 59.37867904609149 tps

formatron_llama2_7b_4pw_exl2_orders_json_exllamav2 overhead per token: 0.11 ms

I added an example here which also includes a simple benchmark. It measures a little differently but still seems to largely agree with yours. It also tests at arbitrary batch sizes, and the results are encouraging:

** Llama3-8B FP16:

bsz 1: overhead     1.22%     latency:   0.2185 ms
bsz 2: overhead     1.82%     latency:   0.3366 ms
bsz 4: overhead     1.82%     latency:   0.3602 ms

I think it is more complicated because moving indices tensor from CPU to GPU is not a fast operation either. On the other hand, torch have cuda memory cache so allocating a new tensor is not very slow. The current kbnf implementation updates logtis in-place if we only have a few disallowed tokens(so a small indices tensor). Otherwise, it allocates a new tensor filled with -inf and copy the allowed tokens logits to it to avoid copying a large tensor from CPU to GPU. Outlines always allocates a new tensor filled with -inf and copy the allowed token logits, which, I suspect, is why they run relatively slower on address_json which has a lot of allowed tokens due to the str type fields.

This makes sense, but ExLlama has logits in system RAM by the time the mask is applied. So there's no overhead from having to modify anything in VRAM, and updating in-place should always be at least as fast while also saving the overhead of one tensor allocation. Probably not a big deal, of course.

This makes sense, but ExLlama has logits in system RAM by the time the mask is applied. So there's no overhead from having to modify anything in VRAM, and updating in-place should always be at least as fast while also saving the overhead of one tensor allocation. Probably not a big deal, of course.

@turboderp I see. I definitely can handle tensors on CPU specially in kbnf so that I guarantee masking on CPU tensors will be in-place update. By the way, out of curiosity, how do you handle GPU->CPU logits transfer efficiently? I used to do that in kbnf and it leads to several milliseconds of latency.

I just copy them once, at the end of the forward pass. It adds a little bit of latency, but overall it amounts to 4*vocab_size bytes per token which is something on the order of 0.01 ms on a Gen4x8 link, give or take. This can happen asynchronously if you're copying to a pinned buffer in system RAM, but then you still have to synchronize before the CPU can access the data so the utility of that is very situational. Generally the bottleneck is going to be CUDA synchronization and the extra complications around it, like how it can take a while after any sync point for enough of a GPU workload to queue up that it can hide the latency of PyTorch/libtorch/etc. again.

In ExLlama the GPU->CPU transfer is inserted at the end of the CUDA queue for a forward pass, after which all sampling is done on the CPU. So the overhead is essentially the same as if you were sampling on the GPU and then moving a single sampled token to the CPU to be detokenized; essentially the same amount of PCIe latency, and the same number of synchronizations.

If you start copying data in multiple rounds (say, iterating over elements in a GPU tensor), or if you force more sync points by interrupting the GPU workload by involving the CPU, speed can drop dramatically as a result.

Dan-wanna-M / formatron

ExLlamaV2 integration: Add logit mask interface #18