For the record, steps per second (SPS) used to be 13.
With torch.compile and torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False), I was able to get the SPS to ~25.
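Roughly what that combination looks like (a minimal sketch; the tiny transformer layer and random inputs below are stand-ins for the actual model and batch used here):

```python
import torch
import torch.nn as nn

# Stand-in for the real policy/LLM; flash attention needs fp16/bf16.
model = nn.TransformerEncoderLayer(
    d_model=64, nhead=4, batch_first=True
).cuda().half()
model = torch.compile(model)  # one-time compile (PyTorch 2.x)

inputs = torch.randn(8, 128, 64, device="cuda", dtype=torch.half)

# Force the flash-attention SDP backend; disable the slower fallbacks.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = model(inputs)
```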
I also loaded the model in 4-bit quantisation with bitsandbytes, but the model is so small I'm worried this will degrade performance massively...
For some reason it keeps failing when I try to use bitsandbytes + torch.compile, so I've removed quantisation for now.
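For reference, this is the kind of 4-bit loading I was doing, via the transformers/bitsandbytes integration (a sketch; "gpt2" is a placeholder checkpoint, not necessarily the one this repo uses):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumption: "gpt2" stands in for the actual model checkpoint.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # weights in 4-bit, compute in fp16
)
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=quant_config,
    device_map="auto",
)
```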
I should switch to using model.generate with output_hidden_states as a kwarg instead of just going through forward; that might help with doing batched inference over multiple environments.
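Something like the following (a sketch, again with "gpt2" as a placeholder checkpoint and made-up per-environment prompts):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: placeholder model; left-padding so generation aligns in a batch.
tok = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One prompt per environment, served by a single batched generate call.
prompts = ["obs from env 0", "obs from env 1"]
batch = tok(prompts, return_tensors="pt", padding=True)

out = model.generate(
    **batch,
    max_new_tokens=16,
    output_hidden_states=True,     # expose hidden states for each step
    return_dict_in_generate=True,  # required to get .hidden_states back
)
# out.hidden_states is a tuple (one entry per generated token) of
# per-layer hidden-state tensors, with the batch dimension intact.
```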
This scope is getting too big. The update is complete; I'll make another branch for improving LLM performance with batching and related work.
Updating to torch 2 would be so nice. I'd be able to use torch.compile and other features to improve throughput; the VLM is such a severe bottleneck...