Open conceptofmind opened 1 year ago
I need to optimize every tool that uses a Hugging Face model, such as NMT. Maybe Kernl to rewrite the compute graphs, or torch.jit, or FlashAttention. Inference speed is key for these.
Investigate FasterTransformer and the Triton Inference Server as well.
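For reference, the core trick behind FlashAttention is computing softmax attention in tiles with a running max and normalizer, so the full N x N score matrix is never materialized. A minimal NumPy sketch of that online-softmax idea (not the actual fused CUDA kernel, just the math it implements):

```python
import numpy as np

def naive_attention(Q, K, V):
    # standard softmax(Q K^T / sqrt(d)) V; materializes the full score matrix
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=16):
    # FlashAttention-style pass: walk K/V in blocks, keep a running row max
    # and softmax denominator, and rescale earlier partial results as the
    # max updates -- O(N * block) memory instead of O(N^2)
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    m = np.full(n, -np.inf)   # running row-wise max of the scores
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                 # n x block scores for this tile
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])
        correction = np.exp(m - m_new)         # rescale previous partials
        l = l * correction + P.sum(axis=-1)
        out = out * correction[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]
```

Both functions return the same result; the tiled version is what the fused kernel computes without ever holding the N x N matrix.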
LoRA + DeepSpeed + FlashAttention + maybe 8-bit
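The reason this combo works: LoRA only trains a low-rank update on top of a frozen base weight, so the base model can stay 8-bit / sharded under DeepSpeed. A minimal NumPy sketch of the LoRA forward pass (illustrative, not the `peft` API):

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=16):
    # W: frozen base weight (d_out x d_in) -- can live in 8-bit, never updated
    # A: trainable down-projection (r x d_in), B: trainable up-projection (d_out x r)
    # effective weight is W + (alpha / r) * B @ A, but we never merge during training
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```

With r much smaller than d_in/d_out, the trainable parameter count drops from d_out * d_in to r * (d_in + d_out); at inference time B @ A can be merged back into W so there is zero added latency.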
Just gonna do GPTQ.
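For intuition on what GPTQ buys you: the naive baseline is per-channel round-to-nearest weight quantization, sketched below in NumPy; GPTQ improves on this by quantizing columns one at a time and using (approximate) Hessian information to push each column's rounding error onto the not-yet-quantized weights. This is just the RTN baseline, not GPTQ itself:

```python
import numpy as np

def quantize_rtn_4bit(W):
    # per-output-channel symmetric 4-bit round-to-nearest quantization;
    # each row gets one fp scale, weights become signed 4-bit integers
    qmax = 7  # symmetric range [-7, 7] inside the signed 4-bit [-8, 7]
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # reconstruct fp weights; rounding error is at most scale / 2 per entry
    return q.astype(np.float64) * scale
```

GPTQ keeps the same storage format (int weights + per-channel/group scales) but picks the integers so the layer's *output* error on calibration data is minimized, which is why it holds up at 3-4 bits where RTN degrades.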