andrecharneca opened 11 months ago

Are there any plans to add torch.compile speed-ups to LMQL Transformers models? Thanks
Hi there Andre, can you recommend any resources on how torch.compile improves inference speed, e.g. with transformers?

In general I am definitely not opposed to adding it.
For example: https://huggingface.co/docs/transformers/main/perf_torch_compile. That guide uses Vision Transformers, but the results should be similar for text models. From my own experimentation with torch.compile on LLMs, compilation can take quite a while, so the performance gains really depend on the specific use case. It would still be a nice feature to add, since enabling it is so simple.
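For reference, a minimal sketch of what this looks like with a Hugging Face causal LM (the model name and prompt are placeholders, not anything from the LMQL codebase):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; any causal LM from the Hub works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# Compile only the forward pass so .generate() keeps working as usual.
# The first call triggers the (potentially slow) compilation; later calls
# reuse the compiled graph, which is where the speed-up comes from.
model.forward = torch.compile(model.forward)

inputs = tokenizer("torch.compile can speed up", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Note that torch.compile may recompile when it sees new input shapes, so short benchmarks can understate the warm-up cost.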
Marking this as a good first issue.
The feature can be added to https://github.com/eth-sri/lmql/blob/main/src/lmql/models/lmtp/backends/transformers_model.py, where an optional `lmql serve-model` argument can be set so that compilation is done before model serving begins (see the sketch below).
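A rough sketch of what that could look like; the `compile_model` option and the class internals shown here are illustrative assumptions, not the actual code in transformers_model.py:

```python
# Hypothetical sketch only: option name and loading code are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class TransformersModel:
    def __init__(self, model_id: str, compile_model: bool = False):
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(model_id).eval()
        if compile_model and hasattr(torch, "compile"):
            # torch.compile itself is lazy, so also run a warm-up forward pass
            # to trigger compilation before the server starts taking requests.
            self.model.forward = torch.compile(self.model.forward)
            warmup = self.tokenizer("warm up", return_tensors="pt")
            with torch.no_grad():
                self.model(**warmup)
```

Usage would then be something along the lines of `lmql serve-model <model> --compile` (flag name hypothetical), with the compilation cost paid during model loading rather than on the first client request.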