DNGros / lmwrapper

An object-oriented wrapper around language models (like OpenAI endpoints or Hugging Face)

vLLM, Accelerate, and some ExLLama #32

Closed claudiosv closed 7 months ago

claudiosv commented 7 months ago

This PR introduces vLLM support. Unfortunately, I was not able to get vLLM to run 70B CodeLlama on Mariana, which I thought would handle it. I poked around Accelerate again, which can run 70B models via offloading, and this time it seems to return the logprobs correctly (the issue we had before). I have replicated the HuggingFace tests using Accelerate. Perhaps we want to parameterize the tests over runtimes instead, as sketched below? Or, even better, test equivalency between vanilla PyTorch and Accelerate?
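For illustration, a minimal sketch of what parameterizing the suite over runtimes could look like; the runtime identifiers and the `load_lm` helper are placeholders rather than the actual lmwrapper API:

```python
import pytest

# Hypothetical runtime identifiers; the real ones would come from lmwrapper.
RUNTIMES = ["pytorch", "accelerate", "vllm"]


def load_lm(model_name: str, runtime: str):
    """Placeholder for however the suite constructs a model under a runtime."""
    pytest.skip(f"model loading for runtime={runtime!r} not wired up in this sketch")


@pytest.mark.parametrize("runtime", RUNTIMES)
def test_logprob_equivalency(runtime):
    # Idea: run the same prompt under every runtime and check that the
    # completions/logprobs agree with the vanilla PyTorch reference.
    lm = load_lm("codellama/CodeLlama-7b-hf", runtime)
    prediction = lm.predict("Once upon a time")
    assert prediction is not None
```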

As for vLLM and ExLLama, I'd still like to test whether ExLLama can load 70B on Mariana. However, the ExLLama implementation in this codebase is basically a placeholder. With that in mind, it raises the question of whether we even want to bother maintaining the vLLM integration. I still believe it is much faster than vanilla HuggingFace; perhaps benchmarks, along the lines of the sketch below, could help us decide?
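Even a rough timing harness should be enough to make that call; a minimal sketch (again with `load_lm` as a placeholder for the actual model construction):

```python
import time

PROMPTS = ["def fibonacci(n):", "class LinkedList:", "# binary search"]


def mean_seconds_per_completion(lm, prompts, repeats: int = 3) -> float:
    """Time lm.predict over the prompts and return mean seconds per call."""
    start = time.perf_counter()
    for _ in range(repeats):
        for prompt in prompts:
            lm.predict(prompt)
    return (time.perf_counter() - start) / (repeats * len(prompts))


# Usage sketch:
#   hf_lm = load_lm("codellama/CodeLlama-7b-hf", runtime="pytorch")
#   vllm_lm = load_lm("codellama/CodeLlama-7b-hf", runtime="vllm")
#   print(mean_seconds_per_completion(hf_lm, PROMPTS),
#         mean_seconds_per_completion(vllm_lm, PROMPTS))
```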

As for the tests, vLLM passes everything except the following:

FAILED test/test_huggingface.py::test_code_llama_autoregressive[codellama/CodeLlama-7b-hf] - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU
FAILED test/test_huggingface.py::test_code_llama_infill[codellama/CodeLlama-7b-hf] - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU  has a total capacity of 23.64 GiB of which 85.38 MiB is free. Including non-PyTorch memory, this process has 23.56 GiB memory in use. Of the allocated memory 23.34 GiB is allocated by PyTorch, and 20.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
FAILED test/test_huggingface.py::test_code_llama_infill[codellama/CodeLlama-7b-Instruct-hf] - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU
FAILED test/test_huggingface.py::test_code_llama_conversation[codellama/CodeLlama-7b-Instruct-hf] - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU  has a total capacity of 23.64 GiB of which 7.38 MiB is free. Including non-PyTorch memory, this process has 23.63 GiB memory in use. Of the allocated memory 23.43 GiB is allocated by PyTorch, and 7.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
FAILED test/test_huggingface.py::test_all_pytorch_runtime[Salesforce/instructcodet5p-16b] - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 600.00 MiB. GPU  has a total capacity of 23.64 GiB of which 363.38 MiB is free. Including non-PyTorch memory, this process has 23.29 GiB memory in use. Of the allocated memory 23.08 GiB is allocated by PyTorch, and 12.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
FAILED test/test_huggingface.py::test_code_llama_stop - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU
FAILED test/test_models_common.py::test_echo[lm0] - AssertionError: assert [' Once', ' upon', ' a'] == ['Once', ' upon', ' a']
FAILED test/test_models_common.py::test_low_prob_in_weird_sentence[lm0] - AssertionError: assert [' The', ' Empire', ' State', ' Building', ' is', ' in', ' New', ...

The out-of-memory errors appear to be caused by the test runner instantiating the models many times without them being properly destroyed/garbage-collected (as far as I can tell); when run individually, those tests pass. The failures of actual interest are test/test_models_common.py::test_echo, which shows the first token coming back with an extra leading space, and test/test_models_common.py::test_low_prob_in_weird_sentence, which shows the same behavior. A cleanup fixture along the lines sketched below might rule the cross-test OOMs in or out.
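A minimal sketch of such a fixture, using standard gc/torch calls (this fixture is not currently in the suite):

```python
import gc

import pytest
import torch


@pytest.fixture(autouse=True)
def free_gpu_memory():
    """After each test, drop collectable references and release cached CUDA
    blocks so one model's allocation does not starve the next test."""
    yield
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```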

As for the HuggingFaceModelInfo class, the idea is the same as OpenAiModelNames. Right now it is unused, but ideally these two designs could be merged somehow? The important metadata that is not easily accessible via AutoConfig or AutoTokenizer is ultimately the token limit (tokenizers are often incorrectly configured), whether the model supports completions, infill, and/or chat dialogs, whether the model is designed to produce embeddings (if we want to try more RAG things), and whether it is gated (the large Llamas are/were gated, i.e. they require an API token). A rough sketch of these fields is below.
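To make the intended metadata concrete, here is an illustrative sketch of the fields described above; the names are placeholders, not the class as implemented in this PR:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HuggingFaceModelInfo:
    """Illustrative shape for per-model metadata that AutoConfig/AutoTokenizer
    do not reliably expose."""
    name: str
    token_limit: int                    # tokenizer configs are often wrong about this
    supports_completion: bool = True
    supports_infill: bool = False
    supports_chat: bool = False
    is_embedding_model: bool = False    # for RAG-style use cases
    is_gated: bool = False              # e.g. large Llamas requiring an API token
```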

I also tried getting Together's API to work for Code Llama. While the API itself works fine, it is no longer compatible with our OpenAI package version, i.e. it no longer offers completions via an OpenAI-style API. We'd have to implement another client.

claudiosv commented 7 months ago
====================================================================== short test summary info ======================================================================
FAILED test/test_huggingface.py::test_logprobs_echo_stop_codegen2 - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU
FAILED test/test_huggingface.py::test_stop_token_removal - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 148.00 MiB. GPU
FAILED test/test_huggingface.py::test_stop_tokens - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 148.00 MiB. GPU
FAILED test/test_huggingface.py::test_distilgpt2_pytorch_runtime - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 148.00 MiB. GPU
================================================= 4 failed, 147 passed, 36 skipped, 6 warnings in 465.65s (0:07:45) =================================================
(lmwrapper_new) ➜  lmwrapper git:(vllm) ✗ pytest -ss -vv --runslow test/test_huggingface.py::test_logprobs_echo_stop_codegen2 test/test_huggingface.py::test_stop_token_removal test/test_huggingface.py::test_stop_tokens test/test_huggingface.py::test_distilgpt2_pytorch_runtime
======================================================================== test session starts ========================================================================
platform linux -- Python 3.12.2, pytest-8.1.1, pluggy-1.4.0 -- /home/claudios/miniforge3/envs/lmwrapper_new/bin/python3.12
cachedir: .pytest_cache
rootdir: /home/claudios/lmwrapper
configfile: pyproject.toml
collected 4 items                                                                                                                                                   

test/test_huggingface.py::test_logprobs_echo_stop_codegen2 A new version of the following files was downloaded from https://huggingface.co/Salesforce/codegen2-1B:
- configuration_codegen.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
PASSED
test/test_huggingface.py::test_stop_token_removal PASSED
test/test_huggingface.py::test_stop_tokens PASSED
test/test_huggingface.py::test_distilgpt2_pytorch_runtime PASSED