An API server for LLM services.
Marc Schwarzschild
The Brookhaven Group, LLC
2024
This system provides a RESTful API with access control for an LLM service.
These services typically require large objects to be loaded into memory. For
efficiency, such objects are cached once and reused for all subsequent API
calls. Memory and GPU resources are expensive and may only handle a single
task at a time. In this configuration we have a single Linux server with two
GPUs. A separate LLM can be loaded and remain resident on each GPU, and the
design extends easily to n models on n GPUs.
The API is secured with a key that carries an expiration date and optional usage parameters limiting the number of API calls made.
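The server-side check can be sketched in plain Python; the record and field names below are illustrative, not the actual schema:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical key record; field names are illustrative only.
@dataclass
class ApiKey:
    key: str
    valid_from: date
    valid_to: date
    call_limit: Optional[int]  # None means unlimited
    calls_made: int = 0

def is_key_valid(rec: ApiKey, today: date) -> bool:
    """Reject keys outside their date window or over their call limit."""
    if not (rec.valid_from <= today <= rec.valid_to):
        return False
    if rec.call_limit is not None and rec.calls_made >= rec.call_limit:
        return False
    return True

rec = ApiKey("abc123", date(2024, 1, 1), date(2024, 12, 31), call_limit=100)
print(is_key_valid(rec, date(2024, 6, 1)))   # True while within dates and under limit
```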
The following code demonstrates how to use the API. The server checks that the call is made within the valid dates assigned to the key and, if the key is restricted to a number of calls, that this usage limit has not been exceeded.
"""
curl --request POST --url http://localhost:8000/analyze/ \
--header "Authorization: Bearer <put your key here>" \
--header 'Content-Type: application/json' \
--data '{"text": "Is this real?"}'
"""
import requests
key = "<put your key here>"
header = {'Authorization': f'Bearer {key}', 'Content-Type': 'application/json'}
data = {'text': 'This could be any payload data the api expects.'}
response = requests.post('http://localhost:8000/analyze/',
                         headers=header,
                         json=data)
print(response.content)
This code is in demo_analyze.py.
A usage endpoint is provided to report on API usage. It returns the number of words analyzed.
"""
curl --request GET --url http://localhost:8000/usage/ \
--header "Authorization: Bearer <put your key here>"
"""
import requests
key = "<put your key here>"
header = {'Authorization': f'Bearer {key}', 'Content-Type': 'application/json'}
response = requests.get('http://localhost:8000/usage/', headers=header)
print(response.content)
This code is in demo_usage.py.
Django is leveraged for our system. We chose it because it is a complete framework: we did not have to look further for components like the ORM, templating, security middleware, RESTful view base classes, and so on.
An example stack running on Ubuntu would include:
A serial Celery server limited to a single worker with concurrency set to 1 is configured so that anything cached by Celery is loaded once, on the first call, and memoized for all subsequent calls.
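The memoization pattern can be sketched without Celery's machinery. The Algorithm class below is a stand-in that merely counts loads; in the real system, a Celery task body running in the single worker process would call get_algorithm().run(text):

```python
from functools import lru_cache

class Algorithm:
    """Stand-in for the real model-loading class; loading is simulated."""
    load_count = 0

    def __init__(self):
        # A real constructor would load model weights onto a GPU here.
        Algorithm.load_count += 1

    def run(self, input_text):
        return f"analyzed: {input_text}"

@lru_cache(maxsize=1)
def get_algorithm():
    # With worker concurrency set to 1 there is a single process, so this
    # cache holds exactly one resident model for all subsequent calls.
    return Algorithm()

print(get_algorithm().run("first call"))
print(get_algorithm().run("second call"))
print(Algorithm.load_count)  # 1 -- the model was loaded only once
```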
The algorithm used by the API is defined in an external pip-installable
package. It must provide a class named Algorithm with a run(input_text)
method. An instance of Algorithm is memoized, so its constructor should
load the model data. The run(input_text) method implements the algorithm
using one or more LLMs and possibly GPUs.
An example is tbg_llm_example.
The tbg_llm_example package is also used in the unit tests.
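A minimal sketch of such a package's Algorithm class, with toy loading and analysis standing in for real model code (the method bodies are illustrative, not taken from tbg_llm_example):

```python
class Algorithm:
    def __init__(self):
        # Load large model objects here; this runs once because the
        # instance is memoized by the API server.
        self.model = self._load_model()

    def _load_model(self):
        # Placeholder for loading model weights onto a GPU.
        return {"name": "toy-model"}

    def run(self, input_text):
        # The real method would run the LLM-backed analysis;
        # here a toy transformation stands in for it.
        return {"input": input_text, "result": input_text.upper()}

alg = Algorithm()
print(alg.run("hello"))  # {'input': 'hello', 'result': 'HELLO'}
```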
A config file path must be set in the DJANGO_LLM_API_CONFIG environment
variable and typically has the value "~/.djangollmapi". The config file
used for unit tests is djangollmapi.config.
Private data should be set in that file.
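A sketch of how the server might read that file, assuming JSON for illustration (the real file format and keys may differ):

```python
import json
import os

def load_config():
    """Read the config file named by DJANGO_LLM_API_CONFIG.

    Falls back to the typical path "~/.djangollmapi" when the
    environment variable is unset. JSON content is an assumption
    made for this sketch.
    """
    path = os.path.expanduser(
        os.environ.get('DJANGO_LLM_API_CONFIG', '~/.djangollmapi'))
    with open(path) as fh:
        return json.load(fh)
```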
Our installation notes are provided here: