Closed by rchan26 4 months ago
The Ollama API and Huggingface TGI API have now been added as options for running inference on locally hosted models. We should only implement async models, so at some point we ought to remove the sync BaseModel classes, as they are now redundant.
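For context, a minimal sketch of what an async query against the Ollama API might look like (the endpoint, payload, and default port follow Ollama's documented `/api/generate` route; the model name and wrapper function are illustrative, not this repo's actual classes):

```python
import asyncio

import httpx


async def ollama_generate(prompt: str, model: str = "llama2") -> str:
    # Ollama listens on localhost:11434 by default; /api/generate is its
    # documented completion endpoint. stream=False returns one JSON object.
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
        )
        response.raise_for_status()
        return response.json()["response"]


if __name__ == "__main__":
    print(asyncio.run(ollama_generate("Why is the sky blue?")))
```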
Closing this, as #37 and #47 implement simple Quart endpoints for hosting Huggingface models via `transformers.pipeline`.
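A minimal sketch of such a Quart endpoint, for reference (the route and model name are assumptions for illustration, not the actual code from #37/#47):

```python
from quart import Quart, jsonify, request
from transformers import pipeline

app = Quart(__name__)

# Load the pipeline once at startup; "gpt2" is just a stand-in model here.
generator = pipeline("text-generation", model="gpt2")


@app.route("/generate", methods=["POST"])
async def generate():
    data = await request.get_json()
    # transformers.pipeline is synchronous, so this call blocks the event
    # loop: fine for a simple local endpoint, not for heavy concurrency.
    outputs = generator(data["prompt"], max_new_tokens=100)
    return jsonify({"text": outputs[0]["generated_text"]})


if __name__ == "__main__":
    app.run(port=5000)
```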
Locally hosted models will perform differently and won't necessarily have a strict rate limit. Maybe in this setting, we send requests sequentially rather than asynchronously, i.e. utilise the `query` method of the model class, not `async_query`.
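In other words, something along these lines (a sketch: `query`/`async_query` follow the naming above, but the model class itself is hypothetical):

```python
import asyncio


def run_sequential(model, prompts):
    # Sequential: fine for a locally hosted model with no rate limit to
    # exploit -- requests queue behind one another and the local hardware
    # processes them one at a time anyway.
    return [model.query(p) for p in prompts]


async def run_concurrent(model, prompts):
    # Concurrent: worth it against rate-limited hosted APIs, where
    # overlapping requests lets us use more of the allowed throughput.
    return await asyncio.gather(*(model.async_query(p) for p in prompts))
```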