What does this PR do?

This PR adds a new service called `replicate` that acts as a proxy between the front end and the Replicate.ai API.
The service supports streaming, chat, and completion requests.
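For orientation, here is a minimal sketch of how such a proxy could forward a streaming request to Replicate's predictions API and relay the server-sent events back to the front end. The payload fields, URL handling, and function name are assumptions for illustration, not the PR's actual implementation.

```python
# Hypothetical sketch of the proxy's streaming path; payload fields,
# helper names, and error handling are illustrative assumptions.
import os
import httpx

REPLICATE_API = "https://api.replicate.com/v1/predictions"
API_TOKEN = os.environ["REPLICATE_API_TOKEN"]
HEADERS = {"Authorization": f"Token {API_TOKEN}", "Content-Type": "application/json"}


async def stream_completion(model_version: str, prompt: str):
    """Create a prediction with streaming enabled and relay its SSE chunks."""
    payload = {"version": model_version, "input": {"prompt": prompt}, "stream": True}

    async with httpx.AsyncClient(timeout=None) as client:
        # Create the prediction; the response is expected to contain a stream URL.
        created = (await client.post(REPLICATE_API, json=payload, headers=HEADERS)).json()
        stream_url = created["urls"]["stream"]

        # Forward the server-sent events to the caller (e.g. the front end).
        sse_headers = {"Accept": "text/event-stream", "Authorization": HEADERS["Authorization"]}
        async with client.stream("GET", stream_url, headers=sse_headers) as response:
            async for line in response.aiter_lines():
                if line.startswith("data:"):
                    yield line.removeprefix("data:").strip()
```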
The service uses a model's chat template when one is available on Hugging Face; otherwise it falls back to a basic chat template so that completion-only models can still be used in a chat setting.
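As an illustration of that fallback path, a basic chat template might simply flatten the message history into a single prompt for completion-only models. The role markers and formatting below are assumptions; the template actually shipped in the service may differ.

```python
# Hypothetical basic chat template for models that only expose a completion API.
# Role markers and the trailing generation cue are illustrative assumptions.
def basic_chat_template(messages: list[dict], add_generation_prompt: bool = True) -> str:
    parts = [f"{m['role'].capitalize()}: {m['content']}" for m in messages]
    if add_generation_prompt:
        parts.append("Assistant:")  # cue the completion model to produce the reply
    return "\n".join(parts)


# Example: a chat history rendered as a plain completion prompt.
prompt = basic_chat_template([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this PR."},
])
```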
Replicate.ai is made available in the prompting interface together with OpenAI and local models.
The PR also updates the interface to better separate model sources and to improve error handling.
Note 1: All currently available Replicate.ai models that support streaming are hardcoded in the service. They were scraped from the Replicate.ai website, since Replicate.ai does not currently expose an endpoint for listing available models.
Note 2: Model chat templates are pulled from Hugging Face. They are also hardcoded in the service to avoid problems with gated models on Hugging Face, such as Llama 2.
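To make the two notes concrete, the hardcoded data could take roughly the shape below. The model names and template strings are examples only; the real entries were scraped from the Replicate.ai website and from Hugging Face.

```python
# Illustrative shape of the hardcoded registry; entries and template strings are
# examples, not the actual scraped data.
STREAMING_MODELS = {
    "meta/llama-2-70b-chat": {
        # Template mirrored from Hugging Face so gated repos (e.g. Llama 2)
        # never need to be fetched at runtime.
        "chat_template": "<s>[INST] {prompt} [/INST]",
    },
    "mistralai/mistral-7b-instruct-v0.1": {
        "chat_template": "<s>[INST] {prompt} [/INST]",
    },
}
```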