danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
https://docs.danswer.dev/

Feature Request: Support Direct Integrations w/ Nvidia Triton #907

Open cyk-gaudi opened 8 months ago

cyk-gaudi commented 8 months ago

Hi there! I wanted to start off by saying Danswer has been phenomenal to work with. We are running the enterprise deployment with a custom embedding model, and things have been great so far!

General Feature Summary: A feature that I think would be extremely beneficial for end users who want to operate in an air-gapped environment and/or stay as open-source as possible is native Danswer support for integration with Nvidia Triton. This would allow those of us who already have existing inferencing servers (i.e., Triton containers) to connect easily, without building out the custom model class referred to in the documentation (more on this below).

General Implementation Notes (thoughts on how it may work): As part of this feature, an end user would change a few environment variables in their .env file, as sketched after the notes below.

Regarding the endpoint, Triton supports both a REST API and gRPC. A gRPC implementation would likely be better suited; however, it isn't necessary for the feature to be functional. A new "GEN_AI_API_ENDPOINT_PROTOCOL" environment variable might be useful for such a deployment, with example values like "rest" and "grpc".

Specifying "triton" as the GEN_AI_MODEL_PROVIDER would then resolve to a pre-written Triton model class under the "backend/danswer/llm" path, similar to how the current OpenAI functionality works. Triton supports streaming inference as well.
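For illustration, the .env changes might look like the sketch below. GEN_AI_MODEL_PROVIDER is the existing variable mentioned above; the endpoint variable name and value are assumptions, and GEN_AI_API_ENDPOINT_PROTOCOL is the new variable proposed here:

```
# Selects the (proposed) pre-written Triton model class
GEN_AI_MODEL_PROVIDER=triton
# Address of an existing Triton inference server (variable name is an assumption)
GEN_AI_API_ENDPOINT=triton.internal.example.com:8001
# Proposed new variable: "grpc" or "rest"
GEN_AI_API_ENDPOINT_PROTOCOL=grpc
```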

General Notes That May Be Helpful: I've briefly tried writing a custom model class that interfaces with Triton in a non-streaming fashion, and I ran into several issues that I thought would be important to share.

The first is inference time. Triton takes significantly longer to process all of the prompts forwarded to the inferencing server compared to an OpenAI deployment. With all of the inference requests associated with a single prompt (checking chunk usefulness, the actual prompt response, etc.), it took anywhere from 1-3 minutes to process the ~15 inference requests forwarded to Triton. This exceeds NGINX's default 60-second timeout on the initial HTTP request, so the timeout would need to be increased for the particular routes that are part of the question/answer functionality, as sketched below.

Second, I ran into issues when Triton responded with the full answer rather than token by token in a streaming fashion. There were specifically issues with the QA checks: the answer, despite being received in the "stream_answer_objects" function in the "answer_question.py" file, did not pass the checks, causing the query_validation route to return with "answerable" being False and the "reasoning" attribute being an empty string. I didn't get a chance to dive deeper to see what may need to be tweaked here.
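For reference, raising the timeout for the affected routes might look something like the following. This is a generic proxy location block rather than Danswer's actual NGINX config; the route path and upstream name are placeholders:

```nginx
location /api/query {
    proxy_pass http://api_server;
    # NGINX's proxy_read_timeout defaults to 60s; slow Triton inference
    # on the question/answer routes needs more headroom
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
}
```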

Some Helpful Links: There is a pretty good class that has already been written to interface with Triton via gRPC. The link to this class can be found here, and some context on where the file comes from can be found here. This should help provide a jump start on interacting with Triton. Note: you will also need to include the following in your Dockerfile for the api_server: "RUN pip3 install tritonclient[all] numpy" (asyncio and argparse ship with Python's standard library, so they don't need to be pip-installed).
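As a rough starting point, a minimal non-streaming gRPC call with the official tritonclient package might look like the sketch below. The model name and the "text_input"/"text_output" tensor names are assumptions; they must match the model's configuration in the Triton model repository:

```python
import numpy as np
import tritonclient.grpc as grpcclient


def triton_generate(prompt: str, url: str = "localhost:8001") -> str:
    """Send one prompt to a Triton server over gRPC and return the text response.

    The tensor names ("text_input"/"text_output") and the model name are
    placeholders and must match the model's config.pbtxt on the server.
    """
    client = grpcclient.InferenceServerClient(url=url)

    # Triton represents strings as BYTES tensors, so wrap the prompt accordingly
    text_input = grpcclient.InferInput("text_input", [1], "BYTES")
    text_input.set_data_from_numpy(
        np.array([prompt.encode("utf-8")], dtype=np.object_)
    )

    requested_output = grpcclient.InferRequestedOutput("text_output")

    result = client.infer(
        model_name="my-llm",  # hypothetical model name in the Triton model repo
        inputs=[text_input],
        outputs=[requested_output],
    )
    return result.as_numpy("text_output")[0].decode("utf-8")
```

A streaming variant would presumably go through the gRPC client's start_stream/async_stream_infer path instead, which is where the gRPC protocol suggestion above pays off.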

Hopefully this provides enough general insight to make this feature worth considering for implementation! Thanks, and hope this helps! :)

yuhongsun96 commented 8 months ago

Hey! It's very exciting to hear about your usage of Danswer and the custom embedding model. It was part of our hypothesis that an open-source solution may be the most useful, because people will want to customize things. We'd absolutely love to speak with you!

Would you care to join our Slack? Or, if you prefer, you can directly book a call with both of us maintainers (Chris and myself) by choosing a time on my Cal.

Also, are you willing to contribute the Nvidia Triton class for us? We'd love to have it!

cyk-gaudi commented 8 months ago

Hi there! Apologies for the delay! I would love to speak with you all as well; I will join the Slack and schedule a meeting later today! And yes, I can help contribute the Nvidia Triton class! :)

yuhongsun96 commented 8 months ago

Hey, just realized that the Slack link was dead; I've updated it to the new one.