AI Data Distillation

The Problem

AI models are limited by the amount of context they can take in. When an AI agent calls the query API from a different location, the Query API should be able to provide the exact information the agent needs to complete its request. Developers can do this by visually sorting through websites and building mental maps, language models can hold a different amount of information in context than we do.

Experimentation

Based on a documentation web scraping experiment we conducted, we estimated that there is an average of approximately 5500 tokens per library or platform documentation. The 91 pages we indexed summed a total of 498801 tokens. Using gpt-3.5-turbo, OpenAI's cheapest chat completion model, it would cost $0.011 to scan each website and $0.98 to scan over a project's worth of documentation. This is a low cost for external users over the lifecycle of one project, though it can quickly get expensive for us during testing. We've been test-driving models specifically for summarization on HuggingFace that save the language models from scanning through and distilling every result.

Solution

We will use a sliding window approach to have a language model scan over the content and pull out excerpts relevant to the query. We can save a lot of time by searching websites (using a tool like bleve) to pre-align the window to the relevant content on the website. We also consider Extractive Question Answering models as another option for aligning the windows.

The language model will then take this extracted information and distill it further into an information-packed response to the agent's original query.

Model Candidates

We are using OpenAI's gpt-3.5-turbo for this generative task due to its low cost and ease of use. However, we are also testing alternatives in Meta's LLaMA family of models and Google's FLAN family of models.

How can I help with this task?

Besides direct code contributions and suggestions, any advice, experiment results, or experimental resources are helpful. If you want to grant us credits on your platform or donate to this initiative, contact us at opensource@csxlabs.org.

AI Data Distillation is a key feature of Solus and will likely be influential in the community, so all help is appreciated!

CSXL / solus

Query API - Implement AI Data Distillation #5