amVizion / BI-LLM


Attribute Store: Reuse labels to improve analysis accuracy, and speed. #6

Closed amVizion closed 2 months ago

amVizion commented 2 months ago

The Problem

The entire analysis depends on the labelling step. While the labelling pipeline provides a good customer experience, going from data to insights in a single CLI command, the process is slow and the selected labels are suboptimal. Additionally, the attribute store is a key requirement to enable a web-hosted version of the library.

The quality problems include duplicated lexemes like "Curious" and "Curiosity". The labels also suffer from model hallucinations that produce generic, low-value labels like "YouTube" or "Views" when describing what drives engagement for YouTube titles. The attributes may also lack the specificity and granularity the analysis requires. For example, when asking about the entities in the videos, the maximum of roughly 12 labels the LLM can provide is far too few to express the diversity needed to accurately describe a video through labels and attributes. Lastly, some labels lack descriptiveness. For example, it is unclear what the term "Life" entails, which diminishes the performance of the analysis.
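The duplicated-lexeme problem could be mitigated before labels ever reach the store. A minimal sketch, assuming a crude suffix-stripping stem is enough to collapse variants like "Curious" and "Curiousity" (the suffix list and function names are illustrative, not the library's actual normalization rules):

```python
# Collapse near-duplicate labels by comparing crude stems
# (lowercased, one common suffix stripped). Illustrative only.

SUFFIXES = ["ousity", "osity", "ness", "ity", "ing", "ous"]

def stem(label: str) -> str:
    """Lowercase a label and strip one common suffix, if any."""
    s = label.lower()
    for suffix in SUFFIXES:
        if s.endswith(suffix) and len(s) > len(suffix) + 2:
            return s[: -len(suffix)]
    return s

def dedupe_labels(labels: list[str]) -> list[str]:
    """Keep the first label seen for each stem; drop later variants."""
    seen: dict[str, str] = {}
    for label in labels:
        seen.setdefault(stem(label), label)
    return list(seen.values())
```

A real implementation would likely use a proper stemmer or embedding similarity, but even this level of normalization prevents the store from accumulating duplicate attributes.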

The speed problems are mainly due to the slow inference latency of LLMs. The labelling pipeline takes about 5 seconds per execution, but scoring can take more than 30 seconds. This also limits the amount of data that can be labeled, affecting the accuracy of the predictors. Reusing labels could even speed up configuration. For example, once a vertical is defined, its attributes could be selected automatically, or even intelligently.
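Reuse is the most direct answer to the latency problem: texts that were already labelled should never hit the LLM again. A minimal sketch of an on-disk label cache, where the file layout and the `label_with_llm` callable are assumptions for illustration:

```python
# On-disk label cache: texts already labelled are served from a JSON
# file keyed by content hash, skipping the slow LLM call entirely.
import hashlib
import json
from pathlib import Path
from typing import Callable

class LabelCache:
    def __init__(self, path: Path):
        self.path = path
        self.cache = json.loads(path.read_text()) if path.exists() else {}

    def get_labels(self, text: str,
                   label_with_llm: Callable[[str], list[str]]) -> list[str]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.cache:             # slow path: one LLM call
            self.cache[key] = label_with_llm(text)
            self.path.write_text(json.dumps(self.cache))
        return self.cache[key]                # fast path: reuse stored labels
```

Hashing the text rather than using it directly as a key keeps the store compact and avoids filename or key-length issues for long titles.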

Proposed Solution

Training is idempotent: the results can be fully reproduced from the available data, except where randomness is involved, for example when querying an LLM. The solution starts by storing the PCA model, including the texts used to train it. It is important to store the PCA model so that it can be reused for inference. The same is required for the labels and their predictors. Scores are also important, but mainly when retraining is required, for example to improve results after inaccurate results on downstream tasks. A suggested approach would be to have independent stores for each attribute. This modular approach would enable extensibility to incorporate causal knowledge about the impact of attributes on outputs, even before federated learning is implemented. Storing the results could even enable a RAG approach where relevant results are retrieved to enhance the analysis. This could be triggered by a multiagentic workflow.
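The independent-stores idea can be sketched as one file per attribute, each holding everything needed to reuse or retrain that attribute in isolation. The record fields (`predictor`, `scores`) are assumptions standing in for whatever artifacts the pipeline actually produces, not the library's schema:

```python
# One JSON file per attribute: predictor parameters and scores travel
# together, so a single attribute can be retrained or extended (e.g.
# with causal metadata later) without touching the others.
import json
from pathlib import Path

class AttributeStore:
    def __init__(self, root: Path):
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def save(self, attribute: str, predictor: dict,
             scores: list[float]) -> None:
        record = {"attribute": attribute,
                  "predictor": predictor,
                  "scores": scores}
        (self.root / f"{attribute}.json").write_text(json.dumps(record))

    def load(self, attribute: str) -> dict:
        return json.loads((self.root / f"{attribute}.json").read_text())

    def list_attributes(self) -> list[str]:
        return sorted(p.stem for p in self.root.glob("*.json"))
```

The PCA model itself would be persisted alongside these stores (e.g. serialized with its training texts), since inference must run on the same projection that training produced.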

A feasible approach to the multiagentic, federated future could be to provide analysis by labels or verticals. This could include conditional statistics based on attributes, requested via the web or set in the config. For example, the attributes that contribute most to the performance of a given outcome in a dataset could be identified, highlighted, and prompted for an explanation. A subsequent web experience would enable exploration of the insights. As a first step, define the end-to-end customer experience that motivates the analysis, with specific questions based on the priority industry verticals: finance and social media marketing.
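The conditional statistics mentioned above can be made concrete with a simple lift computation: for each attribute, compare the mean outcome of items carrying it against the overall mean, then rank. The input shape (`labels` and `outcome` per item) is an assumption for illustration:

```python
# Rank attributes by how far the mean outcome of items carrying them
# deviates from the overall mean ("lift"). Top-ranked attributes are
# the ones to highlight and prompt the LLM to explain.
from statistics import mean

def attribute_lift(items: list[dict]) -> list[tuple[str, float]]:
    """items: [{"labels": [...], "outcome": float}, ...]
    Returns (attribute, mean_outcome - overall_mean), sorted descending."""
    overall = mean(item["outcome"] for item in items)
    by_attr: dict[str, list[float]] = {}
    for item in items:
        for label in item["labels"]:
            by_attr.setdefault(label, []).append(item["outcome"])
    lifts = [(attr, mean(vals) - overall) for attr, vals in by_attr.items()]
    return sorted(lifts, key=lambda t: t[1], reverse=True)
```

This deliberately ignores confounding between co-occurring attributes; the causal extension described earlier would be the place to address that.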

Technical challenges

Development Roadmap

Backlog