RAG Prediction Pipeline

Problem Statement

Making predictions has proven to be the most effective method to describe what drives engagement. However, Reflexion alone has proven insufficient to make predictions at scale. There are multiple problems when attempting to use a single data analysis to make predictions. The first is that the content of a single analysis does not accommodate the diversity required to assess the potential engagement of a given video. The second is that there may be contradictory drivers of engagement. For example, the emotions that drive engagement for tech content include curiosity and aspiration. But for most content those are detriments, and instead negative emotions like fear, concern, or even anger drive engagement. The final problem is that the title alone does not provide enough information to assess the potential engagement of a given video. Instead, factors like the quality of the content, the allure of the cover image, and the channel's historical engagement are also important contributors to video engagement.

Beyond the ineffectiveness of a single data analysis to make accurate predictions, Reflexion also fails to make improvements to the prediction model. The problem is that the task is too broad, and it fails to identify the insights that drive engagement, let alone formulate the questions that might lead to those insights. The original intended solution was to teach the Reflexion agent to ask questions, essentially establishing an interface for the Large Language Model to query the database. However, given that the model is unable to identify the root cause of inaccurate predictions, building that interface would not alleviate the challenge of improving predictions by Self-Reflexion alone. It is important, however, that the solution is built so that later integration with the Reflexion agent is seamless.

After the first two analyses, it is clear that Reflexion will provide a path to automate the first section of the analysis, which identifies the content attributes that drive engagement. The second part of the analysis, which explores and compares the adjacent niches of the channel, does not provide a path for automation given the high latency of Reflexion iterations. The RAG Prediction Pipeline is expected to provide a path to describe, and later automate, the analysis of niches. Defining the niches is also important to identify growth opportunities, by estimating the total audience potential and understanding what is working in every niche. The next iteration of the analysis will also provide insights on the leadership position of the channel, and its potential for growth.

Proposed Solution

The RAG Prediction Pipeline starts by creating an individual analysis for a given attribute. An iterative algorithm starts from the center and, at each step, selects the attribute farthest from those where an analysis already exists. To create the report, a fixed number of videos are selected by similarity to measure the average performance for the given attribute. The analysis shall run uninterrupted once initialized. Once initialized, it can start making predictions by similarity: find the closest summary, and use it to evaluate predictions. Finding the summary is the beginning of the embedding store. Once the predictions are made, the Reflexion algorithm can be used to improve them. The best performing summaries, those above a given accuracy threshold, can be used to describe a niche via summarization. The summary will be composed via hierarchical cluster summarization.
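The exploration and lookup steps described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes attributes, videos, and summaries are all embedded in the same vector space, uses Euclidean distance as the similarity measure, and uses average views as the performance metric. The function names (`exploration_order`, `attribute_report`, `closest_summary`) and the sample data are hypothetical.

```python
import math
from statistics import mean


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [mean(column) for column in zip(*vectors)]


def exploration_order(attributes, steps):
    """Start at the attribute nearest the center, then repeatedly pick the
    attribute farthest from every attribute that already has an analysis."""
    center = centroid(list(attributes.values()))
    order = [min(attributes, key=lambda a: math.dist(attributes[a], center))]
    while len(order) < min(steps, len(attributes)):
        remaining = [a for a in attributes if a not in order]
        order.append(max(
            remaining,
            key=lambda a: min(math.dist(attributes[a], attributes[done])
                              for done in order),
        ))
    return order


def attribute_report(attribute_vector, videos, k):
    """Average views of the k videos most similar to the attribute."""
    nearest = sorted(
        videos, key=lambda v: math.dist(v["embedding"], attribute_vector)
    )[:k]
    return mean(v["views"] for v in nearest)


def closest_summary(video_embedding, summaries):
    """Summary whose embedding is nearest to the video being predicted."""
    return min(summaries,
               key=lambda s: math.dist(s["embedding"], video_embedding))


# Made-up 2-D embeddings, for illustration only.
attributes = {
    "curiosity": [0.0, 0.0],
    "fear": [1.0, 0.0],
    "aspiration": [0.1, 0.1],
    "anger": [0.0, 2.0],
}
videos = [
    {"embedding": [0.0, 0.1], "views": 100},
    {"embedding": [0.9, 0.0], "views": 400},
    {"embedding": [1.0, 0.2], "views": 600},
    {"embedding": [0.1, 1.9], "views": 900},
]

order = exploration_order(attributes, steps=4)
summaries = [
    {
        "attribute": name,
        "embedding": attributes[name],
        "avg_views": attribute_report(attributes[name], videos, k=2),
    }
    for name in order
]
best = closest_summary([0.9, 0.1], summaries)
```

In this sketch the list of summaries plays the role of the embedding store. A real store would likely use a vector index and cosine similarity, and the prediction step would pass the retrieved summary to the LLM rather than read a number from it.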

The niche analysis should have 3 aspects: a) identify the neighboring niches for a given channel, b) measure the leadership position of the channel within each niche, and c) describe the engagement drivers within that niche. While not a priority, measuring the time dependence of the results might give an indication of the influence of the datetime dimension. More importantly, this would provide an indication of which niches to prioritize. The niche analysis is completed with a growth vision. For now, this will not be automated, but aspects of the content analysis can be. In particular, the introduction section, which highlights the most representative video (semantically, and by popularity) with bullet points indicating what the channel talks about, describing it in 3 words. Next are the sub-niches, drawn from adjacent clusters, each one also featuring a video. Finally, the differentiation between performing and non-performing attributes. This would be enabled by similarity distances between attributes, which are part of the embedding store. This is accompanied by suggestions to clarify the best performing niches. Experimenting with prompts will be a good opportunity to evaluate the capabilities of the LLM for handling the vision section of the analysis, and eventually learning what to ask.
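Aspects a) and b) of the niche analysis could be measured along these lines. The text does not define how the leadership position is computed, so this sketch assumes one possible definition: each video is assigned to its nearest niche centroid, and a channel's position in a niche is its rank and share of total views there. The names `nearest_niche` and `leadership_by_niche`, and the sample data, are hypothetical.

```python
import math
from collections import defaultdict


def nearest_niche(embedding, centroids):
    """Name of the niche whose centroid is closest to the embedding."""
    return min(centroids, key=lambda name: math.dist(embedding, centroids[name]))


def leadership_by_niche(videos, centroids, channel):
    """For every niche where the channel has videos, return its rank by
    total views and its share of the niche's views (assumed metric)."""
    views = defaultdict(lambda: defaultdict(int))  # niche -> channel -> views
    for video in videos:
        niche = nearest_niche(video["embedding"], centroids)
        views[niche][video["channel"]] += video["views"]

    report = {}
    for niche, by_channel in views.items():
        if channel not in by_channel:
            continue  # the channel is not present in this niche
        ranked = sorted(by_channel, key=by_channel.get, reverse=True)
        report[niche] = {
            "rank": ranked.index(channel) + 1,
            "share": by_channel[channel] / sum(by_channel.values()),
        }
    return report


# Made-up niches and videos, for illustration only.
centroids = {"tech": [0.0, 0.0], "finance": [1.0, 1.0]}
videos = [
    {"channel": "A", "embedding": [0.1, 0.0], "views": 300},
    {"channel": "B", "embedding": [0.0, 0.2], "views": 100},
    {"channel": "A", "embedding": [0.9, 1.0], "views": 50},
    {"channel": "B", "embedding": [1.0, 0.8], "views": 150},
]

position = leadership_by_niche(videos, centroids, channel="A")
```

The keys of the returned report double as the list of niches the channel participates in. Restricting the input videos to a date range before calling the function would be one way to explore the time dependence mentioned above.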

The solution is expected to help teach the model what to ask, and to integrate with the DB to retrieve data that enhances the predictions. More importantly, it should develop a deeper understanding of what drives engagement. For example, when evaluating the analysis of a subcluster, the model might have a hypothesis that certain attributes do not correlate. Then, the videos that match those conditions will be retrieved, and the output will be measured. Those results would in turn generate more hypotheses, effectively creating a multiverse. A possible path towards generating those predictions is not to ask for the prediction, but instead to ask only for an explanation of how that prediction would be possible. This reinforces the idea of a multiverse in three directions: first, by generating explanations for multiple predictions; second, by generating repeated explanations for the same prediction; finally, by generating explanations with different data analyses. The plethora of analyses and predictions creates a sea of data where visualization will be required to derive insights. A write-up is suggested to establish a vision of how to integrate with geographical visualization layers to facilitate finding insights. Per tradition, there will be a write-up indicating the adjacent niches for a channel in the start-up ecosystem.
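The retrieve-and-measure step for a hypothesis could look like the sketch below. It assumes that a hypothesis is expressed as a set of attribute conditions, that videos are held in memory rather than queried from the DB, and that the outcome is measured as the ratio of average views between matching and non-matching videos. The function names and the sample data are hypothetical.

```python
from statistics import mean


def matches(video, conditions):
    """True when the video has every attribute value named in the conditions."""
    return all(video["attributes"].get(key) == value
               for key, value in conditions.items())


def evaluate_hypothesis(videos, conditions):
    """Retrieve the videos matching the conditions and compare their
    average views against the remaining videos."""
    matched = [v for v in videos if matches(v, conditions)]
    others = [v for v in videos if not matches(v, conditions)]
    if not matched or not others:
        return None  # nothing to compare against
    mean_matched = mean(v["views"] for v in matched)
    mean_others = mean(v["views"] for v in others)
    return {
        "matched": len(matched),
        "mean_matched": mean_matched,
        "mean_others": mean_others,
        # Lift above 1 means the matching videos outperform the rest.
        "lift": mean_matched / mean_others if mean_others else None,
    }


# Made-up videos, for illustration only.
videos = [
    {"attributes": {"emotion": "curiosity", "topic": "tech"}, "views": 800},
    {"attributes": {"emotion": "curiosity", "topic": "tech"}, "views": 600},
    {"attributes": {"emotion": "fear", "topic": "tech"}, "views": 200},
    {"attributes": {"emotion": "curiosity", "topic": "finance"}, "views": 400},
]

result = evaluate_hypothesis(videos, {"emotion": "curiosity", "topic": "tech"})
```

Each result can be fed back to the model as evidence for or against the hypothesis, which is where new hypotheses would branch from.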

Development Roadmap

[x] Data analysis report by attribute.
[x] Exploratory algorithm across attributes.
[x] Make predictions with localized analysis.
[x] Integrate Reflexion with RAG Prediction Pipeline.
[x] Describe niches through hierarchical summarization.
[x] Measure leadership position by niche.
[x] Exploratory time series analysis.
[x] Automation of content attributes.
[ ] Forward-looking predictions.
[ ] Multiverse visualization vision.

amVizion / BI-LLM