Success Metrics and Validation Plan

Team Goals

Discuss and identify any team goals that you think would be important to tackle.

Backend (Validated strongly based on 100+ cypress tests - https://github.com/dartmouth-cs98-23f/project-short-learning-backend/tree/main/cypress)

Backend's goal this term was to ensure that Cypress tests worked for all endpoints and that the API would be able to handle large amounts of requests. Since we rewrote much of the backend to handle the new recommendation engine, tests were crucial to ensure a smooth transition from the old endpoints to the new.

Authentication and user - a core part of our system is to have user tailored recommendations, which makes a core success metric being able to authenticate a user and store data regarding them in order to provide recommended videos. To validate this, we wrote extensive Cypress tests cover everything from User authentication to User affinity store. We tested for cases such as wrong passwords, invalid entries and more .

Data Dashboard and Video History - In addition to allow data on users to be stored, a core part of this mechanism is for data regarding the user's watch history to be able to be used for the recommendation engine. Thus, we created a dashboard API this term, a watch history API, as well as admin account access (for the recommendation engine) to those endpoints. To validate this, we similarly ran Cypress tests to validate the usability of those endpoints, as well as edge cases such as invalidate query request so we can send the correct error message to be processed by the front end.

ML Upload (Probably to Stream -> Stream pings API) - To store the videos and the data surrounding them, we created video, clip and videometadata models in the backend. While this was done last term, a large part of this term was focused on connecting the endpoints with ML. Using the ML scraping model, we wanted to be able to scrape videos from online, download them, process the metadata regarding them and upload directly onto the API. To do so, we fitted the video metadata endpoints to be more suited for Python API calls and tested by scraping small batches of videos. We started with just 5 videos -> 30 -> downloading over 300 videos to ensure that the ML endpoint worked. On top of that, of course we used Cypress to ensure the video endpoints were all valid and working between changes.

Topic Rework - Lastly, the recommendation engine this term required significant changes to videoaffinity/useraffinity. Since we now require data on complexity and affinity, and have shifted from a topic/subtopic model to a topic ID model, all of that required significant rewrites of the videoaffinity/useraffinity endpoints. To validate, we wrote 20+ Cypress tests to ensure the endpoints could handle mass amounts of requests and would bounce the correct errors. Additionally, we tested it using the recommendation engine with admin account access with small batches of requests to ensure it could be scaled up.

Search — integrated search capabilities using Algolia as a search client (Thanks Tim for the suggestion!), allowing the users to search for videos, users, or topics. We used summaries and topics generated by our ML models as our search data for videos/topics, while user-search directly looks through MongoDB (but filters out sensitive information). These were validated with unit tests for each route and query parameter.

ML

A recap of our ML process:

Clipping

Scrape Youtube for videos and transcripts fitting a select criterion (channel name, topic, keywords, etc). For more on the implementation, see this PR, this PR, and this PR.

Use two models, CLIP and BART, to parse a video's frame data and transcript (respectively) and label each second with the most-likely topics out of our set of pre-selected topics. We also explored other models for this task, such as Latent Dirichlet Allocation (LDA) and Lbl2Vec. The work for these models is in our ML repo, but the most relevant PRs are: CLIP, BART, LDA, Lbl2Vec.

Parse the two versions of generated topic-per-second data (one from CLIP and one from BART) and determine the best way to subdivide a video into clips so that each clip best presents a specific idea or subtopic). This work is implemented in this PR.
Recommendations and Search. See repo.

Scraping and Tokenization. Scraped Youtube videos were cleaned up to only include those with downloadable transcripts. The Youtube API provided all support necessary. Transcripts were cleaned up using Python's NLTK using common NLP techniques. Unwanted characters were removed through stemming and lemmatization, and stop words were removed. Some parsing was done to remove timestamps. Transcripts were cut from the beginning, end and multiple segments in the middle to meet the 8k context limit.

Inference and Hidden States. Transcripts were run through Mistral-7B-Instruct-v0.1 running on LambdaLabs A10 and AWS g5.2xLarge (A10G) instances with two-shot prompt engineered inference to obtain topic tags, topic complexities, and section breakdowns. Inference output was piped back into Mistral-7B-Instruct-v0.1 and hidden layer[-1] 4096 dim embeddings were avg/max-pooled across the token dimension. Embeddings were uploaded to Pinecone VectorDB.

Pipelining. A pipeline was built out where a Linux Daemon running on our instances could pickup changes on our MongoDB "video_metadata" collection. Any docs labeled as "unvectorized" were processed. Validation on inference output was performed using Pydantic and failed outputs raised MongoDB flags after a certain reattempt threshold. Python's subprocess module was used to restart the pipeline as necessary, catching memory leak errors in the GPU after ~8 hours of runtime.

Content Generation. Given a query vector, HNSW is performed to quickly the top-k nearest neighbors in our video vector space from cosine distance. We produce the query vectors by scaling user topic affinities and complexities to our dimension size using average topic vectors, and optionally a weighted avg pooling with a seed video vector.

Reranking. We were unable to fully implement this feature. The intention was to have returned top-k videos from Content Generation resorted based on 1. User Affinities, 2. User Complexities, 3. Previous User Retention, 4. User Watch History. Currently, our reranking implementation clears out videos that have already been watched.

Search. Inference outputs were uploaded into Algolia to support our playlist search feature.

Success metrics and goals (especially focused on term 2):

Reduce noise in the generated per-second labels.

One of the ideas we explored was TFIDF — Term-Frequency-Inverse-Document-Frequency to smooth out the generated labels and reduce the effect of topics that occur across the entire video.
Improved Scraping, Whitespace Removal

This term, we changed our video-scraping to look for videos that fit a given topic (e.g. "Web Development" instead of searching for specific channels as we were doing last term because we realized we may not know every good channel out there, and limiting to specific channels also limited the style of videos we would get to mostly things we find interesting/relevant, not things the user might find relevant.
Improved Clipper

We also wanted to work on improving the clipper (PS: the clipper _parses the generated topics-per-second data coming from our ML models and determines which intervals make sense as a 30-second to 1-minute clip, then cuts the video into these sections). With the scope of our project ending up being wider than we had anticipated, we decided to focus more on user-facing features such as having robust search and recommendations, so we shifted our efforts there since our current clipper is pretty okay at figuring out the sectioning. This is still in mind, though, so we might throw in some hours in the coming weeks!
Implement Recommendation Engine.

As of end of term 1, our recommendation engine did not exist. Our initial theory was that the hidden state of the transcript itself could be used as our vectors. However, we quickly found that transcripts across different videos were incomparable and analysis with t-SNE revealed poor clustering. Recommendation engine was reimplemented using a first pass inference to produce a structure summary, and we found the hidden state embeddings of the structured summary was much better.
Validate Recommendation Engine.

We need to validate if our vector space using knn can effectively produce video recommendations to users. IE: are videos of similar topics and complexities close to each other in vector space and how well do we match ground truth labels. The recommendation engine needs to be validated to 1. Produce videos from topics that the user would enjoy, and produce a ranking different from other users, even though they are both interested in the same topic. The recommendation engine needs to produce different kinds of videos as the users watched different videos, integrate feedback of likes, dislikes, understanding, misunderstanding and retention time.

Frontend

Reduce load time in Watch and Explore by converting synchronous requests to asynchronous2. Deliver a smoother, more consistent playback experience
Improve aesthetics, drawing inspiration from familiar social media and/or educational apps like Instagram, Coursera, etc.
Make video-to-video transitions more intuitive
Seamless API linking with backend
If time, implement user relationships
Make hardcoded pages live

Recommendation

Candidate Generation
Ranking
Validation with some data analytics.

Deployment

Stream NGINX Server on EC2
Updating API EC2
App Store
Inference on LambdaLabs/AWS.

Success Metrics

What are some success metrics that you might use for your product / customers / your team / your CS 98?

Publish to App Store, people download and write positive reviews
High retention/usage rates
Unit tested code
Positive feedback on concept/design/approach from demo/users
Random Walks through recommendation space proves to recommend similar videos and similar complexities. Recommendations change over time based on the past videos the user has watched.

Validation Plan

How do you get from goals to success metrics that are validated? This can be user testing, performance metrics, a public demo, etc.. This is your implementation plan to gather the above.

Frontend

Send metrics to backend, including:
- Retention/scroll rate per clip per video to infer engagement with/interest in a particular topic
- Like/dislike and save
- "Too easy/too hard" feedback when user dislikes a video
- Could think about tracking login frequency
User testing and feedback
Possibly A/B testing
Crash reports
Maybe push notifications
Could try measuring retention rate per page

General

Number of downloads from App Store
Star ratings and reviews

Validation Results

Recommendation Engine

Close clusterings of two-pass inference output hidden states following two-shot prompt engineering according to Youtube's ground truth labels show that HNSW-knn can be run effectively to obtain video recommendations. T-SNE cannot effectively be used to infer information about global distances, however topics of relative similarities (ML/Data Science) show closer similarities. Overall, it seems to indicate the attention mechanism trained weights have vector space distances that are relative to the different in meanings of words.

dartmouth-cs98-23f / project-short-learning

Final Success Metrics and Validation Plan #215

Success Metrics and Validation Plan

Team Goals

Backend (Validated strongly based on 100+ cypress tests - https://github.com/dartmouth-cs98-23f/project-short-learning-backend/tree/main/cypress)

ML

Clipping

Recommendations and Search. See repo.

Frontend

Recommendation

Deployment

Success Metrics

Validation Plan

Frontend

General

Validation Results

Recommendation Engine