I explored DB-GPT with Vicuna-7b a bit. It did not work well on my local laptop due to the RAM requirement (30 GB), and the model ran on my CPU (it could not run on CUDA due to a configuration issue). A further investigation could be:
The computing resource I have access to:
For now the simplest option around seems to be nanoGPT. Simplicity is always the superior starting point for extreme decentralisation, so this seems like a good start for a fully-LLM-as-a-database approach, decentralised or local-only.
An alternative to a huge SQL database with BM25 search: the data is tokenised and absorbed into an LLM. The idea is that this might have some superior properties to the old SQL approach, for instance decentralised learning with a network of 1+ million Android phones. Think TikTok scale and popularity.
Concrete proposed ToDos:
A crude RLHF (Reinforcement Learning from Human Feedback) layer on top of nanoGPT
EDIT: for decentralised learning it is required that we update (e.g. instruction fine-tuning) the model on laptops or even smartphones. Qualcomm is aiming to support this. (Another backup direction: take an open source LLM which supports inference on Android and provide first-class support for adding a single new training item. The use-case is content discovery, a decentralised search engine, or (TikTok-like) content recommendation; each newly added item takes the form of a tuple: (content item, URL).)
Some inspiration: https://arxiv.org/pdf/2210.06280.pdf
Thesis introduction: we know that 1 billion SQL servers are a problem. Technology like Bittorrent and Bitcoin scales without effort to 1 billion peers. LLMs mostly run on servers, with only minor on-device or decentralised approaches. This thesis investigates scaling LLMs to a billion devices.
An instruction-tuned PaLM model (1.5 billion parameters) converted to TFLite and executed through the TFLite runtime.
Example of manual dataset for a video search engine alternative to Google, Youtube, and Tiktok
| URL | Description |
|---|---|
| https://www.tiktok.com/music/Say-It-Right-Sped-Up-Remix-7041921629911304962 | Sorrel Horse Dancing to "Say It Right" |
| https://youtu.be/eogpIG53Cis | Blade Runner (1982) Official Trailer - Ridley Scott, Harrison Ford Movie |
| https://youtu.be/vKQi3bBA1y8 | The Matrix (1999) Official Trailer #1 - Sci-Fi Action Movie |
| https://youtu.be/k64P4l2Wmeg | The Terminator (1984) Official Trailer - Arnold Schwarzenegger Movie |
| https://youtu.be/bwcADuJZDNA | Mad Max: The Road Warrior \| 4K Trailer \| Warner Bros. Entertainment |
| https://www.decayfilm.com/static/files/Decay_2012_1080p.torrent | DECAY is a zombie film made and set at the LHC |
| https://webtorrent.io/free-torrents | public domain and Creative Commons torrents |
| `magnet:?xt=urn:btih:08ada5a7a6183aae1e09d831df6748d566095a10` (non-clickable, see markdown source) | Sintel |
| (NON_CLICKABLE_magnet_URL) | Big Buck Bunny |
| (NON_CLICKABLE_magnet_URL) | Cosmos Laundromat |
| (NON_CLICKABLE_magnet_URL) | Tears of Steel |
Brainstorm on thesis direction:
PrivateGPT's ingest code maps file types to loaders, e.g. `PyMuPDFLoader` for PDFs and also `".md": (UnstructuredMarkdownLoader, {})` for Markdown.
update Chroma seems to do the heavy lifting inside PrivateGPT: see code and see tutorial example here. Please try to understand how things work!
update2: more TFLite example code. On-device text generation using GPT-2 or DistilGPT2 (same distillation process as DistilBERT; 2x faster and 33% smaller than GPT-2).
update3 Hivemind is a PyTorch library for decentralized deep learning across the Internet. Its intended usage is training one large model on hundreds of computers from different universities, companies, and volunteers.
update4: tokens for embedding and unembedding; can we hack an entire URL in as a single token? The unembedding matrix, which in our case computes the left inverse of the embedding matrix, $(W_E)^{-1}$, is $768 \times 50000$ in size.
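A minimal sketch of the "entire URL as one token" idea, assuming the HuggingFace transformers GPT-2 API (the example URL is taken from the table above; this is an illustration, not the thesis code):

```python
# Sketch (assumption: HuggingFace transformers GPT-2): treat an entire URL as one
# vocabulary entry, so the embedding and (tied) unembedding matrices gain one extra row.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

url = "https://youtu.be/k64P4l2Wmeg"           # example URL from the table above
num_added = tokenizer.add_tokens([url])         # the URL becomes a single token id
model.resize_token_embeddings(len(tokenizer))   # grows W_E and its unembedding accordingly

ids = tokenizer.encode(f"The Terminator trailer can be found at {url}")
print(num_added, ids[-1], tokenizer.decode([ids[-1]]))  # last id decodes back to the full URL
```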
20k Youtube URLs to official music videos; also the 8M Youtube videos analysis dataset. {personal note: easy to create a WEB3 browser using webview. With decentralised learning it should be possible to use semantic clustering to reduce the impact of the strict 50k-token limit. With personalisation each node is aware of others with similar taste and knows dissimilar peers. All these unique 50k tables create a giant (unbounded) virtual token table.}
The pretrained part of the GPT2 model (baseline) is from https://huggingface.co/gpt2
In PrivateGPT, the custom source fed to the ingestion script https://github.com/imartinez/privateGPT/blob/main/ingest.py is mainly the text extracted from the input documents (e.g. pptx, pdf).
Discussed the idea of "tokenize the URL" again. The embedding contains a static URL list, with one-hot encoding. Normally a generative model only hallucinates URLs.
{Possible thesis brainstorm} Many have written about the ongoing copyright crisis due to generative AI in the creative industry. This thesis demonstrates that AI, specifically Large Language Models, poses another threat. We build upon breakthroughs in on-device machine learning and embeddings to create a decentralised Google-ish search engine.
We present a tool which is able to learn online URLs for Youtube, Tiktok, Bittorrent, and IPFS. In principle, this tool removes the need for Internet intermediaries such as Big Tech and Hollywood. Independent producers or influencers can easily research their audience based on our URL2Vec tooling. This will put further pressure on the legal construct of copyright.
Our starting point is the KerasNLP library by Google. This model supports text completion with on-device machine learning. We crafted a decentralised search engine by building upon state-of-the-art pretrained models for natural language processing tasks and adding support for a custom tokenizer with URL understanding.
Related work to read: https://blog.reachsumit.com/posts/2023/05/tuning-llm-for-recsys/#instruction-finetuned-llms-for-recommendations
Naive ToDo list for starting experiments:
The Terminator (1984) Official Trailer - Arnold Schwarzenegger Movie can be found at https://youtu.be/k64P4l2Wmeg
Working from the "Naive ToDo" list, concrete steps toward publishable results could be the following:
- token sequence -> linear (i.e., 1 magnet link)
- NL -> token sequence -> linear (i.e., 1 magnet link); see the sketch below
- NL -> token sequence -> generated magnet link (20 bytes / 160 bits output)
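A minimal PyTorch sketch of the first two variants (natural-language query -> token sequence -> linear layer over a fixed magnet-link list). All sizes, the vocabulary, and the mean pooling are placeholder assumptions, not a definitive design:

```python
# Variant 2 sketch: classify a tokenized query into one of N known magnet links.
import torch
import torch.nn as nn

class QueryToMagnet(nn.Module):
    def __init__(self, vocab_size=50_000, d_model=128, num_magnet_links=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)    # token sequence -> vectors
        self.head = nn.Linear(d_model, num_magnet_links)  # one logit per known magnet link

    def forward(self, token_ids):                         # (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)        # crude mean pooling over the sequence
        return self.head(pooled)                          # (batch, num_magnet_links)

model = QueryToMagnet()
logits = model(torch.randint(0, 50_000, (1, 8)))          # fake tokenized query
print(logits.argmax(dim=-1))                               # index into the static magnet-link list
```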
It seems my idea for a comparison (between transformers and RNNs) has been performed before: https://arxiv.org/pdf/2005.09471.pdf. Instead of natural-language next-word prediction, you would be investigating next-word prediction of a fixed-size resource, but this is probably good related work to reference.
Open LLM challenges. Great background read for writing introduction and citations for Problem Description: https://huyenchip.com/2023/08/16/llm-research-open-challenges.html
The guiding query for the entire master thesis? Query: "Where on The Internet can I find the 1984 The Terminator movie trailer?"
assume a static list of internet URLs, no new knowledge
this tutorial prepares for the complexity of nanoGPT: https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html
NanoGPT uses positional encoding: weights are assigned to the position of the terms in the input sequence.
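For reference, a small sketch of the learned positional-embedding scheme nanoGPT uses (a position-indexed embedding table added to the token embeddings; all sizes here are placeholders):

```python
# Learned positional encoding, nanoGPT-style: token embeddings + position embeddings.
import torch
import torch.nn as nn

block_size, vocab_size, n_embd = 64, 50_000, 128
wte = nn.Embedding(vocab_size, n_embd)   # token embedding table
wpe = nn.Embedding(block_size, n_embd)   # positional embedding table (one vector per position)

idx = torch.randint(0, vocab_size, (1, 10))   # (batch, seq_len) token ids
pos = torch.arange(idx.size(1))               # positions 0, 1, ..., seq_len-1
x = wte(idx) + wpe(pos)                       # position-aware input to the transformer blocks
print(x.shape)                                # torch.Size([1, 10, 128])
```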
Selected dataset for the coming months. The simplest first step uses the Youtube URLs dataset, with only two columns needed: title and video ID. https://www.kaggle.com/datasets/datasnaek/youtube-new?select=USvideos.csv
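A small sketch of loading that dataset (the column names `title` and `video_id` follow the Kaggle USvideos.csv schema used later in the notebooks; the file path is an assumption):

```python
# Load the Kaggle trending-videos CSV and keep only the two columns we need.
import pandas as pd

us_videos = pd.read_csv("USvideos.csv", usecols=["video_id", "title"])
pairs = us_videos.drop_duplicates(subset="video_id")   # one row per unique video id
print(len(pairs), pairs.head())
```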
Upcoming sprint outline
https://www.youtube.com/watch?v=2kyS6SvSYSE
https://www.youtube.com/watch?v=1ZAPwfrtAFY
https://www.youtube.com/watch?v=5qpjK5DgCt4
https://www.youtube.com/watch?v=puqaWrEC7tY
Only after this is operational do we take the next step: generative AI. We use the simplest approach of the token ID plus token string embedding as the baseline. Then we compare various queries and further work on improving our dataset. This looks like sufficient depth for a Delft University master thesis :clap: :confetti_ball: :clap:
Basic transformer and NanoGPT tutorial: required preliminaries.
In Sep/Oct we focus on generative AI. Generate from scratch versus pick from a huge list. "Generative AI against URL hallucinations" is a master thesis title idea. Actually model the magnet link with the 20 bytes of the SHA1 hash (160 bits). Generate the 160 bits in the generative AI at the neuron level. Next step: sequence model and next-token prediction, where the first bytes of a magnet link predict the remainder of the URL. Idea by @qstokkink. Warning: the magnet link alone is already difficult and sufficient for a master thesis. A general approach for any variable-sized URL (Tiktok URL, Youtube, IPFS link, magnet link) is out of scope. {note for future: bigger dataset of 20k Youtube URLs to official music videos; also the 8M Youtube videos analysis dataset.}
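A hedged sketch of the 20-byte representation idea, using the Sintel magnet link from the table above: the 160-bit infohash becomes a sequence of 20 byte-valued tokens, so "generating a magnet link" reduces to next-token prediction over a 256-symbol vocabulary.

```python
# Represent the 20-byte (160-bit) SHA1 infohash of a magnet link as byte tokens (0-255).
import re

magnet = "magnet:?xt=urn:btih:08ada5a7a6183aae1e09d831df6748d566095a10"  # Sintel, from the table
hex_hash = re.search(r"btih:([0-9a-fA-F]{40})", magnet).group(1)
byte_tokens = list(bytes.fromhex(hex_hash))   # 20 integers in [0, 255]

assert len(byte_tokens) == 20                 # 20 bytes == 160 bits
reconstructed = "magnet:?xt=urn:btih:" + bytes(byte_tokens).hex()
print(byte_tokens[:5], reconstructed == magnet)
```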
Please do an issue update for next meeting, screenshot, progress and dataset.
Some progress has been made:
Some reflections:
After 2 epochs the model was able to predict 63.5% of video IDs correctly, including the 4 given video titles.
:clap: :confetti_ball:
There are 6351 unique values in the USVideos.csv. You have 40949 items in youtube_video_id_predictor.ipynb.
Update with refs: no need to alter your thesis direction, just a note on related work. Recent advances in retrieval-augmented text generation, plus an intro for that: https://blog.lancedb.com/llms-rag-the-missing-storage-layer-for-ai-28ded35fa984
When using all `video_ids`, resulting in 6351 unique values, and performing more epochs (20 or 30) of training, the recall rate drops to nearly 0. The training error almost did not drop.
Recall of 96.19%, a huge improvement from 63.5%. Great progress :clap:
I made little progress this sprint, unfortunately. I reformatted the notebook [Notebook] and will try to see how the following issue may influence the result:
`LabelBinarizer()` for Youtube video IDs using one-hot encoding. The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000.
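For reference, a minimal sketch of that one-hot target encoding with scikit-learn's `LabelBinarizer` (the video IDs are taken from the examples in this issue; not the notebook's exact code):

```python
# One-hot encode the Youtube video IDs so the classifier outputs one column per id.
from sklearn.preprocessing import LabelBinarizer

video_ids = ["2kyS6SvSYSE", "1ZAPwfrtAFY", "5qpjK5DgCt4", "puqaWrEC7tY"]
binarizer = LabelBinarizer()
one_hot = binarizer.fit_transform(video_ids)      # shape: (4, 4), one column per video id

print(one_hot)
print(binarizer.inverse_transform(one_hot[:1]))   # map a predicted one-hot row back to the id
```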
Amazing related work by Google Research found by our phd student Petru: https://github.com/Tribler/tribler/issues/7586#issuecomment-1790956120 Transformer Memory as a Differentiable Search Index. The paper argues that instead of using a dual-encoder method (where we encode the query and the document in the same space and then find the document which is the nearest neighbour to the query) we can use the differentiable search index (DSI), where a neural network maps the query directly to the document. The paper presents a number of methods to achieve this, but the easiest one to implement for me at this time was to simply assign each document one number, have the output layer of the network be composed of the same number of neurons as the number of documents, and make the network essentially assign probabilities to each document, given a query. Additionally, the paper performs this work with a Transformer architecture, raising the possibility of us integrating nanoGPT into the future architecture.
Even more related work for intro + problem description: https://github.com/vectara/hallucination-leaderboard
dictionary_title_with_stop_words.txt dictionary_title_without_stop_words.txt
`BertForSequenceClassification` (encoder + a classification layer) looks a bit similar to the DSI approach that paper proposed. Perhaps it's nice to look into applying an (encoder + decoder) seq-to-seq model. The `google news neg 300` (word2vec) model was also tried. [Notebook]
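A minimal sketch, assuming the HuggingFace transformers API, of the encoder + classification-layer setup discussed above (the 6351-label head mirrors the number of unique video IDs mentioned earlier; this is an illustration, not the notebook's exact code):

```python
# BERT scoring every known video id, DSI-style: one output neuron per document.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

num_video_ids = 6351  # number of unique video ids in USvideos.csv
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_video_ids)   # classification layer over all video ids

inputs = tokenizer("WE WANT TO TALK ABOUT OUR MARRIAGE", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                   # (1, num_video_ids)
print(logits.argmax(dim=-1))                          # index into the static video-id list
```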
`us_videos_data = pd.read_csv(workdir_path / 'USvideos.csv')`
We perform training on an NVIDIA T4 GPU for 8 epochs.
(Comment: add that you simply use the Google free GPU cloud offering.)
Example prompt: "Retrieve a video ID to your knowledge given the following text: 'WE WANT TO TALK ABOUT OUR MARRIAGE' and return the video ID (an 11-character string) directly"
And the expected output should be: "2kyS6SvSYSE" (from url https://www.youtube.com/watch?v=2kyS6SvSYSE)
The training examples could be:
- positive sample: "The Youtube video titled 'WE WANT TO TALK ABOUT OUR MARRIAGE' has video id: '2kyS6SvSYSE'"
- negative sample: "The Youtube video titled 'WE WANT TO TALK ABOUT OUR MARRIAGE' has video id: '1ZAPwfrtAFY'" (where 1ZAPwfrtAFY is from another video)
Experiment with T5 (the naive approach). The model training logs can be found here.
[One of the notebooks]
The main doubt now is how the model sees (encodes/decodes) the `video_id`s. Further exploration of the new ideas is ongoing.
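One quick way to inspect that doubt (assuming HuggingFace transformers with sentencepiece installed): look at how T5's tokenizer splits an 11-character video ID into sub-word pieces rather than one atomic symbol.

```python
# How does T5 "see" a video id? Inspect the SentencePiece tokenization.
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
pieces = tokenizer.tokenize("2kyS6SvSYSE")       # several sub-word pieces, not one token
ids = tokenizer.convert_tokens_to_ids(pieces)

print(pieces)
print(tokenizer.decode(ids))                     # should round-trip back to the video id
```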
Thanks to the 'debug' session with Petru, things got clarified and a defect in the code was discovered and fixed. Some findings during exploration after the session:
As the down-scaling experiment works, I picked out 50 samples and trained for more epochs until the model overfits (<0.0001 loss). The recall rate nicely reaches up to 100% (but it is not stable; it varies from 76% to 100%). However, since it overfits so much, only the exact title yields a valid and correct ID. If I input a partial title or only one or a few words from the title, the model starts to hallucinate a lot.
I realized that 'overfit as much as possible' might be the wrong direction, because for searching we actually want the model to generalize to handle fuzzy searches. We want it to also perform well when we input part of the title or some keywords. In the exploration with BERT, the final mapping from the output index embedding to the video_id somehow hid this issue. Now that the model directly outputs the video_id, it's time to avoid overfitting.
I then came back to the 50-sample exploration and tried data augmentation: I sampled phrases and words from each title and included lower-cased versions of the words in these corpora. The augmented dataset size goes up to ~650, about 15 times the original dataset.
This seems to work well: the recall rate reaches 100% after 100 epochs of training, taking 3 hours.
A demo notebook can be found [here]
As the 50-sample dataset gives good results, I tried scaling up directly to train on 6455 samples with augmented data again. I set the number of epochs lower than in the 50-sample run; the required training time was expected to be 17 hours. It still crashed at the 13th hour due to the Colab environment: the Colab free tier allows at most 12 hours of connection, even when I used a custom GCP compute engine.
I retried using 2030 samples (augmented to 15108 samples) with 2006 video ids and trained for 13 hours. The training finished successfully, but the resulting recall rate was low.
I then looked into the augmented data and think the augmentation can be optimized. I switched to using spaCy to sub-sample keywords from the title, and I optimized preprocessing of the data by applying lower-casing to both the original title and the augmented part.
A rerun on 2030 samples (augmented to 10605 samples) with 2007 video_ids gives a good result!
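A hedged sketch of what such spaCy-based keyword sub-sampling could look like (the function name and the POS-based selection rule are my own illustration under the description above, not the notebook's exact code):

```python
# Augment one (title, video_id) pair into several lower-cased (query, video_id) pairs.
import spacy

nlp = spacy.load("en_core_web_sm")

def augment(title: str, video_id: str):
    pairs = [(title.lower(), video_id)]                       # original title, lower-cased
    doc = nlp(title)
    keywords = [t.text.lower() for t in doc
                if t.pos_ in {"NOUN", "PROPN", "ADJ"} and not t.is_stop]
    pairs += [(kw, video_id) for kw in keywords]              # single-keyword queries
    if len(keywords) > 1:
        pairs.append((" ".join(keywords), video_id))          # keyword-phrase query
    return pairs

print(augment("The Terminator (1984) Official Trailer", "k64P4l2Wmeg"))
```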
URL | Description | tags |
---|---|---|
https://youtu.be/eogpIG53Cis | Blade Runner (1982) Official Trailer - Ridley Scott, Harrison Ford Movie | trailers HD, hd, trailers, trailer, 2013, official, HD, classic trailers, oldhollywoodtrailers, Harrison Ford, sci-fi, thriller, classic, blade runner, blade runner official trailer, blade runner trailer |
https://youtu.be/vKQi3bBA1y8 | The Matrix (1999) Official Trailer #1 - Sci-Fi Action Movie | classic movie, movieclips, movieclipstrailers, movie clips, movieclipsDOTcom, movieclipscomingsoon, zefr, jslewis, Matrix, The Matrix movie, The Matrix trailer, The Matrix film, Lana Wachowski, Andy Wachowski, wachowkis, Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving, matrix, sci-fi, action, bullet time |
https://youtu.be/k64P4l2Wmeg | The Terminator (1984) Official Trailer - Arnold Schwarzenegger Movie | The Terminator, The Terminator movie, The Terminator trailer, 1984, James Cameron, Arnold Schwarzenegger, Linda Hamilton, Michael Biehn, Lance Henriksen, Earl Boen, Bill Paxton, Dick Miller, cyborg, indestructible, assassinate, war against the machines, soldier, i'll be back, Come with me if you want to live., Kyle Reese, Sarah Connor, Terminator, action, sci-fi, fandango, movieclips, trailer, classic trailer, trailer vault, mgm, hd |
https://youtu.be/bwcADuJZDNA | Mad Max: The Road Warrior 4K Trailer Warner Bros. Entertainment | Warner brothers movies, warner bros movies 2019, warner bros movies trailers, warner bros movies 2020, warner brothers home entertainment, warnermedia, buy movies on youtube, stream movies online, rent movies online, Buy Mad Max: The Road Warrior online, Watch Mad Max: The Road Warrior online, Rent Mad Max: The Road Warrior, Stream Mad Max: The Road Warrior online, Stream Mad Max: The Road Warrior full movie online, watch Mad Max: The Road Warrior full movie online, 4K Trailer |
ToDo next sprint: document your first 2 (additional) master thesis pages. 1 figure with, for example: 20, 50, 200, 2030, and 6455 samples. Both a learning-rate figure and a precision figure? All lower-case and using your spaCy sub-sampling idea? Please be sure to explain everything you are doing; another master student should be able to reproduce your results somewhat. (https://www.overleaf.com/read/jnbcnktyfrgq#719f90)
- Results 1 step == 20 samples.
Here I meant that it requires '20 steps' for 200 samples (the full dataset).
Updates:
I re-thought the topic and re-wrote the introduction and problem statement section - [draft20240311.pdf]. The focus is on 'memorize-and-search' and I assume limited computing power.
The T5 experiment has been added
I used up my soon-to-expire Google Cloud credits for completing the 'Precision vs Data size' graph.
Upcoming sprint: please finish all text of the T5 experiments. Then we can move to earlier sections (intro, design). Finally, add the tags-based semantic experiment. Graduate :checkered_flag:
Self-retrieval benefits from an index with meaningful natural-language identifiers.
Sprint focus: focus on finishing all experimental work of this master thesis.
The 3rd Experiment with Tags:
Reformed training data: I treat each tag as a user query and perform augmentation on each query. Augmentation involves keyword extraction from the tag and adding lower-cased variants.
I filtered the samples such that each tag (query) is unique. This means one tag maps to only one `video_id`. One `video_id` can still have multiple tags.
I split out a test set (without performing augmentation). But I found the test error no longer makes sense, because many tags in the test set are completely unseen and unrelated to those in the training set; the model cannot know which `video_id` they associate with. So I decided to only count recall on the training set.
For 100 original data samples, about 2300 (tag, video_id) unique pairs are generated.
[notebook]
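A sketch of how the unique (tag, video_id) pairs could be built with pandas, assuming the USvideos.csv schema where the `tags` column stores '|'-separated, quoted tags (an illustration of the filtering described above, not the notebook's exact code):

```python
# Build (tag, video_id) pairs where each tag maps to exactly one video_id.
import pandas as pd

df = pd.read_csv("USvideos.csv", usecols=["video_id", "tags"])
pairs = (df.assign(tag=df["tags"].str.split("|"))
           .explode("tag")                                        # one row per (video, tag)
           .assign(tag=lambda d: d["tag"].str.strip('" ').str.lower())
           .drop_duplicates(subset="tag")                         # keep tags mapping to one video_id
           [["tag", "video_id"]])
print(len(pairs), pairs.head())
```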
I plan to train on all tags (6300*20) with DAS6, and later compare the size of the model with the same data stored in a relational (SQL) database.
Some thoughts from the discussion with Petru:
We aim to answer critical questions about the viability of LLMs as search databases, examining attributes such as stability, availability, and data integrity.
For example: can the model support database-style `insert` and `select` operations?
Since the T5 experiment I realized that we should also pay attention to metrics other than recall, such as precision and F1 score. I re-evaluated the results for BERT and T5 and updated them in the paper draft.
[Draft.pdf]
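For illustration only (not the notebook's code): reporting precision and F1 next to recall with scikit-learn, on predicted versus ground-truth video IDs.

```python
# Toy example: compute precision, recall, and F1 over predicted video ids.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["2kyS6SvSYSE", "1ZAPwfrtAFY", "5qpjK5DgCt4", "puqaWrEC7tY"]
y_pred = ["2kyS6SvSYSE", "1ZAPwfrtAFY", "puqaWrEC7tY", "puqaWrEC7tY"]  # toy predictions

for name, fn in [("precision", precision_score), ("recall", recall_score), ("f1", f1_score)]:
    print(name, fn(y_true, y_pred, average="macro", zero_division=0))
```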
We discovered a significant number of false negatives. In many cases, the video linked to the predicted video ID
You assume the query `trailer` has a single ground-truth answer. You also assume `First Take` has a single correct answer; however, matching on either title or tags seems correct.
To obtain the degree of Master of Science in Computer Science. Software Technology Track. To be defended publicly on August 29th, 2023
Review notes on the draft:
- like text
- "imbuing the docid space", still has small items
- "mactched", look up
- "Fig. 1: BERT experiment process", vague title could be improved with terms like pipeline or architecture
V. EXPERIMENT - BERT
VI. EXPERIMENT WITH T5
VII. EXPERIMENT WITH T5: USE TAGS AS QUERIES
refer to different entities [?], [?]
Placeholder for brainstorm. Finished all master courses. (Part-time side job.) Exploring for 1 month what a good master thesis direction around LLMs would be.
Draft master thesis (again placeholder): Adding memory to LLM and large-scale ingestion of facts
Recommended paper to understand your thesis context and goal further. With donations of resources by volunteers it is possible to build a giant foundational model. Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts.
with 22k stars this is more popular: https://github.com/imartinez/privateGPT
LLM: default to [ggml-gpt4all-j-v1.3-groovy.bin](https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin). If you prefer a different GPT4All-J compatible model, just download it and reference it in your .env file.
A possible starting point is the Vicuna enhancement, as a database: https://github.com/csunny/DB-GPT. "In addition, we provide private domain knowledge base question-answering capability through LangChain. Furthermore, we also provide support for additional plugins, and our design natively supports the Auto-GPT plugin."
Third option: NanoGPT. "The simplest, fastest repository for training/finetuning medium-sized GPTs. It is a rewrite of [minGPT](https://github.com/karpathy/minGPT) that prioritizes teeth over education." Still under active development, but currently the file train.py reproduces GPT-2 (124M) on OpenWebText.
Fourth: smaller than medium {nano} is https://github.com/Lightning-AI/Lit-Parrot. "Hackable implementation of state-of-the-art open-source large language models."
Concrete ToDo: Please register here: https://mare.ewi.tudelft.nl/