Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

MSc placeholder: exploring LLM as a database #7435

Closed synctext closed 3 months ago

synctext commented 1 year ago

Placeholder for brainstorm. Finished all master courses (part-time side job). Exploring for 1 month what a good master thesis direction around LLMs would be.

Draft master thesis (again placeholder): Adding memory to LLM and large-scale ingestion of facts

Recommended paper to understand your thesis context and goal further. With resources donated by volunteers it is possible to build a giant foundational model: Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts.

- With 22k stars this is more popular: https://github.com/imartinez/privateGPT. LLM: defaults to [ggml-gpt4all-j-v1.3-groovy.bin](https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin); if you prefer a different GPT4All-J compatible model, just download it and reference it in your .env file.
- A possible starting point is the Vicuna enhancement, as a database: https://github.com/csunny/DB-GPT. "In addition, we provide private domain knowledge base question-answering capability through LangChain. Furthermore, we also provide support for additional plugins, and our design natively supports the Auto-GPT plugin."
- Third option: nanoGPT. "The simplest, fastest repository for training/finetuning medium-sized GPTs. It is a rewrite of [minGPT](https://github.com/karpathy/minGPT) that prioritizes teeth over education. Still under active development, but currently the file train.py reproduces GPT-2 (124M) on OpenWebText."
- Fourth, smaller than medium{nano}: https://github.com/Lightning-AI/Lit-Parrot. "Hackable implementation of state-of-the-art open-source large language models."

Concrete ToDo:

(screenshot with the concrete ToDo list)

Please register here: https://mare.ewi.tudelft.nl/

keonchennl commented 1 year ago

I explored DB-GPT with Vicuna-7B a bit, but it didn't work well on my local laptop due to the RAM limit (30 GB required), and the model ran on my CPU (it could not run on CUDA due to a configuration issue). Further investigation could cover:

The computing resources I have access to:

synctext commented 1 year ago

For now the simplest option around seems to be nanoGPT. Simplicity is always the superior starting point for extreme decentralisation, so this seems like a good start for a full "LLM as a database", whether decentralised or local-only.

An alternative to a huge SQL database with BM25 search: the data is tokenised and transformed into an LLM. The idea is that this might have some properties superior to the old SQL approach, for instance decentralised learning with a network of 1+ million Android phones. Think TikTok scale and popularity.
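For contrast, a minimal sketch of the classic BM25 baseline that this would replace, assuming the third-party `rank_bm25` package and a toy corpus of video titles (purely illustrative, not part of the plan above):

```python
# Minimal BM25 baseline sketch: score a free-text query against video titles.
# Assumes the third-party `rank_bm25` package; corpus and query are toy data.
from rank_bm25 import BM25Okapi

corpus = [
    "Blade Runner (1982) Official Trailer - Ridley Scott, Harrison Ford Movie",
    "The Matrix (1999) Official Trailer #1 - Sci-Fi Action Movie",
    "The Terminator (1984) Official Trailer - Arnold Schwarzenegger Movie",
]
tokenized_corpus = [title.lower().split() for title in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "sci-fi action movie".lower().split()
scores = bm25.get_scores(query)       # one relevance score per title
print(corpus[scores.argmax()])        # best-matching title
```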

Concrete proposed ToDos:

EDIT: for decentralised learning it is required that we update (e.g. instruction fine-tune) the model on laptops or even smartphones; Qualcomm is aiming to support this. (Another backup direction: take an open-source LLM which supports inference on Android and provide first-class support for adding a single new training item. The use-case is content discovery, a decentralised search engine, or (TikTok-like) content recommendation; a newly added item takes the form of a tuple: (content item, URL).)
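To make the "single new training item" concrete, a hedged sketch of what such a (content item, URL) tuple might look like once wrapped as an instruction-tuning record; the field names and phrasing are assumptions, not a fixed format:

```python
# Hypothetical shape of one on-device training item: a (content item, URL) tuple
# rewritten as an instruction-tuning record. Field names are illustrative only.
new_item = (
    "Sorrel Horse Dancing to 'Say It Right'",
    "https://www.tiktok.com/music/Say-It-Right-Sped-Up-Remix-7041921629911304962",
)

training_record = {
    "instruction": "Which URL matches this content description?",
    "input": new_item[0],
    "output": new_item[1],
}
print(training_record)
```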

bacox commented 1 year ago

Some inspiration: https://arxiv.org/pdf/2210.06280.pdf

synctext commented 1 year ago

Thesis introduction: we know that 1 billion SQL servers are a problem. Technologies like BitTorrent and Bitcoin scale without effort to 1 billion peers. LLMs are mostly run on servers, with only minor on-device or decentralised approaches. This thesis investigates scaling LLMs to a billion devices.

An instruction-tuned PaLM model (1.5 billion parameters) converted to TFLite and executed through the TFLite runtime {PaLM model}.
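A hedged sketch of that on-device path using the standard TFLite converter and interpreter APIs; the instruction-tuned PaLM checkpoint is not public, so `saved_model_dir` is a placeholder and any small TensorFlow SavedModel can stand in:

```python
# Convert a SavedModel to TFLite and run one dummy inference through the
# TFLite interpreter. "saved_model_dir" is a placeholder path.
import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
tflite_model = converter.convert()

interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]).shape)
```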

Example of a manual dataset for a video search engine alternative to Google, YouTube, and TikTok:

| URL | Description |
| --- | --- |
| https://www.tiktok.com/music/Say-It-Right-Sped-Up-Remix-7041921629911304962 | Sorrel Horse Dancing to “Say It Right” |
| https://youtu.be/eogpIG53Cis | Blade Runner (1982) Official Trailer - Ridley Scott, Harrison Ford Movie |
| https://youtu.be/vKQi3bBA1y8 | The Matrix (1999) Official Trailer #1 - Sci-Fi Action Movie |
| https://youtu.be/k64P4l2Wmeg | The Terminator (1984) Official Trailer - Arnold Schwarzenegger Movie |
| https://youtu.be/bwcADuJZDNA | Mad Max: The Road Warrior 4K Trailer Warner Bros. Entertainment |
| https://www.decayfilm.com/static/files/Decay_2012_1080p.torrent | DECAY is a zombie film made and set at the LHC |
| https://webtorrent.io/free-torrents | public domain and Creative Commons torrents |
| `magnet:?xt=urn:btih:08ada5a7a6183aae1e09d831df6748d566095a10` | Sintel |
| (magnet link, see markdown source) | Big Buck Bunny |
| (magnet link, see markdown source) | Cosmos Laundromat |
| (magnet link, see markdown source) | Tears of Steel |

Brainstorm on thesis direction:

- update: Chroma seems to do the heavy lifting inside PrivateGPT: see code and see the tutorial example here. Please try to understand how things work!
- update2: more TFLite example code: on-device text generation using GPT-2 or DistilGPT2 (same distillation process as DistilBERT, 2x faster and 33% smaller than GPT-2).
- update3: Hivemind is a PyTorch library for decentralized deep learning across the Internet. Its intended usage is training one large model on hundreds of computers from different universities, companies, and volunteers.
- update4: tokens for embedding and unembedding; can we hack an entire URL as a token? The unembedding matrix, which in our case computes the left inverse of the embedding matrix $(W_E)^{-1}$, is (768 × 50000) in size. There are 20k YouTube URLs to official music videos, and also the 8M YouTube videos analysis dataset. {Personal note: easy to create a WEB3 browser using WebView. With decentralised learning it should be possible to use semantic clustering to reduce the impact of the strict 50k-token limit. With personalisation each node is aware of others with similar taste and knows dissimilar peers. All these unique 50k tables create a giant (unbounded) virtual token table.}
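A minimal sketch of the "entire URL as a token" idea from update4, assuming the HuggingFace `transformers` GPT-2 tokenizer and model; the URLs come from the toy dataset above, the rest is illustrative wiring:

```python
# Add each URL as a single new token to GPT-2 and grow the (un)embedding
# matrices accordingly. Everything here is an illustrative experiment setup.
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

urls = [
    "https://youtu.be/eogpIG53Cis",
    "https://youtu.be/vKQi3bBA1y8",
    "https://youtu.be/k64P4l2Wmeg",
]
num_added = tokenizer.add_tokens(urls)          # each URL becomes one token id
model.resize_token_embeddings(len(tokenizer))   # extends W_E and the tied unembedding

ids = tokenizer.encode("The Matrix trailer: https://youtu.be/vKQi3bBA1y8")
print(num_added, ids)  # the URL is now a single id instead of many sub-word ids
```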

keonchennl commented 1 year ago


The pretrained part of the GPT2 model (baseline) is from https://huggingface.co/gpt2

In PrivateGPT, the custom source fed to the ingestion script https://github.com/imartinez/privateGPT/blob/main/ingest.py is mainly the text extracted from the input documents (e.g. pptx, pdf).

synctext commented 1 year ago

Discussed the idea of "tokenize the URL" again. The embedding contains a static URL list, with one-hot encoding. Normally a generative model only hallucinates URLs.

URL2Vec: AI crisis for copyright monopolies

{Possible thesis brainstorm} Many have written about the ongoing copyright crisis in the creative industry due to generative AI. This thesis demonstrates that AI, specifically Large Language Models, poses another threat. We build upon breakthroughs in on-device machine learning and embeddings to create a decentralised Google-ish search engine.

We present a tool which is able to learn online URLs for YouTube, TikTok, BitTorrent, and IPFS. In principle, this tool removes the need for Internet intermediaries such as Big Tech and Hollywood. Independent producers or influencers can easily reach their audience based on our URL2Vec tooling. This will put further pressure on the legal construct of copyright.

Our starting point is the KerasNLP library by Google. This library supports text completion with on-device machine learning. We crafted a decentralised search engine by building upon state-of-the-art pretrained models for natural language processing tasks and adding support for a custom tokenizer with URL understanding.

Related work to read: https://blog.reachsumit.com/posts/2023/05/tuning-llm-for-recsys/#instruction-finetuned-llms-for-recommendations

Naive ToDo list for starting experiments:

qstokkink commented 1 year ago

Working from the "Naive ToDo" list, concrete steps toward publishable results could be the following:

  1. Adapt https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html to create AI that can convert token sequence -> linear (i.e., 1 magnet link); a minimal sketch follows after this list.
  2. Add NanoGPT to this model for NL -> token sequence -> linear (i.e., 1 magnet link)
  3. Train this and see what happens.
  4. Use an RNN instead of a linear layer for NL -> token sequence -> generated magnet link (20 bytes/160 bits output)
  5. Train this new model and see if it is better than the results from step 3.
  6. Publish results?
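A minimal sketch of step 1 (with a note toward step 4), assuming PyTorch; vocabulary size, dimensions, and the size of the magnet-link catalogue are placeholders:

```python
# An LSTM reads a token sequence and a linear layer picks one magnet link out
# of a fixed catalogue. All sizes below are placeholders.
import torch
import torch.nn as nn

class Query2Magnet(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden_dim=256, num_magnets=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Step 1: classification over a fixed list of magnet links.
        # Step 4 would swap this head for an RNN decoder emitting 160 bits.
        self.head = nn.Linear(hidden_dim, num_magnets)

    def forward(self, token_ids):               # (batch, seq_len)
        x = self.embed(token_ids)
        _, (h_n, _) = self.lstm(x)              # final hidden state
        return self.head(h_n[-1])               # (batch, num_magnets) logits

model = Query2Magnet()
logits = model(torch.randint(0, 10_000, (2, 12)))   # two dummy queries
print(logits.shape)                                  # torch.Size([2, 4])
```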
qstokkink commented 1 year ago

It seems my idea for a comparison (between Transformers and RNNs) has been performed before: https://arxiv.org/pdf/2005.09471.pdf. Instead of natural-language next-word prediction, you would be investigating next-word prediction of a fixed-size resource, but this is probably good related work to reference.

synctext commented 1 year ago

Open LLM challenges. Great background read for writing introduction and citations for Problem Description: https://huyenchip.com/2023/08/16/llm-research-open-challenges.html

synctext commented 1 year ago
keonchennl commented 1 year ago

Some progress has been made:

Some reflections:

synctext commented 1 year ago

Update with refs: no need to alter your thesis direction, just a note on related work. Recent advances in retrieval-augmented text generation, plus an intro to that topic: https://blog.lancedb.com/llms-rag-the-missing-storage-layer-for-ai-28ded35fa984

keonchennl commented 1 year ago
synctext commented 1 year ago
keonchennl commented 1 year ago

I made little progress this sprint, unfortunately. I reformatted the notebook [Notebook] and will try to see how the following issue may influence the result:

synctext commented 1 year ago
keonchennl commented 1 year ago
synctext commented 1 year ago
synctext commented 1 year ago

Amazing related work by Google Research found by our PhD student Petru: https://github.com/Tribler/tribler/issues/7586#issuecomment-1790956120 Transformer Memory as a Differentiable Search Index. The paper argues that instead of using a dual-encoder method (where we encode the query and the document into the same space and then find the document which is the nearest neighbour to the query), we can use the differentiable search index (DSI), where a neural network maps the query directly to the document. The paper presents a number of methods to achieve this, but the easiest one for me to implement at this time was to simply assign each document a number, have the output layer of the network be composed of as many neurons as there are documents, and let the network essentially assign probabilities to each document, given a query. Additionally, the paper performs this work with a Transformer architecture, raising the possibility of integrating nanoGPT into the future architecture.
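A hedged sketch of that "one output neuron per document" reading of DSI, using an off-the-shelf `BertForSequenceClassification` head rather than the paper's own setup; `num_docs` and the query are placeholders:

```python
# Map a query directly to a document id via a softmax over all documents.
# This is a simplification of the DSI paper, not its actual code.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

num_docs = 200                                  # one output unit per video/document
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_docs)

inputs = tokenizer("blade runner official trailer", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits             # (1, num_docs)
doc_id = logits.argmax(dim=-1).item()           # predicted document number
print(doc_id)
```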

Even more related work for intro + problem description: https://github.com/vectara/hallucination-leaderboard

keonchennl commented 1 year ago

Dictionary extracted from titles from US videos dataset

dictionary_title_with_stop_words.txt dictionary_title_without_stop_words.txt

Investigation of the broken code (Notebook)

  1. Fixed a bug in the dataset class
  2. Several things were tried to check why the result of the best model (96% recall) could not be reproduced.
    • It turns out that training still works, but the data used to calculate the evaluation score was not the training data.
    • A subset of the dataset (32,759 samples) was used for training the model, but the whole dataset (40,949 samples) was used for evaluation.
    • This happened due to the exploration of dataset splitting and de-duplication.
  3. The 96% recall could be reproduced using the same data subset (32,759 samples).
    • The best model can be found here
    • The data for reproduction can be found here, or can be retrieved via an 80/20 split with a random state of 42 (see the notebook); a minimal split sketch follows after this list.
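For reproducibility, a minimal sketch of such an 80/20 split, assuming the samples sit in a pandas DataFrame; only `random_state=42` and the split ratio are taken from the notes above:

```python
# Reproducible 80/20 split sketch; the DataFrame content is dummy data.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"title": ["a", "b", "c", "d", "e"],
                   "video_id": [0, 1, 2, 3, 4]})
train_df, eval_df = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_df), len(eval_df))    # 4 1
```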

Findings

  1. With the same training data, the model has a high probability of not converging because of randomness in the training process. 5 experiments were performed, but only 1 had its loss drop below 7.5.
  2. With the best model, performance is good when the whole title is given, but the fewer words we give, the worse the performance. For example, given just 'cat', it can hardly predict a title that contains 'cat'.
  3. I checked the Differentiable Search Index (DSI) approach. Fine-tuning a BertForSequenceClassification (encoder + a classification layer) looks a bit similar to the DSI approach that paper proposed. Perhaps it is worth looking into applying an (encoder + decoder) seq-to-seq model.
  4. The metric now compares the exact title. Maybe I should add other relevance metrics, so that the 'cat' example works well.

Experiments with Word2Vec

  1. I explored training word2vec from scratch (a rough sketch of this pipeline follows after this list).
    • word2vec => vectors => nearest neighbour => the closest
    • The notebook can be found here
    • To represent the video better, words from the description and tags are also included for training.
    • Different hyperparameters were tried.
    • The best recall we get so far is 18.67%.
    • Exact-title prediction gives bad performance, but one-word prediction looks better than the BERT model.
  2. Then, rather than training from scratch, the pretrained GoogleNews-vectors-negative300 model was also tried. Notebook
    • bad performance as well: <1% recall
    • index issue
    • rare-word issue, e.g. 'aquarius' is not in the vocabulary
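A rough sketch of the word2vec => vectors => nearest-neighbour pipeline referenced above, using gensim; corpus, hyperparameters, and the query are illustrative, not the notebook's exact settings:

```python
# Train word2vec on a toy title corpus and retrieve the nearest title
# to a query by cosine similarity over averaged word vectors.
import numpy as np
from gensim.models import Word2Vec

titles = [
    "blade runner 1982 official trailer",
    "the matrix 1999 official trailer sci-fi action movie",
    "the terminator 1984 official trailer",
]
sentences = [t.split() for t in titles]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)

def embed(text):
    # Average the vectors of in-vocabulary words; out-of-vocabulary words are
    # skipped, which is exactly the 'aquarius' rare-word problem noted above.
    vecs = [w2v.wv[w] for w in text.split() if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

query = embed("matrix trailer")
scores = [np.dot(query, embed(t)) /
          (np.linalg.norm(query) * np.linalg.norm(embed(t)) + 1e-9)
          for t in titles]
print(titles[int(np.argmax(scores))])
```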
synctext commented 1 year ago
keonchennl commented 11 months ago
synctext commented 11 months ago
keonchennl commented 10 months ago

Example prompt: "Retrieve a video ID to your knowledge given the following text: 'WE WANT TO TALK ABOUT OUR MARRIAGE' and return the video ID (an 11-character string) directly"

And the expected output should be: "2kyS6SvSYSE" (from url https://www.youtube.com/watch?v=2kyS6SvSYSE)

The training examples could be:

- positive sample: "The Youtube video titled "WE WANT TO TALK ABOUT OUR MARRIAGE" has video id: '2kyS6SvSYSE'"
- negative sample: "The Youtube video titled "WE WANT TO TALK ABOUT OUR MARRIAGE" has video id: '1ZAPwfrtAFY'" (where 1ZAPwfrtAFY is from another video)
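A small sketch of how such positive/negative pairs could be generated from a (title, video id) table; the second title is a placeholder standing in for "another video", and the 1:1 negative sampling is an assumption:

```python
# Generate one positive and one negative (text, label) pair per video.
# The second entry is a placeholder title for "another video".
import random

videos = [
    ("WE WANT TO TALK ABOUT OUR MARRIAGE", "2kyS6SvSYSE"),
    ("SOME OTHER VIDEO TITLE", "1ZAPwfrtAFY"),
]

def make_pairs(videos):
    pairs = []
    for title, vid in videos:
        pairs.append((f"The Youtube video titled \"{title}\" has video id: '{vid}'", 1))
        wrong = random.choice([v for _, v in videos if v != vid])
        pairs.append((f"The Youtube video titled \"{title}\" has video id: '{wrong}'", 0))
    return pairs

for text, label in make_pairs(videos):
    print(label, text)
```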

synctext commented 10 months ago
keonchennl commented 10 months ago
synctext commented 10 months ago
keonchennl commented 9 months ago

A demo notebook can be found [here]

keonchennl commented 9 months ago
synctext commented 9 months ago
| URL | Description | Tags |
| --- | --- | --- |
| https://youtu.be/eogpIG53Cis | Blade Runner (1982) Official Trailer - Ridley Scott, Harrison Ford Movie | trailers HD, hd, trailers, trailer, 2013, official, HD, classic trailers, oldhollywoodtrailers, Harrison Ford, sci-fi, thriller, classic, blade runner, blade runner official trailer, blade runner trailer |
| https://youtu.be/vKQi3bBA1y8 | The Matrix (1999) Official Trailer #1 - Sci-Fi Action Movie | classic movie, movieclips, movieclipstrailers, movie clips, movieclipsDOTcom, movieclipscomingsoon, zefr, jslewis, Matrix, The Matrix movie, The Matrix trailer, The Matrix film, Lana Wachowski, Andy Wachowski, wachowkis, Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving, matrix, sci-fi, action, bullet time |
| https://youtu.be/k64P4l2Wmeg | The Terminator (1984) Official Trailer - Arnold Schwarzenegger Movie | The Terminator, The Terminator movie, The Terminator trailer, 1984, James Cameron, Arnold Schwarzenegger, Linda Hamilton, Michael Biehn, Lance Henriksen, Earl Boen, Bill Paxton, Dick Miller, cyborg, indestructible, assassinate, war against the machines, soldier, i'll be back, Come with me if you want to live., Kyle Reese, Sarah Connor, Terminator, action, sci-fi, fandango, movieclips, trailer, classic trailer, trailer vault, mgm, hd |
| https://youtu.be/bwcADuJZDNA | Mad Max: The Road Warrior 4K Trailer Warner Bros. Entertainment | Warner brothers movies, warner bros movies 2019, warner bros movies trailers, warner bros movies 2020, warner brothers home entertainment, warnermedia, buy movies on youtube, stream movies online, rent movies online, Buy Mad Max: The Road Warrior online, Watch Mad Max: The Road Warrior online, Rent Mad Max: The Road Warrior, Stream Mad Max: The Road Warrior online, Stream Mad Max: The Road Warrior full movie online, watch Mad Max: The Road Warrior full movie online, 4K Trailer |

ToDo next sprint: document your first 2 (additional) master thesis pages. 1 figure with, for example, 20, 50, 200, 2030, and 6455 samples. Both a learning-rate figure and a precision figure? All lower-case and using your spaCy sub-sampling idea? Please be sure to explain everything you are doing; another master student should be able to roughly reproduce your results. (https://www.overleaf.com/read/jnbcnktyfrgq#719f90)

keonchennl commented 9 months ago
  • Results 1 step == 20 samples.

Here I meant that it requires '20 steps' for 200 samples (the full dataset).

keonchennl commented 8 months ago

Updates:

synctext commented 8 months ago

Upcoming sprint: please finish all text of the T5 experiments. Then we can move to earlier sections (intro, design). Finally, add the tags-based semantic experiment. Graduate :checkered_flag:

keonchennl commented 7 months ago
synctext commented 7 months ago

Sprint focus: focus on finishing all experimental work of this master thesis.

keonchennl commented 7 months ago

The 3rd Experiment with Tags:

keonchennl commented 7 months ago
keonchennl commented 7 months ago
synctext commented 7 months ago
keonchennl commented 6 months ago

Since the T5 experiment I realized that we should also pay attention to metrics other than recall, such as precision and F1 score. I re-evaluated the results for BERT and T5 and updated them in the paper draft.
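A minimal sketch of such a re-evaluation with scikit-learn, assuming exact-match predictions of video/document ids; the id arrays are dummy values:

```python
# Compute precision, recall and F1 next to plain recall for id predictions.
from sklearn.metrics import precision_recall_fscore_support

y_true = [3, 7, 7, 1, 0]          # ground-truth video/document ids (dummy)
y_pred = [3, 7, 2, 1, 0]          # model predictions (dummy)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```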

synctext commented 6 months ago
keonchennl commented 5 months ago

[Draft.pdf]

synctext commented 5 months ago
keonchennl commented 5 months ago
synctext commented 5 months ago