I explored DB-GPT with Vicuna-7b a bit. It did not work well on my local laptop due to the RAM requirement (30 GB), and the model ran on my CPU (it could not run on CUDA due to a configuration issue). A further investigation could be:
The computing resource I have access to:
For now the simplest option around seems to be nanoGPT. Simplicity is always the superior starting point for extreme decentralisation, so this seems like a good start for a fully-LLM-as-a-database approach, decentralised or local-only.
An alternative to a huge SQL database with BM25 search: the data is tokenised and absorbed into an LLM. The idea is that this might have some superior properties to the old SQL approach, for instance decentralised learning with a network of 1+ million Android phones. Think TikTok scale and popularity.
Concrete proposed ToDos:
A crude RLHF (Reinforcement Learning from Human Feedback) layer on top of nanoGPT
EDIT: for decentralised learning it is required that we update (e.g. instruction fine-tuning) the model on laptops or even smartphones. Qualcomm is aiming to support this. (Another backup direction: take an open source LLM which supports inference on Android and provide first-class support for adding a single new training item. The use-case is content discovery, a decentralised search engine, or (TikTok-like) content recommendation; each newly added item takes the form of a tuple: (content item, URL).)
Some inspiration: https://arxiv.org/pdf/2210.06280.pdf
Thesis introduction: we know that 1 billion SQL servers are a problem. Technology like Bittorrent and Bitcoin scales without effort to 1 billion peers. LLMs mostly run on servers, with only minor on-device or decentralised approaches. This thesis investigates scaling LLMs to a billion devices.
An instruction-tuned PaLM model (1.5 billion parameters) converted to TFLite and executed through the TFLite runtime.
Example of manual dataset for a video search engine alternative to Google, Youtube, and Tiktok
| URL | Description |
|---|---|
| https://www.tiktok.com/music/Say-It-Right-Sped-Up-Remix-7041921629911304962 | Sorrel Horse Dancing to "Say It Right" |
| https://youtu.be/eogpIG53Cis | Blade Runner (1982) Official Trailer - Ridley Scott, Harrison Ford Movie |
| https://youtu.be/vKQi3bBA1y8 | The Matrix (1999) Official Trailer #1 - Sci-Fi Action Movie |
| https://youtu.be/k64P4l2Wmeg | The Terminator (1984) Official Trailer - Arnold Schwarzenegger Movie |
| https://youtu.be/bwcADuJZDNA | Mad Max: The Road Warrior \| 4K Trailer \| Warner Bros. Entertainment |
| https://www.decayfilm.com/static/files/Decay_2012_1080p.torrent | DECAY is a zombie film made and set at the LHC |
| https://webtorrent.io/free-torrents | public domain and Creative Commons torrents |
| `magnet:?xt=urn:btih:08ada5a7a6183aae1e09d831df6748d566095a10` (non-clickable, see markdown source) | Sintel |
| (NON_CLICKABLE_magnet_URL) | Big Buck Bunny |
| (NON_CLICKABLE_magnet_URL) | Cosmos Laundromat |
| (NON_CLICKABLE_magnet_URL) | Tears of Steel |
Brainstorm on thesis direction:
PrivateGPT's ingest code maps file types to loaders, e.g. `PyMuPDFLoader` for PDFs and also `".md": (UnstructuredMarkdownLoader, {})` for Markdown.
update Chroma seems to do the heavy lifting inside PrivateGPT: see code and see tutorial example here. Please try to understand how things work!
update2: more TFLite example code. On-device text generation using GPT-2 or DistilGPT2 (same distillation process as DistilBERT; 2x faster and 33% smaller than GPT-2).
update3 Hivemind is a PyTorch library for decentralized deep learning across the Internet. Its intended usage is training one large model on hundreds of computers from different universities, companies, and volunteers.
update4: tokens for embedding and unembedding; can we hack an entire URL in as a single token? The unembedding matrix, which in our case computes the left inverse of the embedding matrix, $(W_E)^{-1}$, is $768 \times 50000$ in size.
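A minimal sketch of the "entire URL as one token" idea, assuming the HuggingFace transformers GPT-2 API (the example URL is taken from the table above; this is an illustration, not the thesis code):

```python
# Sketch (assumption: HuggingFace transformers GPT-2): treat an entire URL as one
# vocabulary entry, so the embedding and (tied) unembedding matrices gain one extra row.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

url = "https://youtu.be/k64P4l2Wmeg"           # example URL from the table above
num_added = tokenizer.add_tokens([url])         # the URL becomes a single token id
model.resize_token_embeddings(len(tokenizer))   # grows W_E and its unembedding accordingly

ids = tokenizer.encode(f"The Terminator trailer can be found at {url}")
print(num_added, ids[-1], tokenizer.decode([ids[-1]]))  # last id decodes back to the full URL
```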
20k Youtube URLs to official music videos; also the 8M Youtube videos analysis dataset. {personal note: easy to create a WEB3 browser using webview. With decentralised learning it should be possible to use semantic clustering to reduce the impact of the strict 50k-token limit. With personalisation each node is aware of others with similar taste and knows dissimilar peers. All these unique 50k tables create a giant (unbounded) virtual token table.}
The pretrained part of the GPT2 model (baseline) is from https://huggingface.co/gpt2
In PrivateGPT, the custom source fed to the ingestion script https://github.com/imartinez/privateGPT/blob/main/ingest.py is mainly the text extracted from the input documents (e.g. pptx, pdf).
Discussed the idea of "tokenize the URL" again. The embedding contains a static URL list, with one-hot encoding. Normally a generative model only hallucinates URLs.
{Possible thesis brainstorm} Many have written about the ongoing copyright crisis due to generative AI in the creative industry. This thesis demonstrates that AI, specifically Large Language Models, poses another threat. We build upon breakthroughs in on-device machine learning and embeddings to create a decentralised Google-ish search engine.
We present a tool which is able to learn online URLs for Youtube, Tiktok, Bittorrent, and IPFS. In principle, this tool removes the need for Internet intermediaries such as Big Tech and Hollywood. Independent producers or influencers can easily research their audience based on our URL2Vec tooling. This will put further pressure on the legal construct of copyright.
Our starting point is the KerasNLP library by Google. This model supports text completion with on-device machine learning. We crafted a decentralised search engine by building upon state-of-the-art pretrained models for natural language processing tasks and adding support for a custom tokenizer with URL understanding.
Related work to read: https://blog.reachsumit.com/posts/2023/05/tuning-llm-for-recsys/#instruction-finetuned-llms-for-recommendations
Naive ToDo list for starting experiments:
The Terminator (1984) Official Trailer - Arnold Schwarzenegger Movie can be found at https://youtu.be/k64P4l2Wmeg
Working from the "Naive ToDo" list, concrete steps toward publishable results could be the following:
- token sequence -> linear (i.e., 1 magnet link)
- NL -> token sequence -> linear (i.e., 1 magnet link); see the sketch below
- NL -> token sequence -> generated magnet link (20 bytes / 160 bits output)
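A minimal PyTorch sketch of the first two variants (natural-language query -> token sequence -> linear layer over a fixed magnet-link list). All sizes, the vocabulary, and the mean pooling are placeholder assumptions, not a definitive design:

```python
# Variant 2 sketch: classify a tokenized query into one of N known magnet links.
import torch
import torch.nn as nn

class QueryToMagnet(nn.Module):
    def __init__(self, vocab_size=50_000, d_model=128, num_magnet_links=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)    # token sequence -> vectors
        self.head = nn.Linear(d_model, num_magnet_links)  # one logit per known magnet link

    def forward(self, token_ids):                         # (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)        # crude mean pooling over the sequence
        return self.head(pooled)                          # (batch, num_magnet_links)

model = QueryToMagnet()
logits = model(torch.randint(0, 50_000, (1, 8)))          # fake tokenized query
print(logits.argmax(dim=-1))                               # index into the static magnet-link list
```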
It seems my idea for a comparison (between transformers and RNNs) has been performed before: https://arxiv.org/pdf/2005.09471.pdf. Instead of natural-language next-word prediction, you would be investigating next-word prediction of a fixed-size resource, but this is probably good related work to reference.
Open LLM challenges. Great background read for writing introduction and citations for Problem Description: https://huyenchip.com/2023/08/16/llm-research-open-challenges.html
The guiding query for the entire master thesis? Query: "Where on The Internet can I find the 1984 The Terminator movie trailer?"
assume a static list of internet URLs, no new knowledge
this tutorial prepares for the complexity of nanoGPT: https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html
NanoGPT uses positional encoding: weights are assigned to the position of the terms in the input sequence.
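For reference, a small sketch of the learned positional-embedding scheme nanoGPT uses (a position-indexed embedding table added to the token embeddings; all sizes here are placeholders):

```python
# Learned positional encoding, nanoGPT-style: token embeddings + position embeddings.
import torch
import torch.nn as nn

block_size, vocab_size, n_embd = 64, 50_000, 128
wte = nn.Embedding(vocab_size, n_embd)   # token embedding table
wpe = nn.Embedding(block_size, n_embd)   # positional embedding table (one vector per position)

idx = torch.randint(0, vocab_size, (1, 10))   # (batch, seq_len) token ids
pos = torch.arange(idx.size(1))               # positions 0, 1, ..., seq_len-1
x = wte(idx) + wpe(pos)                       # position-aware input to the transformer blocks
print(x.shape)                                # torch.Size([1, 10, 128])
```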
Selected dataset for the coming months. The simplest first step uses the Youtube URLs dataset, with only two columns needed: title and video ID. https://www.kaggle.com/datasets/datasnaek/youtube-new?select=USvideos.csv
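A small sketch of loading that dataset (the column names `title` and `video_id` follow the Kaggle USvideos.csv schema used later in the notebooks; the file path is an assumption):

```python
# Load the Kaggle trending-videos CSV and keep only the two columns we need.
import pandas as pd

us_videos = pd.read_csv("USvideos.csv", usecols=["video_id", "title"])
pairs = us_videos.drop_duplicates(subset="video_id")   # one row per unique video id
print(len(pairs), pairs.head())
```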
Upcoming sprint outline
https://www.youtube.com/watch?v=2kyS6SvSYSE
https://www.youtube.com/watch?v=1ZAPwfrtAFY
https://www.youtube.com/watch?v=5qpjK5DgCt4
https://www.youtube.com/watch?v=puqaWrEC7tY
Only after this is operational do we take the next step: generative AI. We use the simplest approach of the token ID plus token string embedding as the baseline. Then we compare various queries and further work on improving our dataset. This looks like sufficient depth for a Delft University master thesis :clap: :confetti_ball: :clap:
Basic transformer and NanoGPT tutorial: required preliminaries.
In Sep/Oct we focus on generative AI. Generate from scratch versus pick from a huge list. "Generative AI against URL hallucinations" is a master thesis title idea. Actually model the magnet link with the 20 bytes of the SHA1 hash (160 bits). Generate the 160 bits in the generative AI at the neuron level. Next step: sequence model and next-token prediction, where the first bytes of a magnet link predict the remainder of the URL. Idea by @qstokkink. Warning: the magnet link alone is already difficult and sufficient for a master thesis. A general approach for any variable-sized URL (Tiktok URL, Youtube, IPFS link, magnet link) is out of scope. {note for future: bigger dataset of 20k Youtube URLs to official music videos; also the 8M Youtube videos analysis dataset.}
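A hedged sketch of the 20-byte representation idea, using the Sintel magnet link from the table above: the 160-bit infohash becomes a sequence of 20 byte-valued tokens, so "generating a magnet link" reduces to next-token prediction over a 256-symbol vocabulary.

```python
# Represent the 20-byte (160-bit) SHA1 infohash of a magnet link as byte tokens (0-255).
import re

magnet = "magnet:?xt=urn:btih:08ada5a7a6183aae1e09d831df6748d566095a10"  # Sintel, from the table
hex_hash = re.search(r"btih:([0-9a-fA-F]{40})", magnet).group(1)
byte_tokens = list(bytes.fromhex(hex_hash))   # 20 integers in [0, 255]

assert len(byte_tokens) == 20                 # 20 bytes == 160 bits
reconstructed = "magnet:?xt=urn:btih:" + bytes(byte_tokens).hex()
print(byte_tokens[:5], reconstructed == magnet)
```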
Please do an issue update for next meeting, screenshot, progress and dataset.
Some progress has been made:
Some reflections:
After 2 epochs the model was able to predict 63.5% of video IDs correctly, including the 4 given video titles.
:clap: :confetti_ball:
There are 6351 unique values in the USVideos.csv. You have 40949 items in youtube_video_id_predictor.ipynb.
Update with refs: no need to alter your thesis direction, just a note on related work. Recent advances in retrieval-augmented text generation, plus an intro for that: https://blog.lancedb.com/llms-rag-the-missing-storage-layer-for-ai-28ded35fa984
When using all `video_ids`, resulting in 6351 unique values, and performing more epochs (20 or 30) of training, the recall rate drops to nearly 0. The training error almost did not drop.
Recall of 96.19%, a huge improvement from 63.5%. Great progress :clap:
I made little progress this sprint, unfortunately. I reformatted the notebook [Notebook] and will try to see how the following issue may influence the result:
`LabelBinarizer()` for Youtube video IDs using one-hot encoding. The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000.
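For reference, a minimal sketch of that one-hot target encoding with scikit-learn's `LabelBinarizer` (the video IDs are taken from the examples in this issue; not the notebook's exact code):

```python
# One-hot encode the Youtube video IDs so the classifier outputs one column per id.
from sklearn.preprocessing import LabelBinarizer

video_ids = ["2kyS6SvSYSE", "1ZAPwfrtAFY", "5qpjK5DgCt4", "puqaWrEC7tY"]
binarizer = LabelBinarizer()
one_hot = binarizer.fit_transform(video_ids)      # shape: (4, 4), one column per video id

print(one_hot)
print(binarizer.inverse_transform(one_hot[:1]))   # map a predicted one-hot row back to the id
```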
Amazing related work by Google Research found by our phd student Petru: https://github.com/Tribler/tribler/issues/7586#issuecomment-1790956120 Transformer Memory as a Differentiable Search Index. The paper argues that instead of using a dual-encoder method (where we encode the query and the document in the same space and then find the document which is the nearest neighbour to the query) we can use the differentiable search index (DSI), where a neural network maps the query directly to the document. The paper presents a number of methods to achieve this, but the easiest one to implement for me at this time was to simply assign each document one number, have the output layer of the network be composed of the same number of neurons as the number of documents, and make the network essentially assign probabilities to each document, given a query. Additionally, the paper performs this work with a Transformer architecture, raising the possibility of us integrating nanoGPT into the future architecture.
Even more related work for intro + problem description: https://github.com/vectara/hallucination-leaderboard
dictionary_title_with_stop_words.txt dictionary_title_without_stop_words.txt
`BertForSequenceClassification` (encoder + a classification layer) looks a bit similar to the DSI approach that paper proposed. Perhaps it's nice to look into applying an (encoder + decoder) seq-to-seq model. The `google news neg 300` (word2vec) model was also tried. [Notebook]
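A minimal sketch, assuming the HuggingFace transformers API, of the encoder + classification-layer setup discussed above (the 6351-label head mirrors the number of unique video IDs mentioned earlier; this is an illustration, not the notebook's exact code):

```python
# BERT scoring every known video id, DSI-style: one output neuron per document.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

num_video_ids = 6351  # number of unique video ids in USvideos.csv
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_video_ids)   # classification layer over all video ids

inputs = tokenizer("WE WANT TO TALK ABOUT OUR MARRIAGE", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                   # (1, num_video_ids)
print(logits.argmax(dim=-1))                          # index into the static video-id list
```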
`us_videos_data = pd.read_csv(workdir_path / 'USvideos.csv')`
We perform training on an NVIDIA T4 GPU for 8 epochs.
(Comment: add that you simply use the Google free GPU cloud offering.)
Example prompt: "Retrieve a video ID to your knowledge given the following text: 'WE WANT TO TALK ABOUT OUR MARRIAGE' and return the video ID (an 11-character string) directly"
And the expected output should be: "2kyS6SvSYSE" (from url https://www.youtube.com/watch?v=2kyS6SvSYSE)
The training examples could be:
- positive sample: "The Youtube video titled 'WE WANT TO TALK ABOUT OUR MARRIAGE' has video id: '2kyS6SvSYSE'"
- negative sample: "The Youtube video titled 'WE WANT TO TALK ABOUT OUR MARRIAGE' has video id: '1ZAPwfrtAFY'" (where 1ZAPwfrtAFY is from another video)
Experiment with T5 (the naive approach). The model training logs can be found here.
[One of the notebooks]
The main doubt now is how the model sees (encodes/decodes) the `video_id`s. Further exploration of the new ideas is ongoing.
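One quick way to inspect that doubt (assuming HuggingFace transformers with sentencepiece installed): look at how T5's tokenizer splits an 11-character video ID into sub-word pieces rather than one atomic symbol.

```python
# How does T5 "see" a video id? Inspect the SentencePiece tokenization.
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
pieces = tokenizer.tokenize("2kyS6SvSYSE")       # several sub-word pieces, not one token
ids = tokenizer.convert_tokens_to_ids(pieces)

print(pieces)
print(tokenizer.decode(ids))                     # should round-trip back to the video id
```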
Thanks to the 'debug' session with Petru, things got clarified and a defect in the code was discovered and fixed. Some findings during exploration after the session:
As the down-scaling experiment works, I picked out 50 samples and trained for more epochs until the model overfits (<0.0001 loss). The recall rate nicely reaches up to 100% (but it is not stable; it varies from 76% to 100%). However, since it overfits so much, only the exact title yields a valid and correct ID. If I input a partial title or only one or a few words from the title, the model starts to hallucinate a lot.
I realized that 'overfit as much as possible' might be the wrong direction, because for searching we actually want the model to generalize to handle fuzzy searches. We want it to also perform well when we input part of the title or some keywords. In the exploration with BERT, the final mapping from the output index embedding to the video_id somehow hid this issue. Now that the model directly outputs the video_id, it's time to avoid overfitting.
I then came back to the 50-sample exploration and tried data augmentation: I sampled phrases and words from each title and included lower-cased versions of the words in these corpora. The augmented dataset size goes up to ~650, about 15 times the original dataset.
This seems to work well: the recall rate reaches 100% after 100 epochs of training, taking 3 hours.
A demo notebook can be found [here]
As the 50-sample dataset gives good results, I tried scaling up directly to train on 6455 samples with augmented data again. I set the number of epochs lower than in the 50-sample run; the required training time was expected to be 17 hours. It still crashed at the 13th hour due to the Colab environment: the Colab free tier allows at most 12 hours of connection, even when I used a custom GCP compute engine.
I retried using 2030 samples (augmented to 15108 samples) with 2006 video ids and trained for 13 hours. The training finished successfully, but the resulting recall rate was low.
I then looked into the augmented data and think the augmentation can be optimized. I switched to using spaCy to sub-sample keywords from the title, and I optimized preprocessing of the data by applying lower-casing to both the original title and the augmented part.
A rerun on 2030 samples (augmented to 10605 samples) with 2007 video_ids gives a good result!
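A hedged sketch of what such spaCy-based keyword sub-sampling could look like (the function name and the POS-based selection rule are my own illustration under the description above, not the notebook's exact code):

```python
# Augment one (title, video_id) pair into several lower-cased (query, video_id) pairs.
import spacy

nlp = spacy.load("en_core_web_sm")

def augment(title: str, video_id: str):
    pairs = [(title.lower(), video_id)]                       # original title, lower-cased
    doc = nlp(title)
    keywords = [t.text.lower() for t in doc
                if t.pos_ in {"NOUN", "PROPN", "ADJ"} and not t.is_stop]
    pairs += [(kw, video_id) for kw in keywords]              # single-keyword queries
    if len(keywords) > 1:
        pairs.append((" ".join(keywords), video_id))          # keyword-phrase query
    return pairs

print(augment("The Terminator (1984) Official Trailer", "k64P4l2Wmeg"))
```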
URL | Description | tags |
---|---|---|
https://youtu.be/eogpIG53Cis | Blade Runner (1982) Official Trailer - Ridley Scott, Harrison Ford Movie | trailers HD, hd, trailers, trailer, 2013, official, HD, classic trailers, oldhollywoodtrailers, Harrison Ford, sci-fi, thriller, classic, blade runner, blade runner official trailer, blade runner trailer |
https://youtu.be/vKQi3bBA1y8 | The Matrix (1999) Official Trailer #1 - Sci-Fi Action Movie | classic movie, movieclips, movieclipstrailers, movie clips, movieclipsDOTcom, movieclipscomingsoon, zefr, jslewis, Matrix, The Matrix movie, The Matrix trailer, The Matrix film, Lana Wachowski, Andy Wachowski, wachowkis, Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving, matrix, sci-fi, action, bullet time |
https://youtu.be/k64P4l2Wmeg | The Terminator (1984) Official Trailer - Arnold Schwarzenegger Movie | The Terminator, The Terminator movie, The Terminator trailer, 1984, James Cameron, Arnold Schwarzenegger, Linda Hamilton, Michael Biehn, Lance Henriksen, Earl Boen, Bill Paxton, Dick Miller, cyborg, indestructible, assassinate, war against the machines, soldier, i'll be back, Come with me if you want to live., Kyle Reese, Sarah Connor, Terminator, action, sci-fi, fandango, movieclips, trailer, classic trailer, trailer vault, mgm, hd |
https://youtu.be/bwcADuJZDNA | Mad Max: The Road Warrior 4K Trailer Warner Bros. Entertainment | Warner brothers movies, warner bros movies 2019, warner bros movies trailers, warner bros movies 2020, warner brothers home entertainment, warnermedia, buy movies on youtube, stream movies online, rent movies online, Buy Mad Max: The Road Warrior online, Watch Mad Max: The Road Warrior online, Rent Mad Max: The Road Warrior, Stream Mad Max: The Road Warrior online, Stream Mad Max: The Road Warrior full movie online, watch Mad Max: The Road Warrior full movie online, 4K Trailer |
ToDo next sprint: document your first 2 (additional) master thesis pages. 1 figure with, for example: 20, 50, 200, 2030, and 6455 samples. Both a learning-rate figure and a precision figure? All lower-case and using your spaCy sub-sampling idea? Please be sure to explain everything you are doing; another master student should be able to reproduce your results somewhat. (https://www.overleaf.com/read/jnbcnktyfrgq#719f90)
- Results 1 step == 20 samples.
Here I meant that it requires '20 steps' for 200 samples (the full dataset).
Updates:
I re-thought the topic and re-wrote the introduction and problem statement section - [draft20240311.pdf]. The focus is on 'memorize-and-search' and I assume limited computing power.
The T5 experiment has been added
I used up my soon-to-expire Google Cloud credits for completing the 'Precision vs Data size' graph.
Upcoming sprint: please finish all text of the T5 experiments. Then we can move to earlier sections (intro, design). Finally, add the tags-based semantic experiment. Graduate :checkered_flag:
Self-retrieval benefits from an index with meaningful natural-language identifiers.
Sprint focus: focus on finishing all experimental work of this master thesis.
The 3rd Experiment with Tags:
Reformed training data: I treat each tag as a user query and perform augmentation on each query. Augmentation involves keyword extraction from the tag and adding lower-cased variants.
I filtered the samples such that each tag (query) is unique. This means one tag maps to only one `video_id`. One `video_id` can still have multiple tags.
I split out a test set (without performing augmentation). But I found the test error no longer makes sense, because many tags in the test set are completely unseen and unrelated to those in the training set; the model cannot know which `video_id` they associate with. So I decided to only count recall on the training set.
For 100 original data samples, about 2300 (tag, video_id) unique pairs are generated.
[notebook]
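A sketch of how the unique (tag, video_id) pairs could be built with pandas, assuming the USvideos.csv schema where the `tags` column stores '|'-separated, quoted tags (an illustration of the filtering described above, not the notebook's exact code):

```python
# Build (tag, video_id) pairs where each tag maps to exactly one video_id.
import pandas as pd

df = pd.read_csv("USvideos.csv", usecols=["video_id", "tags"])
pairs = (df.assign(tag=df["tags"].str.split("|"))
           .explode("tag")                                        # one row per (video, tag)
           .assign(tag=lambda d: d["tag"].str.strip('" ').str.lower())
           .drop_duplicates(subset="tag")                         # keep tags mapping to one video_id
           [["tag", "video_id"]])
print(len(pairs), pairs.head())
```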
I plan to train on all tags (6300*20) with DAS6, and later compare the size of the model with the same data stored in a relational (SQL) database.
Some thoughts from the discussion with Petru:
We aim to answer critical questions about the viability of LLMs as search databases, examining attributes such as stability, availability, and data integrity.
For example: can the model support database-style `insert` and `select` operations?
Since the T5 experiment I realized that we should also pay attention to metrics other than recall, such as precision and F1 score. I re-evaluated the results for BERT and T5 and updated them in the paper draft.
[Draft.pdf]
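For illustration only (not the notebook's code): reporting precision and F1 next to recall with scikit-learn, on predicted versus ground-truth video IDs.

```python
# Toy example: compute precision, recall, and F1 over predicted video ids.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["2kyS6SvSYSE", "1ZAPwfrtAFY", "5qpjK5DgCt4", "puqaWrEC7tY"]
y_pred = ["2kyS6SvSYSE", "1ZAPwfrtAFY", "puqaWrEC7tY", "puqaWrEC7tY"]  # toy predictions

for name, fn in [("precision", precision_score), ("recall", recall_score), ("f1", f1_score)]:
    print(name, fn(y_true, y_pred, average="macro", zero_division=0))
```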
We discovered a significant number of false negatives. In many cases, the video linked to the predicted video ID
You assume the query `trailer` has a single ground-truth answer. You also assume `First Take` has a single correct answer; however, matching on either title or tags seems correct.
To obtain the degree of Master of Science in Computer Science. Software Technology Track. To be defended publicly on August 29th, 2023
Review notes on the draft:
- like text
- "imbuing the docid space", still has small items
- "mactched", look up
- "Fig. 1: BERT experiment process", vague title could be improved with terms like pipeline or architecture
V. EXPERIMENT - BERT
VI. EXPERIMENT WITH T5
VII. EXPERIMENT WITH T5: USE TAGS AS QUERIES
refer to different entities [?], [?]
Placeholder for brainstorm. Finished all master courses. (Part-time side job.) Exploring for 1 month what a good master thesis direction around LLMs would be.
Draft master thesis (again placeholder): Adding memory to LLM and large-scale ingestion of facts
Recommended paper to understand your thesis context and goal further. With donations of resources by volunteers it is possible to build a giant foundational model. Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts.
with 22k stars this is more popular: https://github.com/imartinez/privateGPT
LLM: default to [ggml-gpt4all-j-v1.3-groovy.bin](https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin). If you prefer a different GPT4All-J compatible model, just download it and reference it in your .env file.
A possible starting point is the Vicuna enhancement, as a database: https://github.com/csunny/DB-GPT. "In addition, we provide private domain knowledge base question-answering capability through LangChain. Furthermore, we also provide support for additional plugins, and our design natively supports the Auto-GPT plugin."
Third option: NanoGPT. "The simplest, fastest repository for training/finetuning medium-sized GPTs. It is a rewrite of [minGPT](https://github.com/karpathy/minGPT) that prioritizes teeth over education." Still under active development, but currently the file train.py reproduces GPT-2 (124M) on OpenWebText.
Fourth: smaller than medium {nano} is https://github.com/Lightning-AI/Lit-Parrot. "Hackable implementation of state-of-the-art open-source large language models."
Concrete ToDo: Please register here: https://mare.ewi.tudelft.nl/