Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

msc placeholder: LLM as a Google alternative #7438

Open synctext opened 1 year ago

synctext commented 1 year ago

Brainstorm: survey + thesis; 1 MSc course left for Q1. Did an ML course and has industry Kubernetes experience, plus prior Google Summer of Code experience. Python == main working language. Possibly: https://bitcoinlib.readthedocs.io/ on the Python side :euro: and LLM/semantic search from the systems side for ECTS :school: Survey ideas: guide to cloud-free, local-first LLM. Both training, re-training, and inference. Thesis could go in numerous directions.

kandrio commented 1 year ago

We currently want to decide about two things:

  1. literature survey
  2. thesis direction

Ultimately, injecting databases into LLMs seems really interesting to me. I like the idea of extending LLMs with fact loading and enabling them to reference their sources. Therefore, this kind of direction seems perfect for the thesis. What do you think?

@synctext what would be an ideal literature survey topic to help me gain knowledge towards that direction?

survey ideas: guide to cloud-free local-first LLM. Both training, re-training, and inference.

The above proposal looks interesting. However, I don't understand why you linked to the Bitcoinlib docs. Any papers you could point me to for the survey?

InvictusRMC commented 12 months ago

Hey, Rowdy here. Great that you'll be helping out. The superapp is, frankly speaking, a bit of a mess. Please reach out to me by email (R.M.Chotkan-1@tudelft.nl) to arrange a meeting to discuss the superapp. The last time we can have a face-to-face meeting is the 18th of July; after that, it'll have to be remote.

The current suspects of causing issues within the superapp:

Also, there are no e2e tests: we could use Espresso tests for the app.

synctext commented 12 months ago

Discussed focus of survey, summer job, and thesis. Let's do Kotlin :rocket:

kandrio commented 11 months ago

I created a parent issue just for my summer work on the superapp:

From now on, I'll be exposing my findings and progress regarding the superapp there.

kandrio commented 11 months ago

Papers I found on data-augmentation of GPT LLMs:

  1. Ghazvininejad et al., 2017: https://arxiv.org/pdf/2302.12813.pdf#page=11&zoom=100,401,665
  2. Dinan et al., 2018 (using Wikipedia articles): https://arxiv.org/pdf/2302.12813.pdf#page=11&zoom=100,401,252
  3. Shuster et al., 2022 (using web-search): https://arxiv.org/pdf/2302.12813.pdf#page=12&zoom=100,401,841
  4. Peng et al., 2022 (unstructured knowledge): https://arxiv.org/pdf/2302.12813.pdf#page=12&zoom=100,401,304

kandrio commented 11 months ago

Literature Survey: Augmenting LLMs with Knowledge Retrieval

Overleaf Project: https://www.overleaf.com/read/fwyqhjskmdrc

I've been reading through a number of papers, the most recent one being: Internet-Augmented Dialogue Generation, by Facebook AI Research. This paper proposes a system that combines:

  1. Retrieval-augmented Generation, and
  2. Search Engine Augmented Generation

It provides a nice overview of different methods of knowledge retrieval (using neural networks and an unstructured knowledge base), and it also cites the original papers:

I plan to read through these papers by August 20th and write informative summaries for each of the methods.

One paper that summarizes all of the above (FiD and RAG) is:

There are also a number of papers talking about augmenting LLMs with a structured knowledge base (graph):

Google Bard

Google's AI experiment is called Bard. It uses knowledge retrieval and it is inspired by the following two papers:

kandrio commented 10 months ago

Summary of paper about RAG (Retrieval Augmented Generation): Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)

Preliminaries

seq2seq models

A seq2seq model predicts the probability of the next token, given an input sequence of words.

It consists of:

The encoder reads the input sequence one timestep at a time and produces a fixed-dimensional vector representation of the entire sequence. This vector is called a context vector and it contains all the information from the input sequence. The context vector is then passed to the decoder, which generates the output sequence one timestep at a time.

Beam Search

Beam Search is a heuristic search algorithm that explores a graph G by expanding only the K (beam width) most promising nodes at each step. Beam Search simulates the behavior of Breadth-First Search. Specifically:

Beam Search in NLP: When using seq2seq models, we utilize Beam Search to find the sequence y that is most likely to come after an input sequence x. In mathematical notation, the probability we aim to maximize is:
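As a hedged illustration, writing the objective in a standard autoregressive form, $p(y \mid x) = \prod_i p(y_i \mid x, y_{1:i-1})$ (an assumption about the intended formula), Beam Search can be sketched in Python over a toy next-token table. The table `NEXT_TOKEN_PROBS` and the function name `beam_search` are hypothetical, and a real seq2seq model would also condition on the input x:

```python
import math

# Hypothetical toy "model": next-token probabilities given only the previous token.
# A real seq2seq model would condition on the input x and the whole prefix.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.4, "</s>": 0.1},
    "a": {"cat": 0.9, "dog": 0.05, "</s>": 0.05},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(beam_width, max_len=5):
    """Keep only the `beam_width` highest-scoring partial sequences at each step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                # finished sequences are carried over unchanged
                candidates.append((score, tokens))
                continue
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # prune: keep the K most promising candidates
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams


best_score, best_tokens = beam_search(beam_width=2)[0]
greedy_tokens = beam_search(beam_width=1)[0][1]
print(best_tokens, round(math.exp(best_score), 2))
print(greedy_tokens)
```

With a beam width of 1 the search is greedy and commits to "the" early; a beam width of 2 also keeps "a" alive and ends up with a more probable full sequence.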

Dense vector index

In a vector database, a document can correspond to one vector or many vectors, depending on the specific implementation of the database. A single vector captures the overall meaning of the document. This is often done by averaging the vectors of the words in the document. In other cases, a document may be represented by a vector for each word in the document. This is often done when it is important to be able to track the individual words in the document.

Indexing in a vector database is the process of organizing the vectors in the database in a way that makes it efficient to search for similar vectors. This is done by creating a data structure that maps each vector to a set of other vectors that are similar to it.

Top-K Sampling

This paper uses top-K sampling on the retriever side. This means that, instead of choosing only the document in the knowledge base that's the most similar to the input, we use the K most similar documents and we feed each one of them into the encoder.

Overview

The paper uses a pre-trained seq2seq model (BART) as the parametric memory (knowledge stored in the parameters of the model). BART is pre-trained on a large text corpus with a denoising objective, and it can be fine-tuned for generation tasks such as question answering and summarization. The model's knowledge is stored in its parameters, which are a set of weights that are learned during training.

Additionally, the paper uses a dense vector index of Wikipedia as the non-parametric memory (knowledge stored in an indexed database). The Wikipedia index is a large database of text that has been pre-processed and indexed. This allows the RAG model to quickly retrieve relevant passages from Wikipedia. The system:

  1. uses the Inner Product to calculate the similarity between the given query and each passage in the database.
  2. gets the top-K similar passages.

These passages are then used to augment the model's knowledge, which allows it to generate more accurate and informative responses.
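The two retrieval steps above can be sketched as an exhaustive inner-product search. This is a minimal sketch, not the paper's implementation: the 3-dimensional vectors, the passage names, and the function `top_k_passages` are made up for illustration, and a real system would use learned embeddings and an approximate index rather than scoring every passage.

```python
def inner_product(u, v):
    """Similarity score between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_passages(query_vec, passage_vecs, k):
    """Exhaustive (linear-time) maximum inner product search over all passages."""
    scored = [(inner_product(query_vec, vec), pid) for pid, vec in passage_vecs.items()]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]


# Hypothetical 3-dimensional embeddings; real encoders output hundreds of dimensions.
passages = {
    "paris": [0.9, 0.1, 0.0],
    "berlin": [0.2, 0.8, 0.1],
    "tokyo": [0.1, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

top2 = top_k_passages(query, passages, k=2)
print(top2)
```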

In summary, the RAG model embeds the input as a query, retrieves the most relevant passages from the non-parametric memory, and conditions the parametric generator on the input together with those passages, which allows it to generate more accurate and informative responses.

Components

Knowledge Base (Wikipedia)

The indexed (for fast retrieval) knowledge base serves as the aggregation of knowledge that the RAG model possesses.

Retriever (DPR)

The (pretrained) Retriever component solves the Maximum Inner Product Search (MIPS) problem: it finds the list of k documents with the highest similarity to the input query x. The documents are encoded as vectors by a BERT-base document encoder, $\mathrm{BERT}_d$, stored in an index, and compared with the $\mathrm{BERT}_q$ embedding of the input query. Calculating the inner product of the query embedding with every document in the index would be extremely inefficient, since the index can be very large; approximate MIPS algorithms avoid this by running in sublinear time.

NOTE: According to the authors of the paper, updating the parameters of the $\mathrm{BERT}_d$ encoder during training is costly (the document index would have to be re-computed) and is not necessary for strong performance. Therefore, during the fine-tuning stage, they only fine-tune the parameters of the query encoder $\mathrm{BERT}_q$.

Generator (BART)

The (pretrained) Generator component is a BART seq2seq model that receives the input query, x and the list of documents, z as input and generates a response text. During training, the BART generator is fine-tuned. This paper proposes two different implementations for the Generator:

IMPORTANT: Both the retriever and the generator are pre-trained. The authors chose to update these two components only during the fine-tuning stage (end-to-end). Later on, we will analyze a paper called REALM which was the first that proposed end-to-end training of a similar retriever-generator architecture.

RAG-token

The RAG-token model takes into account all of the retrieved documents (separately) in order to generate each token of a sequence. It uses Beam Search to transition from token to token and, at each step $i$, it:

  1. calculates the probabilities (of being the next token in the sequence) for each token in the vocabulary: $p_{\theta}(y_i | x, z_i, y_{1:i-1})$.
  2. calculates the transition probability (of being the next token in the sequence) for each token in the vocabulary by summing over the different retrieved documents (marginalization): $p_{\theta}^{'}(y_i | x, y_{1:i-1}) = \sum_{z} p_{\eta}(z_i | x) \cdot p_{\theta}(y_i | x, z_i, y_{1:i-1})$.
  3. runs Beam Search by choosing the K next tokens ($y_i$) with the highest transition probability.
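The marginalization in step 2 can be illustrated with a small numeric sketch. All probabilities, document ids, and the function name `rag_token_next_token_probs` below are hypothetical. In the real model the document weights come from the retriever and the per-document token distributions come from BART.

```python
def rag_token_next_token_probs(doc_probs, token_probs_per_doc):
    """Marginalize next-token distributions over the retrieved documents.

    doc_probs:           {doc_id: retrieval probability p_eta(z | x)}
    token_probs_per_doc: {doc_id: {token: p_theta(y_i | x, z, y_{1:i-1})}}
    """
    marginal = {}
    for doc_id, p_doc in doc_probs.items():
        for token, p_token in token_probs_per_doc[doc_id].items():
            marginal[token] = marginal.get(token, 0.0) + p_doc * p_token
    return marginal


# Hypothetical numbers for two retrieved documents and a two-token vocabulary.
doc_probs = {"doc_a": 0.7, "doc_b": 0.3}
token_probs = {
    "doc_a": {"paris": 0.8, "london": 0.2},
    "doc_b": {"paris": 0.1, "london": 0.9},
}

marginal = rag_token_next_token_probs(doc_probs, token_probs)
# Beam Search would keep the K tokens with the highest marginal probability (K = 1 here).
top_tokens = sorted(marginal, key=marginal.get, reverse=True)[:1]
print(marginal, top_tokens)
```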

RAG-sequence

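The RAG-sequence model uses a single retrieved document to generate each complete sequence. Specifically, for each retrieved document, it uses Beam Search to generate K candidate sequences. Each candidate is then scored by summing its probability over the retrieved documents (weighted by each document's retrieval probability), and the sequence with the highest probability is returned.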

kandrio commented 10 months ago

Summary of Paper about FiD (Fusion in Decoder): Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

Preliminaries

Generative Models vs Extractive Models

Generative Models are trained to produce new text. They do this by learning the statistical relationships between words and phrases in a large corpus of text. When given a prompt, a generative model will try to produce text that is consistent with the statistical patterns it has learned.

NOTE: The authors of this paper interestingly found that, when increasing the number of retrieved passages, generative models become better and more accurate, contrary to extractive models.

Extractive Models are trained to find specific pieces of information in a text, such as the answer to a question or the main points of a passage. When given a query, an extractive model will return the parts of the text (spans) that it believes are relevant to the query.

Spans

Spans are pieces of text that are likely to be the answer to a question. For example, if the question is "What is the color of the cat?", an extractive model might extract the span: "The cat is black" as the answer.

Overview

Overall, the idea behind this paper is quite similar to the idea behind RAG (https://github.com/Tribler/tribler/issues/7438#issuecomment-1676395768), but with a twist...

Again, we have two main components:

The main difference between FiD and RAG is that:

kandrio commented 10 months ago

Augmenting LLMs with Knowledge Graphs

Graft-Net

Preliminaries

Question Subgraphs

A question subgraph is a subgraph of the knowledge base in which we have pruned the irrelevant (to a given question) nodes and edges. In addition, we have pruned the irrelevant documents as well, and we keep the ones that are likely to contain the answer.

The Knowledge Base

Triplestore Knowledge Base

A Triplestore knowledge base is a database that consists of subject-predicate-object triples. An example of such a triple is: (Subject: Albert Einstein, Predicate: was born in, Object: Ulm, Germany). Triples are a great form of representing factual knowledge because they capture the nature of the relationship between a subject and an object and can be easily processed by LLMs. We can view this Knowledge Base as a graph whose vertices are the various subjects and objects (entities) and the predicates are the edges between these entities. Each edge has a type that describes the kind of the relation between the connected entities.
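As a hedged illustration of this graph view, triples can be loaded into an adjacency structure keyed by entity. The second triple and the helper names `build_graph` and `neighbors` are assumptions added for the example:

```python
from collections import defaultdict

# Subject-predicate-object triples. The first is the example from the text;
# the second is a hypothetical extra fact added for illustration.
triples = [
    ("Albert Einstein", "was born in", "Ulm, Germany"),
    ("Ulm, Germany", "is located in", "Germany"),
]


def build_graph(triples):
    """Map each subject entity to its outgoing (relation type, object entity) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def neighbors(graph, entity):
    """Entities reachable from `entity` in one hop, ignoring the relation type."""
    return [obj for _, obj in graph.get(entity, [])]


graph = build_graph(triples)
print(graph["Albert Einstein"])
print(neighbors(graph, "Ulm, Germany"))
```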

Text Corpus

A text corpus $D$ is a set of documents $\{d_1, \dots, d_{|D|}\}$ where each document is a sequence of words $d_i = (w_1, \dots, w_{|d_i|})$. Specifically, in the context of this paper, a document is essentially a sentence, and an article is a collection of documents.

NOTE: It has a similar structure to the knowledge-base from RAG or FiD.

Entity Linking

We assume that there is a set $L$ of links $(v, d_p)$ connecting entity $v$ with a word at position $p$ in document $d$.

Graph Convolutional Network (GCN)

GCNs are great for classification of nodes in a graph-structured knowledge base. Here's how a GCN works for an input graph:

  1. For each node, collect the embeddings of all its neighbors
  2. Average these embeddings into one embedding
  3. Use that embedding as input to a neural network layer (matrix multiplication + non-linearity)
  4. Produce an output embedding for each node
  5. Repeat for the next layer.

NOTE: The more layers the GCN has, the more multi-hop reasoning the model will be able to perform, because it will gather information from more distant neighbors.
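The steps above can be sketched in plain Python. This is a simplified sketch under stated assumptions: every node has at least one neighbor, no self-loops or degree normalization are used, and the weight matrix is fixed rather than learned. The function name `gcn_layer` and the toy graph are hypothetical.

```python
def gcn_layer(adjacency, embeddings, weight):
    """One simplified GCN layer: average neighbor embeddings, then linear map + ReLU.

    adjacency:  {node: [neighbor, ...]} (each node needs at least one neighbor)
    embeddings: {node: [float, ...]} with input dimension d_in
    weight:     d_in x d_out matrix given as a list of rows
    """
    d_in, d_out = len(weight), len(weight[0])
    output = {}
    for node, nbrs in adjacency.items():
        # steps 1-2: collect and average the embeddings of all neighbors
        avg = [sum(embeddings[n][j] for n in nbrs) / len(nbrs) for j in range(d_in)]
        # steps 3-4: matrix multiplication followed by a non-linearity (ReLU)
        output[node] = [
            max(0.0, sum(avg[j] * weight[j][k] for j in range(d_in)))
            for k in range(d_out)
        ]
    return output


# Toy path graph a - b - c, where only node "a" carries a signal initially.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
h0 = {"a": [1.0], "b": [0.0], "c": [0.0]}
identity = [[1.0]]

h1 = gcn_layer(adjacency, h0, identity)  # step 5: repeat for the next layer
h2 = gcn_layer(adjacency, h1, identity)
print(h1["c"], h2["c"])
```

After one layer, node "c" has seen nothing from "a" (two hops away); after two layers, the signal has reached it, which matches the multi-hop remark in the note.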

Relational GCN

One problem arises when the knowledge-base graph is heterogeneous (more than one type of relation between entities). In that case, we want to take into consideration the type of relation that a node has with its neighbors before we average the embeddings. A relational GCN is similar to a regular GCN, but it uses a separate matrix for each type of relation. Therefore, when using a relational GCN, we aggregate the embeddings from all neighbors with a specific relation and we pass the averaged embedding into a separate neural network layer for each relation.
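A minimal sketch of one relational GCN layer follows, assuming that the per-relation results are summed before the non-linearity and that messages flow from source to target. The function name `rgcn_layer`, the relation names, and all numbers are hypothetical.

```python
def rgcn_layer(edges, embeddings, weights):
    """One simplified relational GCN layer.

    edges:      list of (source, relation, target); messages flow source -> target
    embeddings: {node: [float, ...]}
    weights:    {relation: d_in x d_out matrix as a list of rows}
    """
    d_out = len(next(iter(weights.values()))[0])
    output = {}
    for node in embeddings:
        total = [0.0] * d_out
        for relation, weight in weights.items():
            # neighbors connected to `node` through this specific relation
            nbrs = [s for s, r, t in edges if r == relation and t == node]
            if not nbrs:
                continue
            d_in = len(weight)
            avg = [sum(embeddings[n][j] for n in nbrs) / len(nbrs) for j in range(d_in)]
            # a separate weight matrix is applied per relation type
            for k in range(d_out):
                total[k] += sum(avg[j] * weight[j][k] for j in range(d_in))
        output[node] = [max(0.0, value) for value in total]  # ReLU
    return output


edges = [("einstein", "born_in", "ulm"), ("germany", "contains", "ulm")]
embeddings = {"einstein": [1.0], "germany": [1.0], "ulm": [0.0]}
weights = {"born_in": [[2.0]], "contains": [[0.5]]}

out = rgcn_layer(edges, embeddings, weights)
print(out["ulm"])
```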

Lucene

Apache Lucene is an open-source Java library for full-text indexing and search over a large corpus of text.

Overview

Question Subgraph Retrieval

The retrieval of the question subgraph $G_q$ happens in two parallel pipelines:

  1. Knowledge Base Retrieval
  2. Text Retrieval

Knowledge Base Retrieval

During the knowledge base retrieval, we retrieve a subgraph of the triplestore knowledge base as follows:

  1. First, given the question $q$, we retrieve a set of seed entities $S_q$ that are relevant to the question.
  2. Then, we run the Personalized PageRank (PPR) method (Haveliwala, 2002) around these seeds to identify other entities which might be an answer to the question. During PPR, we assign weights to edges around the seed entities. Each edge weight is essentially the cosine similarity between:
    • the question vector, $v(q)$: average of all word vectors in the question
    • the relation vector, $v(r)$: average of all word vectors in the relation corresponding to that edge
  3. In the end, we retain the top $E$ entities $v_1, \dots, v_E$ by PPR score, along with any edges between them, and add them to the question subgraph $G_q$.
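The three steps above can be sketched with a small power-iteration version of Personalized PageRank. This is an illustrative sketch, not the paper's code: the 2-dimensional vectors, the entity names, the damping factor of 0.8, and the function names are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def personalized_pagerank(edges, seeds, damping=0.8, iterations=50):
    """edges: {node: {neighbor: weight}}; the random walk restarts at the seed entities."""
    nodes = set(edges) | {n for nbrs in edges.values() for n in nbrs}
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        new_scores = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node, nbrs in edges.items():
            total = sum(nbrs.values())
            if total == 0:
                continue
            for nbr, weight in nbrs.items():
                # mass flows along each edge in proportion to its weight
                new_scores[nbr] += damping * scores[node] * weight / total
        scores = new_scores
    return scores


# Hypothetical averaged word vectors for the question and for two relations.
question_vec = [1.0, 0.0]
relation_vecs = {"born_in": [0.9, 0.1], "plays": [0.1, 0.9]}

# Edge weights are cosine similarities between question and relation vectors.
edges = {
    "einstein": {
        "ulm": cosine(question_vec, relation_vecs["born_in"]),
        "violin": cosine(question_vec, relation_vecs["plays"]),
    }
}

scores = personalized_pagerank(edges, seeds={"einstein"})
top_entities = sorted(scores, key=scores.get, reverse=True)[:2]  # top E = 2
print(top_entities)
```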

Text Retrieval

During the text retrieval phase, we retrieve documents (sentences) relevant to the question from the Wikipedia database. The text retrieval phase entails the following steps:

  1. First, we retrieve the top 5 most relevant Wikipedia articles. An article is a collection of documents (sentences). For that task, we use the weighted bag-of-words model from DrQA.
  2. Then, we populate a Lucene index with sentences from these articles, and retrieve the top-ranking ones $d_1, \dots, d_D$.

The Final Question Graph

The final question graph $G_q$ consists of:

NOTE: Because the vertices of the graph can be either entities or documents, the graph is considered heterogeneous.

Overview of Graft-Net

Graft-Net consists of the following stages:

  1. The Question Subgraph Retrieval stage. This is a characteristic of early fusion: the process of combining information from the knowledge base and text early in the model, i.e., before the graph neural network is used.
  2. The answer selection stage, where they use a GCN variant (1, 2, 3) to do binary classification (answer, not-answer) on the nodes of the subgraph.

PullNet

PullNet uses the text corpus to supplement information extracted from the Triplestore in order to answer multi-hop questions. The subjects and objects in the triples contain links to relevant documents in the text corpus. PullNet uses these links to produce more factually grounded answers.

Like GRAFT-Net, PullNet has an initial phase where it retrieves a question subgraph $G_q$. However, PullNet learns how to construct the subgraph, rather than using an ad-hoc subgraph-building strategy. More specifically, PullNet relies on a small set of retrieval operations, each of which expands a graph node by retrieving new information from the knowledge base or the corpus. PullNet learns when and where to apply these “pull” operations with another graph CNN classifier. The “pull” classifier is weakly supervised, using question-answer pairs.

The end result is a learned iterative process for subgraph construction, which begins with a small subgraph containing only the question text and the entities which it contains, and gradually expands the subgraph to contain information from the knowledge base and corpus that is likely to be useful. The process is especially effective for multi-hop questions.

synctext commented 10 months ago

Note the mission of the lab is new fundamental theory, with practical grounding (re-invent The Web, Web3). This means we are not interested in new machine learning theory. It is a tool which failed us in 2005, and now finally might become production usable in 2028. We have now several phd and msc students active on Machine learning:

"Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback", very fascinating literature. All very detailed stuff and high-performance. Totally unsuitable for decentralised context with 1-2 billion connected smartphones with 8 cores each on average = 8-16 billion embedded CPU cores :astonished: The Web3 context will take 8-10 years to mature: your thesis can be that cardinal starting point. Show that it can be done and scale infinitely. No data center. Beyond Federated Learning to gossip learning with trust. Augmentation of knowledge by trustworthy users is probably the first-principle operand. For your thesis: decide how much distributed systems stuff is in there. Continuous LLM augmentation with a trust function (e.g. keeping up with Wikipedia edits on news stories) or the first fully self-organising LLM with self-evolution. Avoid competition with mainstream Big Tech labs, be Web3?

brainstorm

For achieving superhuman intelligence we need to invent a paradigm for storing all human knowledge and making it accessible to artificial reasoning engines or language models. @kandrio original thought: LLMs are simply too huge to work with practically. If we are able to split the facts and the language model part we enable further growth. The mixing of knowledge and language is sub-optimal. We only need a new model of intelligence to fix this :smile: Bridge the semantic gap. Another old problem known for decades is the problem of ambiguity and synonyms when adding new facts. Just adding a fact also implies embedding it and adding metadata. Establishing global consensus on The Internet on facts is notoriously hard. We failed to solve digital democracy on fact writing. Crowdsourcing LLM augmentation is unsolved. Metadata pollution will severely cripple your system performance, see the detailed overlapping issue of Is Justin Bieber Gay?. Currently the humans working at OpenAI decide on 4Chan/Reddit filtering versus unfiltered inclusion into their LLM. These OpenAI developers can also decide to feed live events into their LLM using an unfiltered Twitter feed: real-time event awareness.

Taxonomy of LLM augmentation. Explosion of a new topic which is only 3-4 years old. Lots of papers which build upon each other. Earliest paper is 2019! Title could be: LLM augmentation: a survey on this explosion. Superior to a taxonomy table is a "tree of knowledge evolution". More sensational survey title or grander scope: gathering all human knowledge for augmenting LLM with facts: a survey

update "LLM @ Android" Already very challenging and very sufficient for a TUDelft master thesis. Can you do minimal TFLite finetuning with size of LLM? "On-device LLM finetuning"

kandrio commented 10 months ago

Most Important Papers

Knowledge-Base Augmented Generation

Unstructured Knowledge Bases (text corpus)

Structured Knowledge Bases (graphs)

Search-Engine Augmented Generation

kandrio commented 10 months ago

Atlas (next generation of RAG): Few-shot Learning with Retrieval Augmented Language Models (2022)

Atlas is essentially the next generation of RAG, for few-shot learning tasks.

When performing a task, from question answering to generating Wikipedia articles, Atlas starts by retrieving the top-k relevant documents from a large corpus of text with the retriever. Then, these documents are fed to the language model, along with the query, which in turn generates the output. Both the retriever and the language model are based on pre-trained transformer networks.

Atlas consists of:

Retriever

Like RAG, it entails a query encoder $\mathrm{BERT}_q$ and a document encoder $\mathrm{BERT}_d$. Unlike RAG, during fine-tuning of the retriever, Atlas trains both $\mathrm{BERT}_q$ and $\mathrm{BERT}_d$ (not only $\mathrm{BERT}_q$). Hence, the $\mathrm{BERT}_d$ embeddings of all documents in the index need to be regularly updated so that they stay in sync with the updated $\mathrm{BERT}_d$. This is a computationally expensive task.

IMPORTANT: Atlas proposes jointly pre-training both the retriever and the generator model (similar to REALM) unlike RAG which uses pre-trained models and trains end-to-end only during fine-tuning.

kandrio commented 9 months ago

REALM: Retrieval-Augmented Language Model Pre-Training (2020)

REALM is the first method to jointly pre-train the retriever and the generator. It uses an architecture that we've seen before (in RAG, FiD), but proposes a pre-training technique that yields great models.

Components

Just like RAG, we have two main components:

In REALM, all of the above models are trained during pre-training.

Initialization

At the beginning of training, if the retriever does not have good embeddings for $\mathrm{Embed}_{\mathrm{input}}(x)$ and $\mathrm{Embed}_{\mathrm{doc}}(z)$, the retrieved documents $z$ will likely be unrelated to $x$. This causes the generator to learn to ignore the retrieved documents. Once this occurs, the retriever does not receive a meaningful gradient and cannot improve, creating a vicious cycle.

To avoid this cold-start problem, the authors warm-start the retriever ($\mathrm{Embed}_{\mathrm{input}}$ + $\mathrm{Embed}_{\mathrm{doc}}$) using a simple training objective known as the Inverse Cloze Task (ICT) where, given a sentence, the model is trained to retrieve the document where that sentence came from.
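To make the Inverse Cloze Task concrete, here is a hedged sketch of how one training pair could be built from a document. The function name `inverse_cloze_example` and the sample sentences are hypothetical, and removing the chosen sentence from its context is an assumption of this sketch.

```python
import random


def inverse_cloze_example(document_sentences, rng):
    """Build one (query, context) training pair for the Inverse Cloze Task.

    One sentence is picked as the pseudo-query; the remaining sentences form
    the positive context that the retriever should learn to retrieve for it.
    """
    index = rng.randrange(len(document_sentences))
    query = document_sentences[index]
    context = document_sentences[:index] + document_sentences[index + 1:]
    return query, " ".join(context)


# Hypothetical three-sentence document.
document = [
    "Zebras have four gaits.",
    "They are generally slower than horses.",
    "When chased, a zebra will zig-zag from side to side.",
]

query, context = inverse_cloze_example(document, random.Random(0))
print("query:  ", query)
print("context:", context)
```

During warm-starting, the retriever would be trained to score this context above the contexts built from other documents in the same batch.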

For the generator, the authors warm-start it with BERT pre-training. Specifically, they use the uncased BERT-base model (12 layers, 768 hidden units, 12 attention heads).

Pre-training

The unsupervised pre-training method that REALM proposes goes as follows:

  1. We randomly select sentences from the text corpus and mask specific tokens from each one.
  2. REALM receives as input a masked query, q. An example would be: "The [MASK] at the top of the pyramid".
  3. REALM outputs its token prediction (correct answer is "pyramidion").
  4. We backpropagate through the parameters $\theta$ of the retriever $p_\theta(z|x)$ and $\phi$ of the generator $p_\phi(y|x,z)$.

Computational Challenges

During pre-training, both $\mathrm{Embed}_{\mathrm{doc}}$ and $\mathrm{Embed}_{\mathrm{input}}$ are trained. Because $\mathrm{Embed}_{\mathrm{doc}}$ is updated during pre-training, after each backpropagation step we would need to:

  1. re-compute the document embeddings
  2. re-calculate the document index (in order to perform MIPS)

This is a computationally expensive task, especially for huge databases, such as Wikipedia which they used in this paper. So, the authors designed REALM such that the embedding updates happen only every several hundred backpropagation steps, as an asynchronous process.
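The trade-off can be sketched with a toy training loop that refreshes the index only every N steps and tracks how stale the index gets. This is a single-threaded simplification, since the refresh described above runs asynchronously. The class `ToyTrainer`, the function `pretrain`, and the values of `num_steps` and `refresh_every` are hypothetical.

```python
class ToyTrainer:
    """Tracks how far the document index lags behind the document encoder."""

    def __init__(self):
        self.encoder_version = 0  # bumped on every gradient step
        self.index_version = 0    # encoder version the index was built with
        self.max_staleness = 0

    def train_step(self):
        # retrieval in this step uses an index that may be out of date
        staleness = self.encoder_version - self.index_version
        self.max_staleness = max(self.max_staleness, staleness)
        self.encoder_version += 1

    def rebuild_index(self):
        # re-compute all document embeddings and re-build the MIPS index
        self.index_version = self.encoder_version


def pretrain(trainer, num_steps, refresh_every):
    """Train every step, but rebuild the index only every `refresh_every` steps."""
    refreshes = 0
    for step in range(1, num_steps + 1):
        trainer.train_step()
        if step % refresh_every == 0:
            trainer.rebuild_index()
            refreshes += 1
    return refreshes


trainer = ToyTrainer()
refreshes = pretrain(trainer, num_steps=250, refresh_every=100)
print(refreshes, trainer.max_staleness)
```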

Fine-tuning

The supervised fine-tuning method that the authors used in order to evaluate REALM on Open-domain Question Answering (Open-QA) goes as follows:

  1. We collect Q-A tuples, such as: ("What's the angle of an equilateral triangle", "60 degrees").
  2. REALM receives Q as input.
  3. REALM outputs its prediction.
  4. Like in pre-training, we backpropagate through the parameters $\theta$ of the retriever $p_\theta(z|x)$ and $\phi$ of the generator $p_\phi(y|x,z)$, but this time we leave $\mathrm{Embed}_{\mathrm{doc}}$ untouched. Therefore, fine-tuning is much less computationally expensive.

kandrio commented 9 months ago

RETRO: Improving Language Models by Retrieving from Trillions of Tokens (2022)

This paper's breakthrough is that it managed to pre-train and augment a relatively small LLM (25× fewer parameters than GPT-3) with a database that is 2 trillion tokens large (1000× larger than similar retrieval-augmented LLMs).

One main difficulty with augmenting LLMs with external knowledge-bases is that training the retriever component can be computationally expensive, because while the document encoder becomes better, we need to re-compute the embeddings for each passage in the database. In this paper, they used a pre-trained document encoder, so they calculate the document embeddings once and they do not update them again. Therefore, the main bottleneck that they're facing when accessing the external database is to find the K nearest documents to the input query.

One main difference with related work is that in RETRO they don't retrieve single sentences, but chunks (a retrieved chunk of text along with its continuation in the original document). I don't yet understand if that helps.

Overview

Here's an overview of how RETRO produces an answer to an input query, q:

  1. It splits the input query into chunks (4 tokens each in the paper's simplified illustration; the actual model uses 64-token chunks).
  2. For each chunk $c_q$ of $q$, RETRO:
    a. calculates its embedding
    b. finds the 2 nearest neighbors in its knowledge base
    c. encodes $c_q$ through the encoder
    d. encodes the 2 nearest neighbors through the encoder
    e. interleaves the encodings of the nearest neighbors with the query chunk embeddings to perform cross-attention.

NOTE: Neighbours of the first chunk only affect the last token of the first chunk and tokens from the second chunk.

RETRO manages to perform attention with complexity that is linear in the number of retrieved passages.
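The chunking and neighbor-retrieval steps above can be sketched as follows. This is only an illustration: the Jaccard overlap of token sets stands in for the frozen embedding model, and the function names, the query, and the three-entry database are hypothetical.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of `chunk_size` tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def jaccard(a, b):
    """Toy similarity: overlap of token sets (a stand-in for embedding similarity)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def retrieve_neighbors(chunk, database, k):
    """Exhaustive k-nearest-neighbor search over the database entries."""
    ranked = sorted(database, key=lambda entry: jaccard(chunk, entry), reverse=True)
    return ranked[:k]


query = "the eiffel tower is located in paris france".split()
database = [
    "the eiffel tower in paris".split(),
    "paris is in france".split(),
    "zebras live in africa".split(),
]

chunks = split_into_chunks(query, chunk_size=4)
neighbors = [retrieve_neighbors(chunk, database, k=2) for chunk in chunks]
for chunk, found in zip(chunks, neighbors):
    print(chunk, "->", found)
```

Each chunk gets its own neighbors, which is what lets the model attend to different retrieved text at different positions of the input.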

kandrio commented 9 months ago

LaMDA: Language Models for Dialog Applications (2022)

In this paper by Google, the authors manage to augment a language generation model with what they call a Toolset (TS).

The Toolset (TS)

The Toolset consists of:

  1. a calculator
  2. a translator
  3. an information retrieval system

The Toolset takes a single string as input and outputs a list of one or more strings. Each tool in TS expects a string and returns a list of strings. For example, the information retrieval system can take “How old is Rafael Nadal?”, and output [“Rafael Nadal / Age / 35”].

The information retrieval system is also capable of returning snippets of content from the open web, with their corresponding URLs. The TS tries an input string on all of its tools, and produces a final output list of strings by concatenating the output lists from every tool in the following order: calculator, translator, and information retrieval system. A tool will return an empty list of results if it can’t parse the input (e.g., the calculator cannot parse “How old is Rafael Nadal?”), and therefore does not contribute to the final output list.
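The behaviour described above (each tool returns a list of strings, unparsable input yields an empty list, and outputs are concatenated in a fixed order) can be sketched with three toy tools. The tool implementations, the `translate:` prefix, and the lookup tables are hypothetical stand-ins, not LaMDA's actual tools.

```python
def calculator(text):
    """Evaluate a simple 'a <op> b' expression; return [] if the input cannot be parsed."""
    parts = text.split()
    if len(parts) != 3 or parts[1] not in {"+", "-", "*"}:
        return []
    try:
        a, b = float(parts[0]), float(parts[2])
    except ValueError:
        return []
    result = {"+": a + b, "-": a - b, "*": a * b}[parts[1]]
    return [str(result)]


def translator(text):
    """Hypothetical translator: only handles inputs of the form 'translate: <word>'."""
    dictionary = {"hola": "hello"}
    prefix = "translate: "
    if not text.startswith(prefix):
        return []
    word = text[len(prefix):]
    return [dictionary[word]] if word in dictionary else []


def information_retrieval(text):
    """Hypothetical fact store keyed by the exact question string."""
    facts = {"How old is Rafael Nadal?": ["Rafael Nadal / Age / 35"]}
    return facts.get(text, [])


def toolset(text):
    """Try the input on every tool and concatenate the outputs in a fixed order."""
    outputs = []
    for tool in (calculator, translator, information_retrieval):
        outputs.extend(tool(text))
    return outputs


print(toolset("How old is Rafael Nadal?"))
print(toolset("2 + 3"))
```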

NOTE: Little information is given on how the information retrieval system works, apart from the fact that it entails a database, but also can provide web snippets along with their URLs.

The Architecture

LaMDA consists of two main sub-models:

  1. LaMDA-Base: A regular generative model that is pre-trained on a large dataset. LaMDA-Base is the first model to receive a query from the user. It then generates a response that is checked and refined by LaMDA-Research.
  2. LaMDA-Research: A generative model that usually receives the output of LaMDA-Base as input and is fine-tuned to choose the recipient of its output (the Toolset or the User). In general, LaMDA-Research queries the Toolset in a loop, until it has sufficient information to generate a final response to the user.
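The interaction between the two sub-models can be sketched as a loop in which LaMDA-Research either queries the Toolset or addresses the user. The three `toy_*` functions, the function `respond`, and the `max_rounds` fallback are hypothetical; in the real system both models are fine-tuned language models rather than hand-written rules.

```python
def respond(user_query, base_model, research_model, toolset, max_rounds=4):
    """LaMDA-Base drafts a reply; LaMDA-Research refines it by querying the toolset."""
    draft = base_model(user_query)
    context = [("User", user_query), ("Base", draft)]
    for _ in range(max_rounds):
        recipient, text = research_model(context)
        if recipient == "User":
            # Research addressed the user: this is the final response
            return text
        # Research addressed the toolset (TS): record every returned string
        for result in toolset(text):
            context.append(("TS", result))
    return draft  # assumed fallback if no final answer was produced


def toy_base(query):
    return "Rafael Nadal is 30 years old."  # ungrounded, possibly wrong draft


def toy_toolset(query):
    return ["Rafael Nadal / Age / 35"] if "Nadal" in query else []


def toy_research(context):
    tool_results = [text for speaker, text in context if speaker == "TS"]
    if not tool_results:
        return "TS", "How old is Rafael Nadal?"
    age = tool_results[-1].split(" / ")[-1]
    return "User", f"Rafael Nadal is {age} years old."


answer = respond("How old is Rafael Nadal?", toy_base, toy_research, toy_toolset)
print(answer)
```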

kandrio commented 9 months ago

Internet-Augmented Dialogue Generation (2021)

Their method consists of two components:

We can train each of these modules separately if we have supervised data available for both tasks, the first module requiring (context, search query) pairs, and the second module requiring (context, response) pairs.

The search engine is a black box in this system, and could potentially be swapped out for any method. In IADG, they use the Bing Search API for their experiments to generate a list of URLs for each query. Then, they use these URLs as keys to find their page content.

kandrio commented 9 months ago

SeeKeR: Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion (2022)

One model to do both retrieval and generation (wow)

kandrio commented 9 months ago

Draft

@synctext Here is the first complete draft of my literature survey:

Here's a snippet of my taxonomy table:

[Image: taxonomy table]

What do you think?

kandrio commented 9 months ago

Code Implementation

I recently dived into the implementation details of Retrieval-Augmented Generation (RAG), one of the most influential papers that I had to review for my Literature Survey (see this comment for a comprehensive review). RAG focuses on knowledge-intensive NLP tasks, as opposed to the dialogue-intensive tasks that a number of recent papers focus on.

The authors of RAG have open-sourced a specific version of their work, RAG-token, as part of the transformers Python library by Hugging Face.

I was able to access that model, and write an example script where I employed RAG to answer a simple question: "Who holds the record in 100m freestyle?"

Here is my script:


from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# a tokenizer receives an input text and breaks it into a list of tokens
# this way, it's easier for the model to understand the input query
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")

# initialize a pre-trained RAG Retriever which has access to a "dummy" subset of Wikipedia
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq",
    index_name="exact",
    use_dummy_dataset=True)

# initialize the RAG-token model that will generate the final answer to our query
# the generator of RAG-token will receive the retrieved evidence by the retriever
# along with the input question and it will produce an answer
model = RagTokenForGeneration.from_pretrained(
    "facebook/rag-token-nq",
    retriever=retriever)

# define our question, and tokenize it. Correct answer should be "michael phelps"
# (calling the tokenizer directly replaces the deprecated prepare_seq2seq_batch)
input_dict = tokenizer(
    "who holds the record in 100m freestyle",
    return_tensors="pt")

# pass the question as input to RAG-token
generated = model.generate(input_ids=input_dict["input_ids"])

# print the answer
print(tokenizer.batch_decode(generated, skip_special_tokens=True))

Here is a screenshot that shows what RAG replied to my question (take a look at the bottom):

CC @synctext

synctext commented 9 months ago

WOW :clap: Impressive work. Only very minor comments:

kandrio commented 9 months ago

Updated version of the paper after @synctext's useful comments: