[x] SFT
(Def:
Supervised fine-tuning refers to the process of taking a pre-trained machine learning model and adapting it to a specific task using a labeled dataset. The pre-trained model, typically a deep learning model, has already been trained on a large dataset and has learned to capture important features and patterns in the data. Fine-tuning involves training the model on a smaller, task-specific dataset with labeled examples, which helps the model learn to make accurate predictions for the specific task.
The steps involved in supervised fine-tuning typically include:
Selecting a pre-trained model: Choose a pre-trained model that has been trained on a large dataset and is relevant to the task at hand. Popular choices for pre-trained models include BERT, GPT, and ResNet, depending on the task (natural language processing, image recognition, etc.).
Preparing the dataset: Collect and preprocess a labeled dataset that is specific to the task. The dataset should include input data along with corresponding labels or target values.
Fine-tuning the model: Use the labeled dataset to train the pre-trained model. This involves updating the model's weights and biases using backpropagation, based on the errors between the model's predictions and the true labels in the dataset. The fine-tuning process can be done using a variety of optimization algorithms, such as stochastic gradient descent or Adam.
Evaluating the model: After fine-tuning, evaluate the model's performance on a separate validation or test dataset to ensure that it is making accurate predictions for the task.
Supervised fine-tuning is a powerful approach that can leverage the knowledge captured by pre-trained models to achieve high performance on a wide range of tasks with relatively small labeled datasets.)
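A minimal sketch of the fine-tuning loop in PyTorch, assuming a toy stand-in for the pre-trained backbone and synthetic labeled data (in practice the backbone would be a real checkpoint such as BERT or GPT):
```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained feature extractor (weights assumed already trained).
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
head = nn.Linear(64, 3)                      # new task-specific head (3 classes)
model = nn.Sequential(backbone, head)

# Synthetic labeled, task-specific dataset.
x = torch.randn(256, 32)
y = torch.randint(0, 3, (256,))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for i in range(0, len(x), 32):
        xb, yb = x[i:i + 32], y[i:i + 32]
        logits = model(xb)
        loss = loss_fn(logits, yb)           # error vs. the true labels
        optimizer.zero_grad()
        loss.backward()                      # backpropagation
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```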
[ ] RLHF
(Def:
Reinforcement Learning from Human Feedback (RLHF) is a method used to train machine learning models, particularly in the field of natural language processing. The method combines reinforcement learning with human feedback to improve model performance and generalization.
The general steps involved in RLHF are as follows:
Pre-training: The model is first pre-trained on a large dataset, often using unsupervised learning techniques.
Collecting Human Feedback: Human feedback is collected on the model's output. This feedback can take various forms, such as ranking different model outputs, providing corrections, or giving rewards based on the quality of the output.
Fine-tuning with Reinforcement Learning: The model is then fine-tuned using reinforcement learning techniques. The human feedback is used as a reward signal to guide the model's learning process. The model updates its parameters to maximize the cumulative reward.
Evaluation: The model is evaluated on various tasks to measure its performance and generalization.
RLHF has been successfully applied to various natural language processing tasks, such as text summarization, machine translation, and dialogue systems. By incorporating human feedback, the method helps to address some of the limitations of traditional supervised learning approaches and improve the model's ability to generate more relevant and coherent text.)
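One common detail of the RL fine-tuning step is that the reward-model score is combined with a KL-style penalty that keeps the policy close to the pre-trained reference model. A minimal sketch of that reward shaping, with synthetic placeholder values rather than a real implementation:
```python
import torch

def shaped_reward(rm_score, logp_policy, logp_reference, beta=0.1):
    """Reward used during RL fine-tuning: reward-model score minus a
    penalty for drifting away from the reference (pre-trained) model."""
    kl_penalty = logp_policy - logp_reference      # per-token log-ratio
    return rm_score - beta * kl_penalty.sum()

# Placeholder values standing in for one sampled response.
rm_score = torch.tensor(1.7)                       # reward model's scalar score
logp_policy = torch.tensor([-0.5, -1.2, -0.8])
logp_reference = torch.tensor([-0.6, -1.0, -0.9])
print(shaped_reward(rm_score, logp_policy, logp_reference))
```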
[ ] RLAIF
(Def:
Reinforcement Learning from AI Feedback (RLAIF) is a variant of RLHF in which the feedback used to train the agent comes from another AI system rather than from a human or the environment. This could involve using the output of one AI model as a reward signal for another AI model, or using the predictions of one AI model to guide the exploration of another AI model.
This concept could potentially be useful in scenarios where human feedback is expensive or difficult to obtain, and where AI models can be used to provide informative feedback to guide the learning process. However, further research and development would be needed to explore the potential benefits and challenges of this approach.)
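A minimal sketch of the idea of replacing the human labeler with an AI labeler; `ai_labeler` below is a hypothetical stand-in for a prompted call to an off-the-shelf LLM, not a real API:
```python
def ai_labeler(prompt: str, response_a: str, response_b: str) -> str:
    """Hypothetical AI preference labeler; returns 'A' or 'B'.
    In RLAIF this would be an LLM call; here it is faked with a length heuristic."""
    return "A" if len(response_a) >= len(response_b) else "B"

def collect_ai_preferences(examples):
    """Turn (prompt, response_a, response_b) triples into preference pairs
    that can train a reward model, exactly as human labels would in RLHF."""
    prefs = []
    for prompt, a, b in examples:
        label = ai_labeler(prompt, a, b)
        chosen, rejected = (a, b) if label == "A" else (b, a)
        prefs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return prefs

print(collect_ai_preferences([("Summarize:", "A short summary.", "ok")]))
```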
[ ] RM (reward model)
(Def:
In reinforcement learning, a reward model is a function that assigns a numerical reward to each state or state-action pair in an environment, based on how desirable or valuable that state or action is for achieving the agent's goal. The reward model is a critical component of the reinforcement learning framework, as it provides the feedback signal that guides the agent's learning process.
The reward model is typically defined by the designer of the reinforcement learning system and reflects the task that the agent is trying to solve. For example, in a game-playing agent, the reward model might assign a positive reward for winning the game, a negative reward for losing, and smaller rewards or penalties for other in-game events.
The goal of the agent in reinforcement learning is to learn a policy that maximizes its expected cumulative reward over time. This involves exploring the environment, taking actions, receiving rewards from the reward model, and updating its policy based on the feedback.
In addition to being used in traditional reinforcement learning, reward models can also be used in other contexts, such as inverse reinforcement learning, where the goal is to learn a reward model that explains observed behavior, or in preference-based reinforcement learning, where the reward model is learned from human feedback.)
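When the reward model is learned from preference data (as in RLHF/RLAIF), a common choice is a Bradley-Terry-style pairwise loss over chosen/rejected responses. A minimal sketch with placeholder scores (the scoring network itself is omitted):
```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Push the reward of the preferred response above that of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Placeholder scalar rewards for a batch of 4 preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0, -0.1])
r_rejected = torch.tensor([0.4, 0.5, 1.1, -0.7])
print(pairwise_rm_loss(r_chosen, r_rejected))
```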
[ ] Constitutional AI
(Def:
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as ‘Constitutional AI’.
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., ... & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
[ ] Self-revision
(Def:
"Self-revision" generally refers to the process of reviewing and improving one's own work. This concept can be applied to various contexts, such as writing, learning, and problem-solving.
In writing, self-revision involves revisiting a draft to identify and correct errors, improve clarity, and enhance the overall quality of the piece. This might include checking for grammar and spelling mistakes, ensuring that ideas are clearly and logically presented, and refining the language used.
In learning, self-revision can refer to the process of reviewing one's understanding and knowledge of a topic and identifying areas that need further study or clarification.
In problem-solving, self-revision might involve revisiting a solution to a problem to ensure that it is accurate, efficient, and effective.
In all these contexts, self-revision is an important skill that can help improve the quality of one's work and enhance learning and problem-solving abilities. It involves a willingness to critically evaluate one's own work, identify areas for improvement, and make the necessary changes to achieve better outcomes.)
[ ] Sequence-level objectives
(Def:
Sequence-level objectives are goals or tasks that are defined over entire sequences of data, as opposed to individual data points. In machine learning, and especially in natural language processing, sequence-level objectives are commonly used when working with data that has a sequential structure, such as text, audio, or video.
For example, in natural language processing, common sequence-level objectives include:
Machine Translation: The goal is to translate a sequence of words in one language into a sequence of words in another language.
Text Summarization: The goal is to produce a concise and informative summary of a long sequence of text.
Named Entity Recognition: The goal is to identify and label entities (e.g., names, locations, dates) in a sequence of text.
Sentiment Analysis: The goal is to determine the sentiment expressed in a sequence of text, such as positive, negative, or neutral.
In all these examples, the objective is defined over entire sequences of text, and the output of the model is also a sequence.
Sequence-level objectives are often more challenging to achieve than point-level objectives because they require the model to capture dependencies and relationships between different parts of the sequence. This often requires more complex model architectures, such as recurrent neural networks or transformers, and advanced training techniques, such as teacher forcing or scheduled sampling.)
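A minimal sketch of the contrast between a token-level loss and a sequence-level objective, using a REINFORCE-style loss in which a single sequence-level score (a toy exact-match metric, assumed here purely for illustration) weights the log-probability of the whole generated sequence:
```python
import torch

# Token-level objective: average cross-entropy over individual token predictions.
token_logprobs = torch.tensor([-0.2, -1.5, -0.7, -0.3], requires_grad=True)  # log p(token_t | prefix)
token_level_loss = -token_logprobs.mean()

# Sequence-level objective (REINFORCE-style): one score for the whole output
# sequence weights the log-probability of the entire sequence.
def toy_sequence_reward(generated: str, reference: str) -> float:
    return 1.0 if generated.strip() == reference.strip() else 0.0   # stand-in metric

reward = toy_sequence_reward("the cat sat", "the cat sat")
sequence_level_loss = -reward * token_logprobs.sum()

print(token_level_loss.item(), sequence_level_loss.item())
```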
[ ] head-to-head
(Def:
In a head-to-head competition, the entities are compared against each other to determine which one performs better or is more effective.)
[ ] off-the-shelf LLM
(Def:
"Off-the-shelf" refers to a product that is ready to use without any customization or modification.
In the context of Large Language Models (LLMs), an "off-the-shelf" LLM refers to a pre-trained language model that can be directly used for various natural language processing tasks without any further training or fine-tuning. These models have been trained on large datasets and have learned to generate coherent and contextually relevant text based on the input they receive.
Examples of off-the-shelf large language models include GPT-3 by OpenAI, BERT by Google, and T5 by Google. These models can be used for a variety of tasks such as text completion, sentiment analysis, text summarization, question-answering, and more.
It is worth noting that while off-the-shelf LLMs can be used for a wide range of tasks, they might not always perform as well as a model that has been specifically fine-tuned for a particular task. Fine-tuning a pre-trained model on a task-specific dataset can help the model to better understand the nuances of the task and improve its performance.)
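A minimal sketch of using an off-the-shelf model with no further training, assuming the Hugging Face transformers library is installed (a default pre-trained checkpoint is downloaded on first use):
```python
from transformers import pipeline

# Uses a default pre-trained checkpoint as-is, with no fine-tuning.
classifier = pipeline("sentiment-analysis")
print(classifier("Off-the-shelf models can be surprisingly capable."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```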
[ ] statistically significantly different
(Def:
When we say that two groups are "statistically significantly different," we mean that the observed difference between the groups is unlikely to have occurred by random chance, and that there is likely a true difference between the populations from which the groups were drawn.
Statistical significance is often tested using hypothesis testing. In a hypothesis test, we start with a null hypothesis that there is no difference between the groups, and an alternative hypothesis that there is a difference. We then calculate a p-value, which is the probability of observing the data (or something more extreme) if the null hypothesis is true. If the p-value is less than a predetermined significance level (commonly 0.05), we reject the null hypothesis and conclude that the difference between the groups is statistically significant.
It's important to note that "statistically significant" does not necessarily mean that the difference is large or practically important, only that it is unlikely to have occurred by chance. It's also possible for a difference to be statistically significant but not clinically or practically significant.)
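A minimal sketch of such a test with SciPy, assuming two independent samples of model scores (the numbers are synthetic):
```python
from scipy import stats

# Synthetic scores for two systems evaluated on independent samples.
scores_a = [0.71, 0.69, 0.74, 0.70, 0.72, 0.68, 0.73]
scores_b = [0.66, 0.64, 0.69, 0.65, 0.67, 0.63, 0.66]

result = stats.ttest_ind(scores_a, scores_b)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
# If p < 0.05 we would reject the null hypothesis of no difference.
```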
[ ] prompt engineering / prompting
(Def:
Prompt engineering, or prompting, refers to the process of designing input prompts to guide a machine learning model, particularly large language models like GPT-3, to generate desired outputs. By carefully crafting the prompt, users can influence the behavior of the model and improve its performance on specific tasks.
Prompt engineering can be seen as a form of "programming" for machine learning models, where instead of writing code to specify the desired behavior, users provide examples or instructions in the form of natural language prompts. This approach is particularly useful for few-shot or zero-shot learning, where the model is able to generalize from a small number of examples or even a single example to perform a specific task.
Examples of prompt engineering include:
Task-Specific Prompts: Providing a specific instruction or question that clearly defines the task, e.g., "Translate the following English text to French:"
Example-Based Prompts: Providing examples of the desired behavior, e.g., "Here are some examples of English text and their French translations:"
Contextual Prompts: Providing additional context or information that helps the model understand the task, e.g., "You are a helpful assistant that translates English text to French."
Prompt engineering can be a powerful tool for improving the performance of machine learning models, but it also requires careful consideration and experimentation to find the most effective prompts for a given task.)
[ ] prompt template
(Def:
A prompt template is a structured format for creating prompts used in machine learning, particularly with large language models like GPT-3. Prompt templates help guide the model to generate the desired output by providing clear and consistent instructions, examples, or context. They are especially useful for few-shot learning, where the model needs to understand the task based on a small number of examples.
Here's an example of a prompt template for a text summarization task:
Task: Summarize the following text in one sentence.
Text: [Insert text to be summarized here]
Summary:
In this template, "Task" provides clear instruction on what needs to be done, "Text" is a placeholder for the input text to be summarized, and "Summary" is where the model's output will be generated.
The user can then fill in the placeholder with the specific text they want to summarize, and the model will generate a summary in response to the prompt.
By using a prompt template, users can ensure that their instructions are clear and consistent, making it easier for the model to understand the task and generate the desired output.)
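A minimal sketch of the same template expressed as a plain Python format string (no prompting library assumed):
```python
SUMMARIZE_TEMPLATE = (
    "Task: Summarize the following text in one sentence.\n"
    "Text: {text}\n"
    "Summary:"
)

# Fill the placeholder with the specific input; the resulting string is sent to the model.
prompt = SUMMARIZE_TEMPLATE.format(
    text="Large language models are trained on vast corpora and can be adapted to many tasks."
)
print(prompt)
```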
[ ] chain-of-thought rationales
(Def:
Chain-of-thought rationales are a way of enhancing language models by providing step-by-step explanations or reasonings for the answers they generate. These rationales break down complex tasks into intermediate steps, making it easier for the model to arrive at the correct answer and for users to understand the model's thought process.
When using chain-of-thought rationales, users can prompt the model to provide a detailed explanation of how it arrived at its answer. For example:
Prompt:
What are the prime factors of 60?
Rationale:
First, we list the factors of 60: 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60.
Next, we identify which of these factors are prime: 2, 3, 5.
Therefore, the prime factors of 60 are 2, 3, and 5.
In this example, the model provides a step-by-step explanation of how it found the prime factors of 60, making it easier for the user to understand the answer and the process used to arrive at it.
Chain-of-thought rationales can be particularly useful for complex tasks that require multiple steps or considerations to solve, such as math problems, logic puzzles, or reasoning tasks. They can also enhance the transparency and interpretability of language models, making it easier for users to understand and trust their outputs.)
[x] 0-shot
(Def:
"Zero-shot" or "0-shot" learning refers to a machine learning scenario where a model is able to perform a task without having seen any examples of that specific task during training. In other words, the model can generalize to new tasks that were not part of its training data.
Zero-shot learning is commonly discussed in the context of natural language processing (NLP) and computer vision. For example, in NLP, a zero-shot learning model might be able to generate text in response to a prompt that it has never seen before, based on its understanding of language acquired during training. In computer vision, a zero-shot learning model might be able to recognize objects or scenes that it has never seen before, based on its understanding of visual features.
Zero-shot learning is often enabled by pre-training models on large, diverse datasets that capture a broad range of knowledge. The idea is that by learning from a wide variety of examples, the model can develop a rich understanding of the domain and be able to generalize to new tasks that are related to the training data, even if they are not exactly the same.)
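A minimal sketch of zero-shot use in practice, assuming the Hugging Face transformers library: the model assigns labels it was never explicitly trained on, supplied only at inference time.
```python
from transformers import pipeline

# The candidate labels are given at inference time; none were seen as training targets.
classifier = pipeline("zero-shot-classification")
result = classifier(
    "The new GPU cuts training time in half.",
    candidate_labels=["hardware", "cooking", "politics"],
)
print(result["labels"][0])   # e.g. "hardware"
```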
[x] few-shot example
(Def:
Few-shot learning refers to training a machine learning model on a very small dataset, which typically consists of only a few examples per class. The challenge is to develop a model that can generalize well to new, unseen data despite the limited amount of training data.
Here's an example of few-shot learning in image classification:
Imagine you have a dataset with images of different types of animals, and you want to train a model to classify these images into their respective animal categories. However, you only have a few images for each animal type. This is where few-shot learning comes into play.
In a few-shot learning setting, you might have the following:
5 examples of cats
5 examples of dogs
5 examples of birds
This would be referred to as a "5-shot" learning problem because you have 5 examples for each class.
To solve this problem, you can use a few-shot learning approach, such as:
Transfer Learning: Pre-train a model on a large dataset with many examples of different animal types, and then fine-tune the model on your small dataset with only a few examples per class.
Meta-Learning: Train a model in a way that it can quickly adapt to new tasks with only a few examples. This involves training the model on a variety of tasks with limited data, and then fine-tuning it on the specific task of interest.
Data Augmentation: Increase the size of your small dataset by applying various transformations (e.g., rotation, scaling, cropping) to the existing images to create new, augmented images.
By using these and other few-shot learning techniques, you can train a model that can generalize well to new, unseen examples of animal types, despite the limited amount of training data.)
[ ] Kullback-Leibler (KL) divergence loss
(Def:
Kullback-Leibler (KL) divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. It is used in various fields, including machine learning, statistics, and information theory.
In the context of machine learning, the KL divergence can be used as a loss function to measure the difference between the predicted probability distribution and the true distribution. This is often referred to as the KL divergence loss or KL loss. The goal of training a model with KL loss is to minimize the divergence between the predicted and true distributions, thus making the model's predictions as close as possible to the true distribution.
The KL divergence loss is commonly used in various machine learning tasks, including classification, regression, and generative models, where it can be used to measure the difference between the model's output and the ground truth data. It is also used in variational autoencoders (VAEs) as part of the loss function to regularize the learned latent space.)
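A minimal sketch of the discrete KL divergence, KL(P || Q) = Σ p(x) log(p(x) / q(x)), computed with NumPy on two toy distributions:
```python
import numpy as np

def kl_divergence(p, q) -> float:
    """KL(P || Q) for discrete distributions; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])    # "true" distribution
q = np.array([0.5, 0.3, 0.2])    # model's predicted distribution
print(kl_divergence(p, q))       # > 0; equals 0 only when p == q
```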
[x] self-consistency
(Def:
In the context of machine learning, self-consistency refers to the property that a model's predictions or outputs are consistent with each other. This can mean that the model's predictions are consistent across different inputs that should lead to the same output, or that the model's predictions are consistent with its internal representations and learned knowledge.
Self-consistency is an important property for many types of models, including language models, where it is desirable that the model produces consistent and coherent text output. For example, if a language model is asked to generate a paragraph of text, the sentences within the paragraph should be consistent with each other in terms of grammar, style, and content.
One way to achieve self-consistency in a model is to use techniques like regularization, which can help to enforce consistency between the model's predictions and its internal representations. Another approach is to use ensemble methods, where multiple models are combined to produce a single output, and any inconsistencies between the models can be resolved through voting or averaging.
Self-consistency is also related to the concept of robustness, which refers to a model's ability to produce consistent and accurate predictions in the presence of noise, perturbations, or adversarial examples. Models that are self-consistent and robust are generally more reliable and trustworthy.)
[ ] model distillation
(Def:
Model distillation, also known as knowledge distillation, is a technique used in machine learning to improve the efficiency of a model by transferring knowledge from a larger, more complex model (the "teacher" model) to a smaller, more efficient model (the "student" model). The goal of model distillation is to create a student model that performs similarly to the teacher model but with less computational overhead.
The process of model distillation involves the following steps:
Train a large and complex teacher model on a specific task, such as image classification, natural language processing, or speech recognition.
Use the teacher model to generate predictions or embeddings for a dataset.
Train a smaller student model to mimic the predictions or embeddings generated by the teacher model.
Fine-tune the student model on the original task.
The student model learns to replicate the behavior of the teacher model by training on the teacher's outputs rather than the ground-truth labels. This allows the student model to capture the knowledge and patterns learned by the teacher model, while being more efficient in terms of computation and memory requirements.
Model distillation can be particularly useful in scenarios where computational resources are limited, such as on mobile devices or embedded systems, where deploying a large and complex model may not be feasible. It can also be used to improve the performance of smaller models by leveraging the knowledge learned by larger, more accurate models.)
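A minimal sketch of one common distillation loss (an assumed recipe: temperature-scaled KL between teacher and student logits, mixed with the ordinary cross-entropy on ground-truth labels):
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft targets from the teacher (KL at temperature T) + hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy batch: 4 examples, 5 classes.
student = torch.randn(4, 5)
teacher = torch.randn(4, 5)
labels = torch.randint(0, 5, (4,))
print(distillation_loss(student, teacher, labels))
```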
[ ] Advantage Actor Critic (A2C)
(Def:
Advantage Actor-Critic (A2C) is a reinforcement learning algorithm that combines the strengths of actor-critic methods and advantage functions.
Here's how the A2C algorithm works:
Actor: The actor takes the current state of the environment as input and outputs an action to be taken. The actor is parameterized by a policy π(a | s; θ), where a is the action, s is the state, and θ are the parameters of the policy. The policy defines the probability of taking action a in state s.
Critic: The critic takes the current state of the environment as input and outputs an estimate of the value of that state, i.e., the expected cumulative future rewards. The critic is parameterized by a value function V(s;ω), where s is the state and ω are the parameters of the value function.
Advantage Function: The advantage function A(s, a) = Q(s, a) − V(s) is used to calculate the advantage of taking action a in state s, where Q(s, a) is the action-value function. It measures how much better (or worse) an action is compared to the average action in that state, and the policy is updated in the direction that increases the probability of actions with higher advantages.
Training: The actor's parameters are updated by gradient ascent on the policy-gradient objective (log-probabilities weighted by the advantage), while the critic's parameters are updated by gradient descent on the value-prediction error, so that together they maximize the expected cumulative future reward.
A2C is an on-policy algorithm, meaning it uses the current policy to generate samples for training. It is also a synchronous version of the Asynchronous Advantage Actor-Critic (A3C) algorithm, where multiple agents are used to explore the environment in parallel, and the gradients are accumulated before updating the parameters.
A2C has been used successfully in various applications, including robotics, video games, and autonomous control.)
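A minimal sketch of the A2C loss terms for one batch of transitions (returns, values, and log-probabilities are synthetic placeholders; the networks and environment interaction are omitted):
```python
import torch
import torch.nn.functional as F

# Placeholder quantities for a batch of 4 transitions.
log_probs = torch.tensor([-0.8, -1.1, -0.5, -0.9], requires_grad=True)  # log pi(a|s)
values = torch.tensor([1.0, 0.5, 1.5, 0.2], requires_grad=True)          # V(s)
returns = torch.tensor([1.4, 0.3, 2.0, 0.1])                             # observed returns

advantages = returns - values                              # A(s, a) estimate
actor_loss = -(log_probs * advantages.detach()).mean()     # policy gradient (ascent on return)
critic_loss = F.mse_loss(values, returns)                  # value regression
loss = actor_loss + 0.5 * critic_loss                      # an entropy bonus is often added too
loss.backward()
```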
[ ] Proximal Policy Optimization (PPO)
(Def:
Proximal Policy Optimization (PPO) is a popular reinforcement learning algorithm introduced by OpenAI in 2017. It is designed to improve the stability and sample efficiency of training compared to other algorithms such as Trust Region Policy Optimization (TRPO).
PPO aims to optimize the policy while ensuring that the new policy is not too far from the old policy. This is important to prevent the policy from changing too drastically, which can lead to instability in training.
The key idea behind PPO is a clipped surrogate objective: the probability ratio between the new and old policies is clipped to a small interval around 1, which penalizes updates that move the policy too far from the old policy.
Maximizing this surrogate objective encourages the new policy to increase cumulative reward while staying close to the old policy; the clipping keeps the probability ratio from deviating too far from 1, which prevents destructively large policy updates.
PPO has been successful in a wide range of reinforcement learning tasks, including robotics, video games, and control problems. It is known for its simplicity, ease of implementation, and good performance across different tasks.)
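A minimal sketch of the clipped surrogate objective (probability ratios and advantages are synthetic placeholders):
```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate: keep the probability ratio within [1 - eps, 1 + eps]."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()   # maximizing the objective = minimizing its negative

# Placeholder batch of 4 actions.
logp_new = torch.tensor([-0.7, -1.0, -0.4, -1.3], requires_grad=True)
logp_old = torch.tensor([-0.8, -1.1, -0.5, -1.2])
advantages = torch.tensor([0.6, -0.3, 1.1, 0.2])
print(ppo_clip_loss(logp_new, logp_old, advantages))
```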
[x] AI Labeler Alignment
(Def:
AI labeler alignment refers to the process of ensuring that the labels generated by an AI system are aligned with the intended task or objective.
Labeling is a crucial step in supervised learning, where a model is trained on a dataset consisting of input-output pairs. The "labels" are the output or ground-truth values that the model aims to predict. For example, in a classification task, the labels would be the classes to which the input data points belong.
Alignment in this context would mean ensuring that the labels generated by the AI system are accurate and consistent with the task at hand. This can involve several steps, such as:
Data Cleaning: Ensuring that the dataset is free of errors, inconsistencies, and missing values that could affect the quality of the labels.
Data Preprocessing: Transforming the data into a format that is suitable for the model, such as normalizing numerical values or encoding categorical variables.
Label Verification: Checking that the labels are correct and consistent with the task. This can involve manual verification by human annotators or automated checks using other data sources.
Model Evaluation: Assessing the performance of the model using metrics that are relevant to the task, such as accuracy, precision, recall, or F1 score.
Ensuring labeler alignment is important for the success of an AI system, as accurate and consistent labels are essential for training a model that generalizes well to new, unseen data.)
[ ] Pairwise Accuracy
(Def:
Pairwise accuracy is a performance metric used in ranking or ordinal regression tasks, where the goal is to predict the correct order of items rather than their absolute values. It is often used in information retrieval, recommendation systems, and other applications where the relative ordering of items is more important than their precise scores.
Pairwise accuracy is calculated as the proportion of pairs of items that are correctly ordered by the model. For example, consider a ranking task where the goal is to order a list of items from best to worst. If the model correctly orders 8 out of 10 possible pairs of items, then the pairwise accuracy would be 0.8 or 80%.
The formula for pairwise accuracy is:
Pairwise Accuracy = (Number of Correctly Ordered Pairs) / (Total Number of Pairs)
Pairwise accuracy is a useful metric for ranking tasks because it directly measures the model's ability to order items correctly. It is less sensitive to the absolute values predicted by the model and focuses more on the relative ordering, which is often the most important aspect in ranking tasks.)
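A minimal sketch computing pairwise accuracy from a model's scores on labeled preference pairs (the scores are synthetic):
```python
def pairwise_accuracy(pairs):
    """pairs: iterable of (score_preferred, score_rejected) tuples."""
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)

# Synthetic scores: the model ranks the preferred item higher in 3 of 4 pairs.
pairs = [(2.1, 0.5), (0.9, 1.4), (1.7, 1.0), (0.3, -0.2)]
print(pairwise_accuracy(pairs))   # 0.75
```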
[x] Win Rate
(Def:
Win rate is a performance metric commonly used in competitive games and other situations where two or more entities compete against each other, and the outcome is a win, loss, or draw. Win rate is calculated as the ratio of the number of wins to the total number of games or matches played. The formula for win rate is:
Win Rate = (Number of Wins)/(Total Number of Games)
Win rate is often expressed as a percentage, where a win rate of 0.5 or 50% indicates that the entity has won half of the games played.
In the context of machine learning models, win rate can be used as a performance metric in scenarios where the model is competing against other models or baselines. For example, in a game-playing AI, the win rate would be the proportion of games that the AI wins against its opponents. In this case, a higher win rate would indicate that the AI is more successful at the game.
Win rate can also be used in A/B testing and other experimental settings, where different versions of a product or feature are compared against each other to determine which one performs better. In this case, the win rate would be the proportion of times that the new version outperforms the old version.)
[ ] MDP
(Def:
A Markov Decision Process (MDP) is a mathematical model used to describe a decision-making problem in a fully observable environment. MDPs are widely used in reinforcement learning and decision theory to represent problems where an agent takes actions in an environment to maximize some notion of cumulative reward.
An MDP is defined by the following components:
States (S): A set of possible states that the environment can be in.
Actions (A): A set of possible actions that the agent can take.
Transition Probabilities (P): A function that specifies the probability of reaching a particular state after taking a particular action in a particular state, i.e., P(s'∣s,a) where s and s' are the current and next states, respectively, and a is the action taken.
Rewards (R): A function that specifies the reward received by the agent after taking a particular action in a particular state, i.e., R(s,a).
Discount Factor (γ): A value between 0 and 1 that represents the agent's preference for immediate rewards over future rewards. A discount factor of 1 means the agent values future rewards as much as immediate rewards, while a discount factor of 0 means the agent only cares about immediate rewards.
The objective of the agent in an MDP is to find a policy π that specifies the action to take in each state to maximize the expected cumulative reward. The policy can be deterministic, where it specifies a single action to take in each state, or stochastic, where it specifies a probability distribution over actions in each state. The optimal policy is the one that maximizes the expected cumulative reward over time.
Solving an MDP involves finding the optimal policy that maximizes the expected cumulative reward. There are several algorithms for solving MDPs, including value iteration, policy iteration, and Q-learning.)
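A minimal sketch of value iteration on a tiny, hand-specified two-state MDP (all transition probabilities and rewards are assumptions for illustration):
```python
import numpy as np

# Tiny MDP: 2 states, 2 actions. P[s, a, s'] are transition probabilities, R[s, a] rewards.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],    # transitions from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.0, 1.0]],    # transitions from state 1 under actions 0 and 1
])
R = np.array([
    [1.0, 0.0],                  # rewards R(s=0, a)
    [0.0, 2.0],                  # rewards R(s=1, a)
])
gamma = 0.9                      # discount factor

V = np.zeros(2)
for _ in range(100):             # value iteration
    Q = R + gamma * P @ V        # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    V = Q.max(axis=1)            # Bellman optimality backup

policy = Q.argmax(axis=1)        # greedy policy w.r.t. the converged values
print("V*:", V, "policy:", policy)
```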