First, Alpaca is based on LLaMA, which has a non-commercial license.
Second, the instruction data is based on OpenAI’s text-davinci-003, whose terms of use prohibit developing models that compete with OpenAI.
OpenAI terms of use
closedAI
# Training recipe
## Data Generation Process (a simplified version of self-instruct)
We built on the data generation pipeline from self-instruct and made the following modifications:
- We used text-davinci-003 to generate the instruction data instead of davinci.
- We wrote a new prompt (prompt.txt) that explicitly gave the requirement of instruction generation to text-davinci-003. Note: there is a slight error in the prompt we used, and future users should incorporate the edit in issue #24.
- We adopted much more aggressive batch decoding, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
- We simplified the data generation pipeline by discarding the difference between classification and non-classification instructions.
  - In other words, the step that distinguishes classification tasks from non-classification tasks was removed.
- We only generated a single instance for each instruction, instead of 2 to 3 instances as in self-instruct.

Generating the 52K unique instructions cost about $500.
We then fine-tuned the LLaMA models using Hugging Face’s training framework, taking advantage of techniques such as Fully Sharded Data Parallel (FSDP) and mixed precision training.
For our initial run, fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers.
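
As a rough sanity check on the cost figures quoted above, the arithmetic can be sketched as follows. The variable names are only illustrative, and the per-GPU-hour figure is an upper bound implied by "less than $100", not a price reported by the authors.

```python
# Back-of-the-envelope arithmetic using only the numbers quoted in the notes.
DATA_COST_USD = 500          # approximate data generation cost
NUM_INSTRUCTIONS = 52_000    # 52K unique instructions
NUM_GPUS = 8                 # 8 x 80GB A100
TRAIN_HOURS = 3              # fine-tuning time for the 7B model
TRAIN_COST_CAP_USD = 100     # "less than $100"

cost_per_instruction = DATA_COST_USD / NUM_INSTRUCTIONS
gpu_hours = NUM_GPUS * TRAIN_HOURS
max_price_per_gpu_hour = TRAIN_COST_CAP_USD / gpu_hours

print(f"data cost per instruction: ${cost_per_instruction:.4f}")
print(f"GPU-hours for fine-tuning: {gpu_hours}")
print(f"implied price ceiling: ${max_price_per_gpu_hour:.2f} per GPU-hour")
```

So each generated instruction costs about one cent, and the fine-tuning budget corresponds to 24 A100 GPU-hours.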
# Preliminary evaluation
- The authors conducted a human evaluation (by the 5 student authors) on the inputs from the [self-instruct evaluation set](https://github.com/yizhongw/self-instruct/blob/main/human_eval/user_oriented_instructions.jsonl).
- self-instruct provides a dataset of 251 items for human evaluation (a diverse list of user-oriented instructions including email writing, social media, and productivity tools).
- The evaluation was a blind test, and the results were similar (Alpaca : davinci-003 = 90 : 89).
  - Blind pairwise comparison between text-davinci-003 and Alpaca 7B.
  - Alpaca wins 90 versus 89 comparisons against text-davinci-003.
- The authors argue that Alpaca's outputs are shorter than ChatGPT's because text-davinci-003 itself tends to produce short outputs.
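
To put the 90 vs. 89 result in perspective, here is a minimal sketch that computes the win rate and an exact two-sided binomial test against a 50/50 null hypothesis. The significance test is an illustration added here, not something the authors report, and it assumes that only the 179 counted comparisons matter.

```python
from math import comb

alpaca_wins, davinci_wins = 90, 89
total = alpaca_wins + davinci_wins

alpaca_win_rate = alpaca_wins / total


def binom_pmf(k: int, n: int) -> float:
    """Probability of exactly k successes in n fair coin flips."""
    return comb(n, k) / 2**n


# Exact two-sided test: sum the probabilities of all outcomes that are
# at most as likely as the observed one under the 50/50 null hypothesis.
observed = binom_pmf(alpaca_wins, total)
p_value = min(
    1.0,
    sum(
        binom_pmf(k, total)
        for k in range(total + 1)
        if binom_pmf(k, total) <= observed * (1 + 1e-9)
    ),
)

print(f"Alpaca win rate: {alpaca_win_rate:.1%}")
print(f"two-sided p-value: {p_value:.3f}")
```

A win rate this close to 50% is consistent with the two models being indistinguishable on this evaluation set, which matches the "similar performance" reading above.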
![image](https://user-images.githubusercontent.com/7252598/225484685-43db7d87-2c39-4e3c-b978-f7bdc7d75782.png)
# Known limitations
- Hallucination, toxicity, and stereotypes.
# Assets released
- Demo: An [interactive demo](https://crfm.stanford.edu/alpaca/) for everyone to try out Alpaca.
- Data: [52K demonstrations](https://github.com/tatsu-lab/stanford_alpaca#data-release) used to fine-tune Alpaca.
- Data generation process: the code for [generating the data](https://github.com/tatsu-lab/stanford_alpaca#data-generation-process).
- Hyperparameters: for [fine-tuning](https://github.com/tatsu-lab/stanford_alpaca#fine-tuning) the model using the Hugging Face API.
## Example of the released data
![image](https://user-images.githubusercontent.com/7252598/225485522-f6c9a03e-2eac-4473-ba66-9d61b22217ff.png)
- To be released soon
- Model weights: We have reached out to Meta to obtain guidance on releasing the Alpaca model weights, both for the 7B Alpaca and for fine-tuned versions of the larger LLaMA models.
- Training code: Our code uses the [Hugging Face interface to LLaMA](https://github.com/huggingface/transformers/pull/21955). As of now, the effort to support LLaMA is still ongoing and not stable. We will give the exact training commands once Hugging Face supports LLaMA officially.
- [The IGNORE_INDEX part below is important!](https://github.com/tatsu-lab/stanford_alpaca/blob/61a3b4324505d284200a35dcbf1cc5e438ff2b46/train.py#L133)
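```python
import copy
from typing import Dict, Sequence

import transformers

IGNORE_INDEX = -100  # label value that the loss function skips


def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    # Concatenate each prompt (source) with its response (target).
    examples = [s + t for s, t in zip(sources, targets)]
    # _tokenize_fn is defined elsewhere in train.py.
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    # Mask the prompt tokens so that only the response contributes to the loss.
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)
```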
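
To make the effect of `IGNORE_INDEX` concrete, here is a minimal pure-Python sketch of the same masking idea. It assumes `IGNORE_INDEX` is -100 (the value PyTorch's cross-entropy loss ignores by default); the helper `mask_prompt_labels` and the token ids are hypothetical and not part of the Alpaca code. Plain lists are used instead of tensors, so the slice is assigned a list rather than a scalar.

```python
IGNORE_INDEX = -100  # assumption: matches the default ignore_index of PyTorch's cross-entropy loss


def mask_prompt_labels(input_ids, source_len):
    """Copy input_ids and replace the first source_len labels with IGNORE_INDEX."""
    labels = list(input_ids)
    n = min(source_len, len(labels))  # never mask past the end of the sequence
    labels[:n] = [IGNORE_INDEX] * n
    return labels


# Hypothetical token ids: 4 prompt tokens followed by 3 response tokens.
input_ids = [101, 7, 8, 9, 42, 43, 2]
labels = mask_prompt_labels(input_ids, source_len=4)

# Only the response tokens remain as training targets.
supervised = [tok for tok in labels if tok != IGNORE_INDEX]

print(labels)      # [-100, -100, -100, -100, 42, 43, 2]
print(supervised)  # [42, 43, 2]
```

The model still sees the full prompt as input, but the loss is computed only on the response tokens, so it is trained to answer instructions rather than to reproduce them.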
# Future directions
- Evaluation: We need to evaluate Alpaca more rigorously. We will start with HELM (Holistic Evaluation of Language Models).
- Safety: We would like to further study the risks of Alpaca and improve its safety using methods such as automatic red teaming, auditing, and adaptive testing.
- Understanding: We hope to better understand how capabilities arise from the training recipe. What properties of a base model do you need? What happens when you scale up? What properties of instruction data are needed? What are alternatives to using self-instruct on text-davinci-003?
# Acknowledgements
We would also like to highlight that there are many other open efforts for instruction-following LLMs and chat models, including OpenChatKit, Open Assistant, and Carper AI.
```python
import json
import os
import random

import tqdm
from rouge_score import rouge_scorer

import utils  # helper module from the stanford_alpaca repo


def generate_instruction_following_data(
    output_dir="./",
    seed_tasks_path="./seed_tasks.jsonl",
    num_instructions_to_generate=100,
    model_name="text-davinci-003",
    num_prompt_instructions=3,
    request_batch_size=5,
    temperature=1.0,
    top_p=1.0,
    num_cpus=16,
):
    seed_tasks = [json.loads(l) for l in open(seed_tasks_path, "r")]
    seed_instruction_data = [
        {"instruction": t["instruction"], "input": t["instances"][0]["input"], "output": t["instances"][0]["output"]}
        for t in seed_tasks
    ]
    print(f"Loaded {len(seed_instruction_data)} human-written seed instructions")

    os.makedirs(output_dir, exist_ok=True)
    request_idx = 0
    # load the LM-generated instructions
    machine_instruction_data = []
    if os.path.exists(os.path.join(output_dir, "regen.json")):
        machine_instruction_data = utils.jload(os.path.join(output_dir, "regen.json"))
        print(f"Loaded {len(machine_instruction_data)} machine-generated instructions")

    # similarities = {}
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

    # now let's generate new instructions!
    progress_bar = tqdm.tqdm(total=num_instructions_to_generate)
    if machine_instruction_data:
        progress_bar.update(len(machine_instruction_data))

    # first we tokenize all the seed instructions and generated machine instructions
    all_instructions = [d["instruction"] for d in seed_instruction_data] + [
        d["instruction"] for d in machine_instruction_data
    ]
    all_instruction_tokens = [scorer._tokenizer.tokenize(inst) for inst in all_instructions]

    while len(machine_instruction_data) < num_instructions_to_generate:
        request_idx += 1

        batch_inputs = []
        for _ in range(request_batch_size):
            # only sampling from the seed tasks
            prompt_instructions = random.sample(seed_instruction_data, num_prompt_instructions)
            prompt = encode_prompt(prompt_instructions)  # encode_prompt is defined elsewhere in the repo
            batch_inputs.append(prompt)
        decoding_args = utils.OpenAIDecodingArguments(
            temperature=temperature,
            n=1,
            max_tokens=3072,  # hard-code to maximize the length. the requests will be automatically adjusted
            top_p=top_p,
            stop=["\n20", "20.", "20."],
        )
        # ... (the rest of the loop is omitted in this excerpt)
```
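
The excerpt above builds a `rougeL` scorer and tokenizes every existing instruction, which suggests that newly generated instructions are compared against the existing pool for similarity. The filtering step itself is not shown in the excerpt, so the following is only a sketch under stated assumptions: it reimplements a simplified ROUGE-L F-measure in pure Python (with a simpler tokenizer than the `rouge_score` package), and the helper names and the 0.7 threshold are assumptions made for illustration.

```python
import re


def tokenize(text):
    """Simplified tokenizer: lowercase alphanumeric runs (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(candidate, reference):
    """ROUGE-L F-measure: harmonic mean of LCS-based precision and recall."""
    c, r = tokenize(candidate), tokenize(reference)
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def is_novel(candidate, existing, threshold=0.7):
    """Keep a candidate only if it is not too similar to any existing instruction."""
    return all(rouge_l_f(candidate, inst) <= threshold for inst in existing)


existing = [
    "Translate the sentence into French.",
    "Summarize the following article.",
]
near_duplicate = "Translate the sentence into German."
fresh = "Write a haiku about autumn."

print(is_novel(near_duplicate, existing))  # False
print(is_novel(fresh, existing))           # True
```

A filter of this kind keeps the generated set diverse: near-paraphrases of existing instructions are discarded, and only sufficiently different ones are added to the pool.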
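You are asked to come up with a set of 20 diverse task instructions. These task instructions will be given to a GPT model, and we will evaluate the GPT model on completing the instructions.
Here are the requirements:
Try not to repeat the verb for each instruction, to maximize diversity.
The language used for the instructions should also be diverse. For example, you should combine questions with imperative instructions.
The types of instructions should be diverse. The list should include diverse types of tasks such as open-ended generation, classification, editing, etc.
A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual, photo, image, or audio output. As another example, do not ask the assistant to wake you up at 5pm or set a reminder, because it cannot perform any action.
The instructions should be written in Korean.
The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
You should generate an appropriate input for the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging, but ideally should not exceed 100 words.
Not all instructions require input. For example, when an instruction asks about general information, such as "What is the highest peak in the world?", it is not necessary to provide a specific context. In this case, simply put "" in the input field.
The output should be an appropriate response to the instruction and the input. Make sure the output is less than 100 words.
List of 20 task instructions: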
- Korean results
![alpaca-self-gen-korean](https://user-images.githubusercontent.com/7252598/225496927-2df4614a-8e35-4032-b65a-54e898ca61e6.gif)
- English results
![alpaca-gen-self-instruct-en](https://user-images.githubusercontent.com/7252598/225496953-69f90efb-a20e-4147-bf4a-df623fc33e83.gif)
# Prompt
The seed tasks are inserted into the requirement template; by default, 3 seed tasks are used (`num_prompt_instructions=3`).
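
As an illustration of this step, here is a minimal sketch of how sampled seed tasks could be appended to the requirement template. The function `build_prompt`, the numbered `Instruction`/`Input`/`Output` layout, and the toy seed tasks are all assumptions made for this example; the repo's own `encode_prompt` may format things differently.

```python
import random


def build_prompt(template, seed_tasks, num_prompt_instructions=3, rng=None):
    """Append randomly sampled seed tasks to the requirement template (hypothetical layout)."""
    rng = rng or random.Random()
    chosen = rng.sample(seed_tasks, num_prompt_instructions)
    parts = [template.rstrip("\n"), ""]
    for idx, task in enumerate(chosen, start=1):
        parts.append(f"{idx}. Instruction: {task['instruction']}")
        parts.append(f"{idx}. Input: {task['input']}")
        parts.append(f"{idx}. Output: {task['output']}")
    # Leave the next item open so the model continues the numbered list.
    parts.append(f"{len(chosen) + 1}. Instruction:")
    return "\n".join(parts)


# Toy seed tasks, invented for this example.
seed_tasks = [
    {"instruction": "Add the two numbers.", "input": "2 and 3", "output": "5"},
    {"instruction": "Name the capital of France.", "input": "", "output": "Paris"},
    {"instruction": "Reverse the word.", "input": "cat", "output": "tac"},
    {"instruction": "Give an antonym.", "input": "hot", "output": "cold"},
]

prompt = build_prompt("List of 20 task instructions:", seed_tasks, rng=random.Random(0))
print(prompt)
```

With 3 in-context examples the prompt ends at item 4, and the model is expected to continue the list from there up to item 20.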