# Note

Presentation material:
- Alpaca_js.pdf

# Author

Rohan Taori* and Ishaan Gulrajani* and Tianyi Zhang* and Yann Dubois* and Xuechen Li* and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto
- Stanford

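# Summary

- Alpaca, a fairly capable instruction-tuned model, was built by combining Meta's LLaMA with Self-Instruct.
- Built for under $600:
  - 52K instructions -> about $500
  - 3 hours on 8 80GB A100s -> under $100
- Fine-tuned with Hugging Face tooling using FSDP.
- Released publicly even though safety and other issues remain.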

# Abstract

We introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations.
In a preliminary evaluation of single-turn instruction following, Alpaca behaves qualitatively similarly to OpenAI’s text-davinci-003,
while being surprisingly small and easy/cheap to reproduce (<$600).

# Overview

- Many instruction-following models have appeared, including GPT-3.5, ChatGPT, Claude, and Bing Chat.
- Many problems still need to be solved, but research is difficult for academia because models such as OpenAI’s text-davinci-003 are closed-source.
- The authors fine-tuned LLaMA 7B to build Alpaca:
  - 52K instruction-following demonstrations generated in the style of self-instruct using text-davinci-003.
  - Behaves similarly to text-davinci-003.
  - An interactive demo was also released.
- Alpaca is intended only for academic research; commercial use is prohibited:
  - First, Alpaca is based on LLaMA, which has a non-commercial license.
  - Second, the instruction data is based on OpenAI’s text-davinci-003, whose terms of use prohibit developing models that compete with OpenAI.

OpenAI terms of use	closedAI

# Training recipe

## Data Generation Process (a simplified version of self-instruct)

We built on the data generation pipeline from self-instruct and made the following modifications:
- We used text-davinci-003 to generate the instruction data instead of davinci.
- We wrote a new prompt (prompt.txt) that explicitly gave the requirement of instruction generation to text-davinci-003. Note: there is a slight error in the prompt we used, and future users should incorporate the edit in issue #24.
  - In other words, the prompt itself spells out the requirements for generation.
- We adopted much more aggressive batch decoding, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
- We simplified the data generation pipeline by discarding the difference between classification and non-classification instructions.
  - In other words, the classification / non-classification split was removed.
- We only generated a single instance for each instruction, instead of 2 to 3 instances as in self-instruct.
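
Because batch decoding asks the model to continue a numbered list, a single completion has to be split back into individual tasks. A minimal sketch of that parsing, assuming the `###` separators and the `N. Instruction:` / `N. Input:` / `N. Output:` field labels used by the prompt format in these notes; `parse_batched_completion` and the sample text are hypothetical, not the repository's code:

```python
import re

def parse_batched_completion(first_idx: int, completion: str) -> list[dict]:
    """Split one completion containing several numbered tasks into records.

    The prompt ends with "{first_idx}. Instruction:", so the completion starts
    in the middle of that task; the prefix is re-attached before splitting.
    """
    text = f"{first_idx}. Instruction:" + completion
    records = []
    for block in text.split("###"):
        block = block.strip()
        if not block:
            continue
        # Splitting on the numbered field labels yields
        # ["", instruction, input, output] for a well-formed block.
        parts = re.split(r"\d+\.\s*(?:Instruction|Input|Output):", block)
        if len(parts) != 4:
            continue  # skip truncated or malformed tasks
        instruction, inp, output = (p.strip() for p in parts[1:])
        # The no-input marker is mapped back to an empty string.
        inp = "" if inp.lower() == "<noinput>" else inp
        records.append({"instruction": instruction, "input": inp, "output": output})
    return records

# Hypothetical completion for a prompt that ended with "4. Instruction:".
sample = (
    " Name the capital of France.\n4. Input:\n<noinput>\n4. Output:\nParis\n"
    "###\n5. Instruction: Translate the word into Spanish.\n5. Input:\ncat\n5. Output:\ngato\n"
    "###\n6. Instruction: Summarize the"  # truncated task, dropped by the parser
)
tasks = parse_batched_completion(4, sample)
print(tasks)
```

Dropping blocks that do not contain all three fields is one simple way to handle a completion that was cut off by the token limit or a stop sequence.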
Generating the 52K unique instructions cost about $500.

We then fine-tuned the LLaMA models using Hugging Face’s training framework, taking advantage of techniques like Fully Sharded Data Parallel (FSDP) and mixed precision training.

For our initial run, fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers.
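
As a rough sanity check on the cost figures above, the quoted numbers imply the following GPU-hour budget. The only inputs are the figures in these notes; the per-GPU-hour rate is derived from them and is not a quoted price:

```python
# Figures quoted in the notes.
num_gpus = 8
hours = 3
compute_budget_usd = 100   # "less than $100"
data_budget_usd = 500      # data generation with text-davinci-003

gpu_hours = num_gpus * hours                    # total A100-hours for the run
implied_rate = compute_budget_usd / gpu_hours   # upper bound in USD per A100-hour
total_budget_usd = compute_budget_usd + data_budget_usd

print(f"{gpu_hours} GPU-hours, under ${implied_rate:.2f} per GPU-hour")
print(f"total under ${total_budget_usd}")
```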

Uses Python 3.10.

```bash
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
    --tf32 True
```
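
The flags above determine the effective batch size and the approximate length of the run. A back-of-the-envelope sketch, assuming exactly 52,000 training examples (the notes say 52K) and a simple linear-warmup-then-cosine schedule; `lr_at` is a hypothetical helper written for illustration, not the Hugging Face scheduler itself:

```python
import math

# Values taken from the command above.
num_gpus = 4              # --nproc_per_node
per_device_batch = 4      # --per_device_train_batch_size
grad_accum = 8            # --gradient_accumulation_steps
epochs = 3                # --num_train_epochs
peak_lr = 2e-5            # --learning_rate
warmup_ratio = 0.03       # --warmup_ratio
num_examples = 52_000     # assumption: "52K" taken as exactly 52,000

effective_batch = num_gpus * per_device_batch * grad_accum
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def lr_at(step: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

Under this estimate the run is on the order of 1,200 optimizer steps, so `--save_steps 2000` would not be reached before training ends.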


# Preliminary evaluation
- The authors conducted a human evaluation (by the 5 student authors) on the inputs from the [self-instruct evaluation set](https://github.com/yizhongw/self-instruct/blob/main/human_eval/user_oriented_instructions.jsonl).
  - self-instruct provides 252 instructions for human evaluation (a diverse list of user-oriented instructions including email writing, social media, and productivity tools).
- Evaluated with a blind test; the results were close (Alpaca : text-davinci-003 = 90 : 89)
  - blind pairwise comparison between text-davinci-003 and Alpaca 7B
  - Alpaca wins 90 versus 89 comparisons against text-davinci-003.
- The authors attribute Alpaca's outputs being shorter than ChatGPT's to text-davinci-003 itself producing shorter outputs.
![image](https://user-images.githubusercontent.com/7252598/225484685-43db7d87-2c39-4e3c-b978-f7bdc7d75782.png)
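
To put the 90 versus 89 result in perspective, the two counts can be turned into a win rate and a simple two-sided sign test. This is an illustrative calculation on the reported counts only, not an analysis the authors describe:

```python
from math import comb

alpaca_wins = 90
davinci_wins = 89
n = alpaca_wins + davinci_wins   # comparisons with a reported winner

win_rate = alpaca_wins / n

def binom_cdf(k: int, n: int) -> float:
    """P(X <= k) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, i) for i in range(k + 1)) / 2 ** n

# Two-sided sign test under the null hypothesis that both models are equally preferred.
lower_tail = binom_cdf(alpaca_wins, n)
upper_tail = 1.0 - binom_cdf(alpaca_wins - 1, n)
p_value = min(1.0, 2.0 * min(lower_tail, upper_tail))

print(f"win rate = {win_rate:.3f}, p = {p_value:.3f}")
```

A p-value this close to 1 is consistent with the reading in these notes that the two models performed similarly on this set.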

# Known limitations
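- Alpaca exhibits several common deficiencies of language models, including hallucination, toxicity, and stereotypes. Hallucination in particular seems to be a common failure mode for Alpaca, even compared to text-davinci-003.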

# Assets released
- Demo: An [interactive demo](https://crfm.stanford.edu/alpaca/) for everyone to try out Alpaca.
- Data: [52K demonstrations](https://github.com/tatsu-lab/stanford_alpaca#data-release) used to fine-tune Alpaca.
- Data generation process: the code for [generating the data](https://github.com/tatsu-lab/stanford_alpaca#data-generation-process).
- Hyperparameters: for [fine-tuning](https://github.com/tatsu-lab/stanford_alpaca#fine-tuning) the model using the Hugging Face API.

## Example of the released data
![image](https://user-images.githubusercontent.com/7252598/225485522-f6c9a03e-2eac-4473-ba66-9d61b22217ff.png)
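
Each released record has `instruction`, `input`, and `output` fields. The sketch below shows how such a record could be rendered into a training prompt. The template wording is modeled on the one in the repository's `train.py` and should be treated as an assumption here; `format_example` and the sample record are hypothetical:

```python
# Assumed templates: one for records with an input, one for records without.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def format_example(record: dict) -> tuple[str, str]:
    """Return (source, target): the prompt text and the expected completion."""
    template = PROMPT_WITH_INPUT if record.get("input", "") != "" else PROMPT_NO_INPUT
    source = template.format_map(record)
    target = record["output"]
    return source, target

record = {  # hypothetical sample record
    "instruction": "Translate the word into Spanish.",
    "input": "cat",
    "output": "gato",
}
source, target = format_example(record)
print(source + target)
```

Keeping the prompt (`source`) and the completion (`target`) separate matters for the loss masking discussed with the training code.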

- To be released soon
  - Model weights: We have reached out to Meta to obtain guidance on releasing the Alpaca model weights, both for the 7B Alpaca and for fine-tuned versions of the larger LLaMA models.
  - Training code: our code uses the [Hugging Face interface to LLaMA](https://github.com/huggingface/transformers/pull/21955). As of now, the effort to support LLaMA is still ongoing and not stable. We will give the exact training commands once Hugging Face supports LLaMA officially.
  - [The IGNORE_INDEX part below is important!](https://github.com/tatsu-lab/stanford_alpaca/blob/61a3b4324505d284200a35dcbf1cc5e438ff2b46/train.py#L133)
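```python
import copy
from typing import Dict, Sequence

import transformers

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips


def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    # _tokenize_fn is defined elsewhere in train.py; it returns token ids and their lengths.
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    # Mask the prompt (source) tokens so that only the response tokens contribute to the loss.
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)
```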
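
The effect of the `IGNORE_INDEX` masking is that only response tokens contribute to the loss. A toy illustration in plain Python, assuming the value -100 (the default index ignored by PyTorch's cross-entropy loss) and a whitespace tokenizer standing in for the real one; `toy_tokenize` and `mask_prompt` are hypothetical:

```python
import copy

IGNORE_INDEX = -100  # assumption: matches the default ignore_index of PyTorch's CrossEntropyLoss

def toy_tokenize(text: str, vocab: dict) -> list[int]:
    """Whitespace tokenizer that assigns ids on first sight (stand-in for the LLaMA tokenizer)."""
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

def mask_prompt(source: str, target: str, vocab: dict) -> dict:
    """Tokenize source + target and hide the source positions from the loss."""
    input_ids = toy_tokenize(source + " " + target, vocab)
    source_len = len(toy_tokenize(source, vocab))
    labels = copy.deepcopy(input_ids)
    labels[:source_len] = [IGNORE_INDEX] * source_len
    return {"input_ids": input_ids, "labels": labels}

vocab: dict = {}
example = mask_prompt("### Instruction: Say hi ### Response:", "Hi there !", vocab)
supervised = [tok for tok in example["labels"] if tok != IGNORE_INDEX]
print(example["labels"])
print(len(supervised), "of", len(example["input_ids"]), "positions are supervised")
```

Without the mask, the model would also be trained to reproduce the instruction text itself rather than only the response.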

# Future directions

- Evaluation: We need to evaluate Alpaca more rigorously. We will start with HELM (Holistic Evaluation of Language Models).
- Safety: We would like to further study the risks of Alpaca and improve its safety using methods such as automatic red teaming, auditing, and adaptive testing.
- Understanding: We hope to better understand how capabilities arise from the training recipe. What properties of a base model do you need? What happens when you scale up? What properties of instruction data are needed? What are alternatives to using self-instruct on text-davinci-003?

# Acknowledgements

We would also like to highlight that there are many other open efforts for instruction-following LLMs and chat models, including OpenChatKit, Open Assistant, and Carper AI.

# Prompt

Seed tasks are inserted into the requirements template; by default, 3 seed tasks are used per prompt.


```python
import json
import os
import random
import re

import tqdm
from rouge_score import rouge_scorer

import utils  # helper module from the stanford_alpaca repository


def encode_prompt(prompt_instructions):
    """Encode multiple prompt instructions into a single string."""
    prompt = open("./prompt.txt").read() + "\n"

    for idx, task_dict in enumerate(prompt_instructions):
        (instruction, input, output) = task_dict["instruction"], task_dict["input"], task_dict["output"]
        instruction = re.sub(r"\s+", " ", instruction).strip().rstrip(":")
        input = "<noinput>" if input.lower() == "" else input
        prompt += f"###\n"
        prompt += f"{idx + 1}. Instruction: {instruction}\n"
        prompt += f"{idx + 1}. Input:\n{input}\n"
        prompt += f"{idx + 1}. Output:\n{output}\n"
    # Leave the next numbered slot open so the model continues the list.
    prompt += f"###\n"
    prompt += f"{idx + 2}. Instruction:"
    return prompt


def generate_instruction_following_data(
    output_dir="./",
    seed_tasks_path="./seed_tasks.jsonl",
    num_instructions_to_generate=100,
    model_name="text-davinci-003",
    num_prompt_instructions=3,
    request_batch_size=5,
    temperature=1.0,
    top_p=1.0,
    num_cpus=16,
):
    seed_tasks = [json.loads(l) for l in open(seed_tasks_path, "r")]
    seed_instruction_data = [
        {"instruction": t["instruction"], "input": t["instances"][0]["input"], "output": t["instances"][0]["output"]}
        for t in seed_tasks
    ]
    print(f"Loaded {len(seed_instruction_data)} human-written seed instructions")

    os.makedirs(output_dir, exist_ok=True)
    request_idx = 0
    # load the LM-generated instructions
    machine_instruction_data = []
    if os.path.exists(os.path.join(output_dir, "regen.json")):
        machine_instruction_data = utils.jload(os.path.join(output_dir, "regen.json"))
        print(f"Loaded {len(machine_instruction_data)} machine-generated instructions")

    # similarities = {}
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

    # now let's generate new instructions!
    progress_bar = tqdm.tqdm(total=num_instructions_to_generate)
    if machine_instruction_data:
        progress_bar.update(len(machine_instruction_data))

    # first we tokenize all the seed instructions and generated machine instructions
    all_instructions = [d["instruction"] for d in seed_instruction_data] + [
        d["instruction"] for d in machine_instruction_data
    ]
    all_instruction_tokens = [scorer._tokenizer.tokenize(inst) for inst in all_instructions]

    while len(machine_instruction_data) < num_instructions_to_generate:
        request_idx += 1

        batch_inputs = []
        for _ in range(request_batch_size):
            # only sampling from the seed tasks
            prompt_instructions = random.sample(seed_instruction_data, num_prompt_instructions)
            prompt = encode_prompt(prompt_instructions)
            batch_inputs.append(prompt)
        decoding_args = utils.OpenAIDecodingArguments(
            temperature=temperature,
            n=1,
            max_tokens=3072,  # hard-code to maximize the length. the requests will be automatically adjusted
            top_p=top_p,
            stop=["\n20", "20.", "20."],
        )
        # The excerpt in these notes ends here; the request and filtering steps are not shown.
```
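
The `rouge_scorer` set up in the code above is there to measure how similar a newly generated instruction is to the existing ones, so that near-duplicates can be dropped. The excerpt stops before that step, so the following is a dependency-free sketch of the idea rather than the repository's code: `rouge_l_f1` reimplements a ROUGE-L F-measure over whitespace tokens, and the 0.7 threshold is an assumption:

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure over lowercased whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def is_novel(candidate: str, existing: list[str], threshold: float = 0.7) -> bool:
    """Keep a candidate only if it is not too similar to any existing instruction."""
    return all(rouge_l_f1(candidate, inst) <= threshold for inst in existing)

pool = ["Translate the sentence into French.", "List three prime numbers."]
near_duplicate = is_novel("Translate the sentence into German.", pool)
fresh = is_novel("Write a haiku about autumn.", pool)
print(near_duplicate, fresh)
```

A filter like this trades some recall for diversity: instructions that differ only in one word from an existing one are discarded, while unrelated ones pass through.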

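You are asked to come up with a set of 20 diverse task instructions. These task instructions will be given to a GPT model, and we will evaluate the GPT model for completing the instructions.

Here are the requirements:

1. Try not to repeat the verb for each instruction, to maximize diversity.
2. The language used for the instruction should also be diverse. For example, you should combine questions with imperative instructions.
3. The type of instructions should be diverse. The list should include diverse types of tasks like open-ended generation, classification, editing, etc.
4. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual, photo, image, or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder, because it cannot perform any action.
5. The instructions should be written in Korean.
6. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
7. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging, but should ideally not exceed 100 words.
8. Not all instructions require input. For example, when an instruction asks about some general information, such as "What is the highest peak in the world?", it is not necessary to provide a specific context. In this case, we simply put "<noinput>" in the input field.
9. The output should be an appropriate response to the instruction and the input. Make sure the output is less than 100 words.

List of 20 tasks: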



- Korean results
![alpaca-self-gen-korean](https://user-images.githubusercontent.com/7252598/225496927-2df4614a-8e35-4032-b65a-54e898ca61e6.gif)

- English results
![alpaca-gen-self-instruct-en](https://user-images.githubusercontent.com/7252598/225496953-69f90efb-a20e-4147-bf4a-df623fc33e83.gif)


Alpaca: A Strong Instruction-Following Model #25