
Run Llama 2 on HiPerGator, a step-by-step guide #49

YunchaoYang opened this issue 7 months ago

YunchaoYang commented 7 months ago

Installation

  1. Request access. To download the model weights and tokenizer, visit the Meta website and accept the license; you will then receive an email with a download link.

  2. Clone the repository:
    git clone https://github.com/facebookresearch/llama.git
  3. Download the models:
    cd llama
    ./download.sh
There are three models available (7B, 13B, 70B), each with a different model-parallel (MP) value:

| Model | MP |
|-------|----|
| 7B    | 1  |
| 13B   | 2  |
| 70B   | 8  |

Prepare environment

  1. Create an environment with PyTorch/CUDA installed.
  2. In the repository, install llama locally (see the sketch after this list).
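A minimal sketch of these two steps on HiPerGator, assuming conda is available through the module system; the module name, Python/CUDA versions, and environment name are all assumptions, so check `module avail` for what is actually installed:

    # load conda (module name is an assumption; HiPerGator uses Lmod)
    module load conda
    conda create -n llama python=3.10 -y
    conda activate llama
    # PyTorch built against CUDA; the CUDA version here is an assumption
    conda install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia -y

    # install llama locally from the cloned repository
    cd llama
    pip install -e .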

How to use

Inference

All models support sequence lengths up to 4096 tokens, but the cache is pre-allocated according to the max_seq_len and max_batch_size values, so set those according to your hardware.

Pretrained Models

These models are not finetuned for chat or Q&A. They should be prompted so that the expected answer is the natural continuation of the prompt.

See example_text_completion.py for some examples. To illustrate, see the command below to run it with the llama-2-7b model (nproc_per_node needs to be set to the MP value):

torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4

Fine-tuned Chat Models

The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in chat_completion needs to be followed, including the [INST] and <<SYS>> tags, BOS and EOS tokens, and the whitespace and line breaks in between (we recommend calling strip() on inputs to avoid double spaces).
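For orientation, a single-turn prompt assembled in this format looks roughly like the sketch below; {system_prompt} and {user_message} are placeholders, and the BOS/EOS markers are added as special tokens by the tokenizer rather than typed as literal text:

    [INST] <<SYS>>
    {system_prompt}
    <</SYS>>

    {user_message} [/INST]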

You can also deploy additional classifiers for filtering out inputs and outputs that are deemed unsafe. See the llama-recipes repo for an example of how to add a safety checker to the inputs and outputs of your inference code.

Examples using llama-2-7b-chat:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

Llama 2 is a new technology that carries potential risks with use. Testing conducted to date has not, and could not, cover all scenarios. In order to help developers address these risks, we have created the Responsible Use Guide. More details can be found in our research paper as well.

YunchaoYang commented 6 months ago

How to fine-tune Llama 2

Example

This example is from ref [2], which trains a conversation model.

1. Data Preparation

Create one large JSONL file containing chunks of conversation, with a ### token marking each speaker turn (an illustrative line follows).
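A sketch of what a single JSONL line might look like; the "text" field name and the exact ### speaker convention are assumptions based on the description above, so match them to whatever format your training tool expects:

    {"text": "### Human: How do I reset my password?### Assistant: Use the 'Forgot password' link on the login page and follow the emailed instructions."}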

2. Push the JSONL file to the Hugging Face Hub:
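One way to do this from the command line, assuming a recent huggingface_hub CLI; the repo and file names are placeholders:

    # authenticate with a write token, then upload the file to a dataset repo
    huggingface-cli login
    huggingface-cli upload your-username/your-dataset train.jsonl train.jsonl --repo-type dataset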

3. Training with axolotl

Axolotl is a tool that streamlines the fine-tuning of LLMs. All you need is a YAML config file specifying the base model and the dataset (a sketch follows).
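A minimal sketch of an axolotl config; the key names follow axolotl's published examples, but every value below (model, dataset, LoRA and batch settings) is an assumption to adapt:

    # config.yml -- a minimal, assumed example
    base_model: meta-llama/Llama-2-7b-hf
    datasets:
      - path: your-username/your-dataset   # the dataset pushed above (placeholder)
        type: completion
    output_dir: ./llama2-finetuned
    sequence_len: 2048
    adapter: lora            # LoRA keeps memory requirements modest
    lora_r: 16
    lora_alpha: 32
    lora_dropout: 0.05
    lora_target_modules:
      - q_proj
      - v_proj
    micro_batch_size: 2
    gradient_accumulation_steps: 4
    num_epochs: 3
    learning_rate: 0.0002

Training is then launched with axolotl's documented entry point:

    accelerate launch -m axolotl.cli.train config.yml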

references

  1. A poor man's guide to fine-tuning Llama 2, using Hugging Face, axolotl, wandb, and Vast.ai
  2. Train on Beam: https://dev.to/dhanushreddy29/how-to-finetune-llama-2-a-beginners-guide-219e
YunchaoYang commented 6 months ago

Code Llama

Code Llama is a code-specialized version of Llama 2. It can generate code, and natural language about code, from both code and natural language prompts. It supports many of the most popular languages in use today, including Python, C++, Java, PHP, TypeScript (JavaScript), C#, and Bash.
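Inference follows the same torchrun pattern as above. A sketch, assuming the facebookresearch/codellama repository's example_completion.py script and a downloaded CodeLlama-7b checkpoint:

    torchrun --nproc_per_node 1 example_completion.py \
        --ckpt_dir CodeLlama-7b/ \
        --tokenizer_path CodeLlama-7b/tokenizer.model \
        --max_seq_len 128 --max_batch_size 4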

references

https://ai.meta.com/blog/code-llama-large-language-model-coding/

YunchaoYang commented 6 months ago

Using Llama 2 models on Hugging Face

Llama 2 and LangChain
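A minimal setup sketch, assuming the transformers and langchain packages are used to load the model; note that the Llama 2 repositories on the Hugging Face Hub are gated, so accept the license on the model card and authenticate first:

    pip install transformers accelerate langchain
    # log in with a token from an account that has accepted the Llama 2 license
    huggingface-cli login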

references

  1. https://www.youtube.com/watch?v=MDA3LUKNl1E
  2. https://github.com/curiousily/Get-Things-Done-with-Prompt-Engineering-and-LangChain
  3. https://www.mlexpert.io/prompt-engineering/langchain-quickstart-with-llama-2