TencentARC / LLaMA-Pro

[ACL 2024] Progressive LLaMA with Block Expansion.
https://tencentarc.github.io/LLaMA-Pro/
Apache License 2.0

guide to run the code #11

Open Abolfazl-kr opened 5 months ago

Abolfazl-kr commented 5 months ago

Thanks for your effort. I have a little confusion about the process; correct me if I'm wrong. First, we should run block_expansion.py to create the expanded model. Then, we clone the repository at https://github.com/hills-code/open-instruct.git@7c2b14d and run finetune_codealpaca.sh. Is this correct?
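(Editor's note: for readers following along, a minimal sketch of that two-step workflow is below. The block_expansion.py argument names are placeholders, not the script's actual flags; check `python block_expansion.py --help`, and the location of finetune_codealpaca.sh inside the repo may differ.)

```bash
# 1) Expand the base model (argument names are hypothetical placeholders).
python block_expansion.py --model_path <BASE_MODEL_DIR> --output_path <EXPANDED_DIR>

# 2) Fine-tune with the pinned open-instruct fork at commit 7c2b14d.
git clone https://github.com/hills-code/open-instruct.git
cd open-instruct
git checkout 7c2b14d
bash finetune_codealpaca.sh   # exact path inside the repo may differ
```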

Regarding your repo, I have some problems with this process too:

1. After running block_expansion.py, a 14.5 GB pytorch_model.bin file is created. It does not come with a pytorch_model.bin.index.json or any other files. However, the Hugging Face model has two shards plus all the extra files needed, such as pytorch_model.bin.index.json, special_tokens_map.json, generation_config.json, and config.json. How could we create them?

2. I want to pretrain the model on my own raw text. What should I do? My data is not one of the datasets you mention (SlimOrca, etc.). How can I transform my dataset so it works with your code?

hills-code commented 5 months ago
  1. You do not need pytorch_model.bin.index.json. For the other necessary files, you can just copy them from the original base model (see the first sketch after this list).
  2. The code can load a dataset directly from the Hugging Face Hub using datasets.load_dataset('YOUR_DATASET'). However, if you want to pretrain, you may need to revise the tokenize function, since it is written for SFT and masks the instruction labels during processing (see the second sketch after this list).
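(Editor's note: a minimal sketch of point 1, assuming the expanded checkpoint sits in its own directory next to a local copy of the base model; both paths are placeholders. Note that if block_expansion.py added layers, the copied config.json would also need its num_hidden_layers updated to the new depth before loading.)

```python
import shutil
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer

base_dir = Path("llama-2-7b-hf-local")       # local copy of the original base model (placeholder path)
expanded_dir = Path("llama-pro-expanded")    # directory holding the new pytorch_model.bin (placeholder path)

# Copy the auxiliary files from the base model; block expansion only produced new weights.
for name in ["config.json", "generation_config.json", "tokenizer.model",
             "tokenizer_config.json", "special_tokens_map.json"]:
    src = base_dir / name
    if src.exists():
        shutil.copy(src, expanded_dir / name)

# NOTE: edit expanded_dir/config.json so num_hidden_layers matches the expanded model
# before loading; the exact value depends on how many blocks were added.

model = AutoModelForCausalLM.from_pretrained(expanded_dir)
tokenizer = AutoTokenizer.from_pretrained(expanded_dir)
```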
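(Editor's note: for point 2, a minimal sketch of what a plain causal-LM tokenize function could look like for pretraining on raw text, as opposed to the SFT version that masks the instruction span with -100. The file name, column name, and tokenizer path are placeholders; this is not the repo's exact function.)

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llama-pro-expanded")  # placeholder path

# Raw-text corpus with a single "text" column (file name is a placeholder).
raw = load_dataset("text", data_files={"train": "my_corpus.txt"})

def tokenize_for_pretraining(examples, max_length=2048):
    out = tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_length,
    )
    # For plain language-model pretraining the labels are just the input ids;
    # there is no instruction span to mask out as in the SFT tokenize function.
    out["labels"] = [ids.copy() for ids in out["input_ids"]]
    return out

tokenized = raw["train"].map(tokenize_for_pretraining, batched=True, remove_columns=["text"])
```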
kiran-coditation commented 2 months ago

Hi @Abolfazl-kr, were you able to pretrain after block expansion? If yes, could you please guide me through the same?