Thanks for your effort. I'm a little confused about the process; correct me if I'm wrong. First, we run block_expansion.py to create our expanded model. Then we clone the repository at https://github.com/hills-code/open-instruct.git@7c2b14d and run finetune_codealpaca.sh. Is this correct?
Regarding your repo, I have some problems with this process too:
1. After running block_expansion.py, a 14.5 GB pytorch_model.bin file is created, with no pytorch_model.bin.index.json or any other files. However, the Hugging Face model has two shards plus all the extra files needed, like pytorch_model.bin.index.json, special_tokens_map.json, generation_config.json, and config.json. How can we create them?
2. I want to pretrain the model on my own raw text. What should I do? My data is not one of the datasets you mention, like SlimOrca. How can I transform my dataset to work with your code?
You do not need pytorch_model.bin.index.json. For the other necessary files, you can simply copy them from the original base model.
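A minimal sketch of that copying step, in case it helps. The file list and directory names below are assumptions, not something the repo prescribes; also note that if block_expansion.py does not emit its own config.json, the copied config's layer count may need to be updated by hand to match the expanded model.

```python
import shutil
from pathlib import Path

def copy_aux_files(base_dir, expanded_dir, names=None):
    """Copy tokenizer/config files from the base model next to the
    expanded pytorch_model.bin (file names are a typical assumption)."""
    names = names or [
        "config.json",
        "generation_config.json",
        "tokenizer_config.json",
        "special_tokens_map.json",
    ]
    expanded = Path(expanded_dir)
    expanded.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = Path(base_dir) / name
        if src.exists():  # skip anything the base model doesn't ship
            shutil.copy2(src, expanded / name)
            copied.append(name)
    return copied
```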
The code can load a dataset directly from the Hugging Face Hub with datasets.load_dataset('YOUR_DATASET'). However, if you want to pretrain, you may need to revise the tokenize function: it is written for SFT and masks the instruction labels during processing.
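One way that revision could look, as a sketch rather than the repo's actual function: for plain causal-LM pretraining, the labels are simply a copy of input_ids (no -100 masking of an instruction prefix). The "text" column name and the function name are assumptions for illustration.

```python
def tokenize_for_pretraining(example, tokenizer, max_seq_length=2048):
    """Tokenize raw text for causal-LM pretraining.

    Unlike an SFT tokenize function (which masks the instruction
    portion of the labels), this computes loss on every token by
    making labels a copy of input_ids.
    """
    out = tokenizer(
        example["text"],  # assumes your dataset has a "text" column
        truncation=True,
        max_length=max_seq_length,
    )
    out["labels"] = list(out["input_ids"])
    return out

# Typical wiring (sketch, not verified against the repo):
#   from datasets import load_dataset
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("PATH_TO_EXPANDED_MODEL")
#   ds = load_dataset("YOUR_DATASET", split="train")
#   ds = ds.map(lambda ex: tokenize_for_pretraining(ex, tokenizer),
#               remove_columns=ds.column_names)
```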