JulianBvW opened 1 year ago
One way would be to use the Dolly 2.0 JSON file as a template and structure your dataset in the same fashion, using the same keys. And then run the prepare_dolly script.
Or, maybe even easier would be to structure it similar to the Alpaca dataset, which has slightly different names for the keys, and then use the prepare_alpaca script.
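To make the two layouts concrete, here is a minimal sketch of both record formats. The key names match the Alpaca and Dolly 2.0 datasets; the example strings are placeholders you would fill from your scraped pages:

```python
import json

# Alpaca-style record: keys "instruction", "input", "output".
# "input" may be an empty string when no extra context is needed.
alpaca_records = [
    {
        "instruction": "Summarize the following wiki section.",
        "input": "The Citadel is a fortress located ...",   # scraped text
        "output": "The Citadel is a fortress that ...",     # desired answer
    },
]

# Dolly-2.0-style record: keys "instruction", "context", "response", "category".
dolly_records = [
    {
        "instruction": "Summarize the following wiki section.",
        "context": "The Citadel is a fortress located ...",
        "response": "The Citadel is a fortress that ...",
        "category": "summarization",
    },
]

# Write the Alpaca-style file the prepare script can pick up.
with open("my_dataset_alpaca.json", "w", encoding="utf-8") as f:
    json.dump(alpaca_records, f, indent=2, ensure_ascii=False)
```

Either file can then be pointed at the corresponding prepare script.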
Yes, but the question is how I can automatically fill the instruction, input, and output fields using the web-scraped texts from the wiki pages?
I'm also looking for the best way to create a dataset. I suppose we have to manually create a small seed dataset (instructions/outputs) and can then use Self-Instruct to expand it for training.
I'm not sure how much data we need to create, or how long each instruction and response should be.
Is there a more systematic way of creating the manual data?
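The Self-Instruct idea mentioned above can be sketched as a simple loop: sample a few seed instructions as few-shot examples, ask a model for a new one, and keep it only if it is sufficiently novel. Everything here is a simplified stand-in (the `generate` callable is a placeholder for an LLM call, and the paper uses ROUGE-L rather than word overlap for the novelty filter):

```python
import random

def word_overlap(a, b):
    """Crude similarity: Jaccard overlap of lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def expand_seed_set(seed_instructions, generate, rounds=10, max_overlap=0.7):
    """Self-Instruct-style expansion loop (simplified sketch).

    `generate` is a stand-in for an LLM call that takes a few-shot
    prompt and returns one new instruction string.
    """
    pool = list(seed_instructions)
    for _ in range(rounds):
        examples = random.sample(pool, k=min(3, len(pool)))
        prompt = ("Write one new task instruction similar to:\n"
                  + "\n".join(f"- {e}" for e in examples) + "\n-")
        candidate = generate(prompt).strip()
        # Keep only sufficiently novel instructions (the paper filters
        # near-duplicates with ROUGE-L; word overlap is a crude proxy).
        if candidate and all(word_overlap(candidate, p) < max_overlap for p in pool):
            pool.append(candidate)
    return pool
```

The accepted instructions still need responses, which is where the teacher-LLM generation discussed below in the thread comes in.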
Yes, but the question is how I can automatically fill the instruction, input, and output fields using the web-scraped texts from the wiki pages?
Oh, I think I now understand what you mean. Essentially, you don't have an instruction-finetuning dataset, correct? In other words, it's an "unlabeled" dataset? One way would be to create an instruction dataset via imitation learning: use another LLM (like GPT-4, e.g., via the API) to generate instruction/response pairs from your raw texts. This is essentially how the Alpaca dataset itself was created (for more details: https://github.com/tatsu-lab/stanford_alpaca#data-generation-process).
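A rough sketch of that teacher-LLM approach: hand the model one scraped passage and ask it to emit a complete Alpaca-style example as JSON. The prompt wording and the `generate_example` helper are illustrative, not part of any script in the repo; the API call assumes the official `openai` client and a configured API key:

```python
import json

def build_prompt(passage):
    """Ask the teacher model to turn a raw wiki passage into one
    Alpaca-style training example. The exact wording is just a sketch."""
    return (
        "Read the passage below and produce ONE training example as JSON "
        'with keys "instruction", "input", and "output". Use the passage '
        "as the input and write a matching instruction and answer.\n\n"
        f"Passage:\n{passage}"
    )

def parse_example(model_reply):
    """Parse the model's JSON reply and validate the expected keys."""
    example = json.loads(model_reply)
    assert set(example) == {"instruction", "input", "output"}
    return example

def generate_example(passage, client, model="gpt-4"):
    """Call the OpenAI chat API (requires `pip install openai` and an
    API key); `client` is an `openai.OpenAI()` instance."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(passage)}],
    )
    return parse_example(reply.choices[0].message.content)
```

Looping `generate_example` over your scraped passages and dumping the results to a JSON list would give you a file the prepare_alpaca script can consume.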
Or, if you are not interested in instruction-finetuning, I guess you could use your dataset with the pretraining script to further train the model via next-word prediction on the custom dataset.
I'm very new to this, so I apologize if it's incorrect, but I believe you can just follow the unstructured data guide or adapt the prepare_shakespeare.py code.
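The prepare_shakespeare-style route boils down to: encode the whole corpus as one flat array of token ids, split train/val, and sample shifted (input, target) blocks for next-word prediction. Here is a toy version using a character-level vocabulary for self-containedness (the real scripts use the model's actual tokenizer and write the arrays to disk):

```python
import numpy as np

def prepare_raw_text(text, train_frac=0.9):
    """Toy prepare_shakespeare-style preprocessing: build a char-level
    vocabulary, encode the corpus as token ids, and split train/val."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    ids = np.array([stoi[ch] for ch in text], dtype=np.uint16)
    n = int(train_frac * len(ids))
    return ids[:n], ids[n:], stoi

def get_batch(data, block_size, batch_size, rng):
    """Sample (x, y) where y is x shifted by one token -- the
    next-word-prediction target used during pretraining."""
    starts = rng.integers(0, len(data) - block_size - 1, size=batch_size)
    x = np.stack([data[s : s + block_size] for s in starts])
    y = np.stack([data[s + 1 : s + 1 + block_size] for s in starts])
    return x, y
```

With your wiki dump concatenated into one text file, the same shape of preprocessing feeds the pretraining script directly, no instruction/input/output structure needed.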
I want to fine-tune LLaMA on data I got from a fandom wiki (for example, this page) and was wondering how to design the JSON file with its "prompt", "input", and "output" fields.
I can't just use the prompt "Write the next sentence" and then put two adjacent sentences in the input and output, right?
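For illustration only, the naive pairing described in the question would look like the sketch below. As the rest of the thread suggests, this mostly teaches sentence continuation rather than instruction following, which is why the replies point toward teacher-generated examples or plain pretraining instead (the sentence splitter here is a crude regex, not a real tokenizer):

```python
import re

def naive_next_sentence_pairs(text):
    """The pairing described above: every pair of adjacent sentences
    becomes one record with a fixed 'Write the next sentence' prompt.
    Shown for illustration only."""
    # Very rough sentence split on ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        {"instruction": "Write the next sentence.", "input": a, "output": b}
        for a, b in zip(sentences, sentences[1:])
    ]
```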