e-p-armstrong / augmentoolkit

Convert Compute And Books Into Instruct-Tuning Datasets! Makes: QA, RP, Classifiers.
MIT License

how to use the output/which file to use? #25

Closed ares0027 closed 2 months ago

ares0027 commented 3 months ago

I am using the version from Pinokio, which installs the script by itself. After running it I have three output files (my input.txt is 77 KB):

- master_list.jsonl (74 KB): multiple conversations between an AI Assistant and a User, each conversation on a single line
- processed_master_list.json (74 KB): just a single line
- simplified_data.jsonl (16 KB): if I am not mistaken, ShareGPT output in the form of "conversations", "from", "value", etc.
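
For reference, here is a minimal sketch of what one such ShareGPT line contains, with made-up values, assuming the usual "conversations"/"from"/"value" layout (shown in Python for convenience):

import json

# Illustrative only: one line of simplified_data.jsonl in the ShareGPT
# convention. The values are invented; the key layout is the point.
line = (
    '{"conversations": ['
    '{"from": "human", "value": "What does the text say about X?"}, '
    '{"from": "gpt", "value": "According to the text, X is ..."}]}'
)
record = json.loads(line)
print([turn["from"] for turn in record["conversations"]])  # ['human', 'gpt']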

As you can understand, I don't even know what I am doing :D So I am trying to use LLaMA Factory (local installation), and it asks for "alpaca format", which is instruction, input, and output. So:

1. How can I convert this output to that format?
2. How can I use this format without converting anything?

(I am trying to train Llama 3, but any other model is acceptable.)

e-p-armstrong commented 3 months ago

You're right that simplified_data.jsonl is sharegpt. The others contain more information in case people want to add additional context to conversations, but simplified_data.jsonl is the one typically used for training.

What you do here depends on what you mean by "alpaca format". Do you mean the dataset format (like, how the data is organized in the file) or the prompt template (like, how the text appears to the LLM when it's being trained)? Because there exists an alpaca prompt template as well as an alpaca dataset format.
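
For concreteness, the standard Stanford Alpaca prompt template (the with-input variant) looks like the following, shown as a Python string for convenience, with {output} appended the way training frameworks typically format complete examples:

# The Stanford Alpaca prompt template. The braced fields are filled in
# from each record at training time; {output} is what the model learns
# to produce after "### Response:".
ALPACA_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""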

If it's the prompt template you're after, you might not have to convert the data file at all. If it's the data, you'll have to convert the file. I haven't used alpaca in training a model before, so I can't say for sure how you'd represent a multiturn conversation in that format. My guess is that you would put the system prompt in the instruction, the previous messages in the input, and the latest assistant message in the output, for each assistant message in each conversation.
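
A minimal sketch of that guess (the function name is mine, not the repo's; it assumes the ShareGPT layout shown earlier and emits one alpaca-style record per assistant turn):

import json

def sharegpt_to_alpaca(path):
    # Hypothetical converter: flattens each multiturn ShareGPT conversation
    # into alpaca-style records, one per assistant ("gpt") message.
    records = []
    with open(path) as f:
        for line in f:
            system, history = "", []
            for turn in json.loads(line)["conversations"]:
                if turn["from"] == "system":
                    system = turn["value"]
                elif turn["from"] == "gpt":
                    records.append({
                        "instruction": system,
                        "input": "\n".join(history),
                        "output": turn["value"],
                    })
                    history.append("ASSISTANT: " + turn["value"])
                else:  # "human"
                    history.append("USER: " + turn["value"])
    return records

with open("alpaca_data.json", "w") as f:
    json.dump(sharegpt_to_alpaca("simplified_data.jsonl"), f, indent=2)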

worstkid92 commented 3 months ago

I am new to this area and have the same confusion. Do you have any examples of how to train using these files, besides the tutorials? I am using a local GPU and I want to finetune Llama locally. Thanks a lot for your help.

ares0027 commented 3 months ago

Thank you so much for your reply. The alpaca format I am blabbering about is, I think, a JSON file that has "instruction:", "input:", and "output:" fields, so I guess I am talking about the alpaca dataset format.

Also, I just realized I may have been wrong from the beginning: instead of going for finetuning, I think I am looking for anchoring/RAG? Basically, I want to provide documents or internal webpages to an LLM, ask questions about them, and maybe ask it to provide references for its answers.

Another thing: I remember from your videos that you had output files with "RAG" in their names, but I do not have them. Is that due to a version difference, or am I doing something wrong? (Asking in case I am right about going for RAG.)

Lastly (I apologize for asking this / for wanting to be spoon-fed), could you give me a basic workflow of how you finetune/RAG(?) an LLM? I am not asking you to share any files or anything; just something like "get raw data - use augmentoolkit - take file X to tool Y - use model Z and train". Something simplified like that would be more than I could ever ask for :)

Regardless, thank you for your response and the amazing tool :)

worstkid92 commented 3 months ago

I guess we are in the same confusion. I am new to this tool and have no idea what to do with these files. I am using a local GPU and want to finetune a local model. Please share if you have any updates @worstkid92.

worstkid92 commented 3 months ago

I now have trouble with simplified_data.jsonl. It seems this format cannot be used directly. My code looks like this:

from datasets import load_dataset
from trl import SFTTrainer

dataset_raw = load_dataset('json', data_files=jsonl_path, split='train')
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset_raw,
    dataset_text_field="text",  # fails: the ShareGPT file has no "text" column
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    data_collator=data_collator, ...
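
One way around that, sketched under the assumption that the file is ShareGPT-style and your tokenizer has a chat template (the to_text helper below is hypothetical, not from the repo): map each "conversations" list into the single "text" field that dataset_text_field expects, before handing the dataset to SFTTrainer.

from datasets import load_dataset

# ShareGPT role names -> the roles most chat templates expect.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_text(example):
    # Hypothetical helper: render one conversation into a single training
    # string using the tokenizer's chat template. `tokenizer` is your
    # loaded tokenizer, as in the snippet above.
    messages = [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in example["conversations"]
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset_raw = load_dataset('json', data_files=jsonl_path, split='train')
dataset_text = dataset_raw.map(to_text, remove_columns=["conversations"])
# dataset_text now has a "text" column, so dataset_text_field="text" resolves.
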
e-p-armstrong commented 2 months ago

Good questions! The video demos cover how to finetune, showing every step. As for RAG vs. non-RAG data: the only difference is that the RAG version includes the context from which the questions were made in the system prompt, so that the LLM learns retrieval.
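
In other words, a RAG-style record would look roughly like this (illustrative values, assuming the same ShareGPT layout as simplified_data.jsonl; not the tool's exact output):

import json

# Hypothetical RAG-style record: the source chunk rides along in the
# system prompt, so the model learns to answer from provided context.
rag_record = {"conversations": [
    {"from": "system",
     "value": "Use the following context to answer.\n\nContext:\n<chunk of the source document>"},
    {"from": "human", "value": "What does the document say about X?"},
    {"from": "gpt", "value": "According to the context, X is ..."},
]}
print(json.dumps(rag_record))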