@musabgultekin I use `clean_sharegpt`, `optional_clean` (with `--keep-lang en`), and `extract_gpt4_only` (roughly 15K conversations after this step). I then add back the non-GPT-4 conversations longer than 16K tokens (~2K of them) and run `split_long_conversation` with a 16K max length, ending up with ~18K conversations. I started from a ShareGPT dump with ~80K conversations.
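Roughly, the commands look like this. This is a sketch modeled on FastChat's data-cleaning docs, not an exact recipe: flag names may differ between FastChat versions, and all file paths and the model path are placeholders, so check each module's `--help`:

```bash
# Clean the raw ShareGPT HTML dump (~80K conversations in this case).
python3 -m fastchat.data.clean_sharegpt --in-file sharegpt_raw.json --out-file sharegpt_clean.json

# Keep only English conversations.
python3 -m fastchat.data.optional_clean --in-file sharegpt_clean.json --out-file sharegpt_en.json --keep-lang en

# Keep only conversations that contain GPT-4 responses (~15K after this step).
python3 -m fastchat.data.extract_gpt4_only --in-file sharegpt_en.json --out-file sharegpt_gpt4.json

# Adding the ~2K long non-GPT-4 conversations back in is a separate step,
# e.g. with fastchat.data.merge.

# Split conversations longer than 16K tokens, using the base model's tokenizer.
python3 -m fastchat.data.split_long_conversation --in-file sharegpt_gpt4.json \
    --out-file sharegpt_split.json --model-name-or-path /path/to/base-model --max-length 16384
```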
Thanks for the question! We are iterating on the data processing and training a new version. We will release an official document on the final version of the data.
Hi,

I'm trying to fine-tune the model with function-calling support. When I run the Vicuna `prepare_all.py` data pipeline on the original ShareGPT dataset with a 16K max token length, I get ~60K conversations in `merged_clean_lang_split_identity_single.json`. When I keep only the English conversations, I get ~45K.

What kind of filtering did you do exactly? For example, did you remove the short conversations? Did you keep only English? etc.
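For reference, I invoke the pipeline roughly like this. This is a sketch, not an exact command: the `--prefix`, `--model-name-or-path`, and `--seq-len` flag names come from my copy of `fastchat/data/prepare_all.py` and may differ in other FastChat versions, and the paths are placeholders:

```bash
# Run the whole data pipeline (clean -> language filter -> split -> merge).
# Output files are named after the prefix, e.g.
# <prefix>_clean_lang_split_identity_single.json.
python3 -m fastchat.data.prepare_all \
    --prefix ~/datasets/sharegpt \
    --model-name-or-path /path/to/base-model \
    --seq-len 16384
```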
Hi @musabgultekin, may I ask where I can get the original ShareGPT dataset? Thank you very much!