DachengLi1 / LongChat

Official repository for LongChat and LongEval
Apache License 2.0

How was the 18k dataset prepared? #5

Closed. musabgultekin closed this issue 1 year ago.

musabgultekin commented 1 year ago

Hi,

I'm trying to fine-tune the model with function support. When I run the Vicuna prepare_all.py data pipeline on the original ShareGPT dataset with a 16k max token length, I get ~60k conversations in merged_clean_lang_split_identity_single.json. When I keep only the English conversations, I get ~45k.

What filtering did you do exactly? For example, did you remove the short conversations? Did you keep only English? etc.
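
For reference, the counts above come from simply loading the pipeline's JSON output. A minimal sanity check like the one below reproduces the conversation count; it assumes the usual ShareGPT-style schema (a list of records, each with a `conversations` list of turns), so adjust the field names if your dump differs:

```python
# Minimal sanity check on the pipeline output (assumed ShareGPT-style schema).
import json

with open("merged_clean_lang_split_identity_single.json") as f:
    data = json.load(f)

print(f"{len(data)} conversations")
print(f"{sum(len(c['conversations']) for c in data)} total turns")
```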

DachengLi1 commented 1 year ago

@musabgultekin I use: clean_sharegpt, optional_clean (keep-lang en), extract_gpt4_only (roughly 15k conversations after this step), then add back the non-GPT-4 conversations with > 16K tokens (~2k), and finally split_long_conversation (but with a 16K max length), ending up with ~18K. I started from a ShareGPT dump with ~80K conversations.

Thanks for the question! We are iterating on the data processing and a new round of training, and will release an official document on the final version of the data.
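
For anyone trying to reproduce this recipe, here is a rough, unofficial sketch of the filtering logic in plain Python. It is not the repo's actual pipeline (the real steps are the FastChat modules listed above); it assumes ShareGPT-style records with a `conversations` list of `{"from", "value"}` turns and an optional `model` field marking GPT-4 chats, and it uses a LLaMA tokenizer for the token counts. Field names, file names, and the tokenizer are all assumptions, so adjust them to your dump.

```python
# Unofficial sketch of the filtering described above, NOT the repo's actual scripts.
# Assumptions: ShareGPT-style records [{"id", "model", "conversations": [...]}],
# a "model" field marking GPT-4 chats, and a LLaMA tokenizer for token counts.
import json
from transformers import AutoTokenizer

MAX_LEN = 16384  # 16K context target used for LongChat
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=True)  # assumed tokenizer

def num_tokens(conv):
    """Total token count over all turns of one conversation."""
    return sum(len(tokenizer(turn["value"]).input_ids) for turn in conv["conversations"])

def split_long(conv, max_len=MAX_LEN):
    """Split one conversation into chunks of whole turns, each roughly under max_len tokens
    (rough stand-in for fastchat.data.split_long_conversation)."""
    chunks, current, current_len = [], [], 0
    for turn in conv["conversations"]:
        turn_len = len(tokenizer(turn["value"]).input_ids)
        if current and current_len + turn_len > max_len:
            chunks.append(current)
            current, current_len = [], 0
        current.append(turn)
        current_len += turn_len
    if current:
        chunks.append(current)
    return [{"id": f"{conv['id']}_{i}", "conversations": c} for i, c in enumerate(chunks)]

# Input: output of clean_sharegpt + optional_clean --keep-lang en (example file name).
with open("sharegpt_clean_lang.json") as f:
    data = json.load(f)

# Keep GPT-4 conversations, plus non-GPT-4 conversations longer than 16K tokens.
kept = [c for c in data
        if c.get("model") == "gpt4"   # assumed marker; check how your dump labels GPT-4 chats
        or num_tokens(c) > MAX_LEN]

# Split anything still longer than 16K into multiple training samples.
final = [chunk for c in kept for chunk in split_long(c)]
print(f"{len(data)} -> {len(kept)} kept -> {len(final)} after splitting")

with open("longchat_16k.json", "w") as f:
    json.dump(final, f, indent=2)
```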

Arist12 commented 1 year ago


Hi @musabgultekin, may I ask where I can get the original ShareGPT dataset? Thank you very much!

musabgultekin commented 1 year ago

@Arist12 AFAIK https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
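
If it helps, that dataset can be pulled locally with the Hugging Face hub client. A minimal sketch (the repo id comes from the link above; the local directory is just an example):

```python
# Download the ShareGPT dump referenced above from the Hugging Face Hub.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="anon8231489123/ShareGPT_Vicuna_unfiltered",
    repo_type="dataset",
    local_dir="./sharegpt_raw",  # example location
)
print(f"Downloaded to {path}")
```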