MiuLab / Taiwan-LLM

Traditional Mandarin LLMs for Taiwan
https://twllm.com
Apache License 2.0
1.26k stars 104 forks

About the spec of instruction tuning dataset #35

Closed HuangChiEn closed 1 year ago

HuangChiEn commented 1 year ago

Thanks for releasing this amazing work. Since both training datasets are currently unavailable on Hugging Face due to license concerns:

Could you please provide the spec of the instruction tuning dataset?

We want to find an alternative Traditional Chinese dataset matching the same spec.

Spec:

  1. the number of instruction samples (in thousands)
  2. the number of seed tasks used to generate the data
adamlin120 commented 1 year ago

Thanks for your interest!

For IFT, v1.0 was trained on ~500k examples (all in Mandarin), including manually written examples and examples from proprietary models. I also wrote ~100 seed QA pairs, which were then paraphrased using model-based approaches.
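The seed-and-paraphrase approach described above can be sketched roughly as follows. This is a minimal toy illustration, not the actual pipeline: the seed pairs and paraphrase templates here are invented for demonstration, and the template function stands in for what would in practice be an LLM call that rewrites each question while preserving its meaning.

```python
import random

# Illustrative seed pool: stand-ins for the ~100 manually written QA pairs.
SEED_QA = [
    {"instruction": "台灣最高的山是哪一座？", "output": "玉山，海拔約 3,952 公尺。"},
    {"instruction": "請用一句話介紹台北 101。", "output": "台北 101 是位於台北市信義區的摩天大樓。"},
]

# Stand-in for a model-based paraphraser; a real pipeline would prompt an LLM
# with something like「請改寫下列問題，但保持意思不變」instead of fixed templates.
PARAPHRASE_TEMPLATES = [
    "請問{q}",
    "我想知道：{q}",
    "{q} 請簡單說明。",
]

def expand_seeds(seeds, n_per_seed, rng):
    """Expand seed QA pairs into a larger instruction set via paraphrasing."""
    dataset = []
    for seed in seeds:
        for _ in range(n_per_seed):
            template = rng.choice(PARAPHRASE_TEMPLATES)
            dataset.append({
                "instruction": template.format(q=seed["instruction"]),
                "output": seed["output"],  # answer is reused across paraphrases
            })
    return dataset

rng = random.Random(0)
data = expand_seeds(SEED_QA, n_per_seed=3, rng=rng)
print(len(data))  # 2 seeds × 3 paraphrases = 6 examples
```

Scaling the same idea (more seeds, many model-generated paraphrases per seed, plus manually written and proprietary-model examples) is one plausible way to reach a corpus on the order of 500k examples.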

Lots of interesting Mandarin instruction sets have been released on Hugging Face by the community. Please check them out :)

adamlin120 commented 1 year ago

BTW, I have re-listed our IFT dataset on Hugging Face: https://huggingface.co/datasets/yentinglin/traditional_mandarin_instructions