LC1332 / Chat-Haruhi-Suzumiya

Chat凉宫春日 (Chat-Haruhi-Suzumiya), an open-source role-playing chatbot by Cheng Li, Ziang Leng, and others.
Apache License 2.0
1.85k stars 164 forks

Arxiv Code and Data Release #17

Open LC1332 opened 1 year ago

LC1332 commented 1 year ago

The code, demos, and data related to the arXiv paper will be released within one to two weeks.

Zhuqln commented 1 year ago

Hello geniuses! Thanks for your amazing work! I noticed that you released the 54k dataset, but I couldn't find how to format the raw data for training. Could you share how you prepared the dataset?

LC1332 commented 1 year ago

When you say raw data, do you mean the raw novel text, or the text data we finally ask the language model to learn?

For the latter, see the dataset we've uploaded: https://huggingface.co/datasets/silk-road/Chat_Suzumiya_Fusion
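For readers who want to look at that dataset before training, here is a minimal sketch using the Hugging Face `datasets` library. It assumes `datasets` is installed and that the repository loads with its default configuration; the helper name `inspect_fusion_dataset` is hypothetical, and no column names are assumed, since the thread does not list them.

```python
# Minimal sketch: download the dataset linked above and report its layout.
# Assumes `pip install datasets` and network access to the Hugging Face Hub.

DATASET_ID = "silk-road/Chat_Suzumiya_Fusion"  # taken from the link in this thread


def inspect_fusion_dataset(dataset_id: str = DATASET_ID):
    """Load the dataset and print its splits and column names.

    Hypothetical helper for illustration; it only reports what the Hub
    returns and does not assume any particular field names.
    """
    # Imported here so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(dataset_id)  # a DatasetDict keyed by split name
    for split_name, split in dataset.items():
        print(f"{split_name}: {split.num_rows} rows, columns = {split.column_names}")
    return dataset
```

Calling `inspect_fusion_dataset()` prints one line per split, which should be enough to see which fields hold the text the model is trained on.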

For raw data extraction, most of the code is in the kyon_generator folder; we may clean it up later.

For novel data extraction, see the notebook here:

https://github.com/LC1332/Prophet-Andrew-Ng/blob/main/langchain/%E6%9D%8E%E9%B2%81%E9%B2%81%E5%AD%A6LangChain_25_Kor%E4%BF%A1%E6%81%AF%E6%8A%BD%E5%8F%96.ipynb

The project is still under construction, and I will release a tutorial later on how to extract dialogues from a novel.
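Until that tutorial is available, the following is a rough sketch of the general idea of pulling quoted speech out of novel text. It is not the project's method: the linked notebook uses Kor with a language model, while this stand-in uses a plain regular expression and does not attribute speakers. The names `QUOTE_PATTERN` and `extract_quotes` are hypothetical.

```python
import re
from typing import List

# Matches text inside straight double quotes, curly double quotes,
# or CJK corner brackets. Nested or unbalanced quotes are not handled.
QUOTE_PATTERN = re.compile(r'"([^"]+)"|“([^”]+)”|「([^」]+)」')


def extract_quotes(text: str) -> List[str]:
    """Return every quoted span in `text`, in order of appearance."""
    quotes = []
    for match in QUOTE_PATTERN.finditer(text):
        # Exactly one of the three groups is set for any given match.
        quote = next(group for group in match.groups() if group is not None)
        quotes.append(quote.strip())
    return quotes


sample = 'Kyon sighed. "Not again," he said. 「ただの人間には興味ありません」'
quotes = extract_quotes(sample)
print(quotes)
```

A regex pass like this only finds the quoted lines; deciding who said each line is the harder step, and that is presumably where an LLM-based extractor becomes useful.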