bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create dataset du_reader #357

Closed: albertvillanova closed this issue 2 years ago

albertvillanova commented 2 years ago

DONE: https://huggingface.co/datasets/bigscience-catalogue-data/du_reader_lm

Sample:

{'text': '茄子卤面的做法如下一、材料虾仁100g、茄子150g、面条350g、食盐适量、酱油适量、味精适量、姜适量、蚝油适量、调和油适量、柿子椒适量、洋葱适量、黑木耳适量、苦菜适量、香菇适量、胡萝卜适量、鸡蛋1只。二、做法1、虾去头去皮去肠,洗净备用。2、干香菇提前洗干净泡发,切成块继续放在水里泡。3、茄子去皮切成片。4、木耳、黄花菜提前泡发好回刀。洋葱、胡萝卜、辣椒切成块。5、炒锅放油加热,茄子入锅煎后盛出。6、炒锅放油加热,姜丝入锅煸香后,加虾仁、洋葱、茄子、辣椒、胡萝卜、木耳、黄花菜酱油、蚝油入锅翻炒。7、香菇连同泡香菇的水一同入锅。8、再加适量的水,盖上锅盖烧开,加盐、味精,搅拌均匀。9、淋入芡汁、鸡蛋,搅拌均匀,再煮开后关火。10、煮锅加水烧开,煮熟面条。11、面条出锅,配上卤、黄瓜即可。'}
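
For anyone who wants to inspect records like the sample above, here is a minimal sketch. It assumes the dataset can be loaded with the Hugging Face `datasets` library under the repository name from the link, that a `train` split exists, and that you have access to the repository; none of this is confirmed in this thread. The helper names `iter_texts` and `load_du_reader_lm` are hypothetical.

```python
from typing import Iterable, Iterator


def iter_texts(records: Iterable[dict]) -> Iterator[str]:
    """Yield the `text` field of each record, skipping missing or blank ones."""
    for record in records:
        text = record.get("text", "")
        if text.strip():
            yield text


def load_du_reader_lm(split: str = "train"):
    """Load the dataset from the Hub.

    Assumptions: the `train` split name and public access to the
    repository are guesses, not facts stated in this issue.
    """
    # Imported lazily so iter_texts works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("bigscience-catalogue-data/du_reader_lm", split=split)


if __name__ == "__main__":
    # Offline demonstration on hand-written records shaped like the sample.
    demo = [{"text": "茄子卤面的做法如下"}, {"text": "   "}, {}]
    for text in iter_texts(demo):
        print(text)
```

With network access, `iter_texts(load_du_reader_lm())` should walk the real records in the same way, provided the assumptions above hold.
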
mariosasko commented 2 years ago

self-assign

mariosasko commented 2 years ago

The LM script is available here: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_zh_du_reader

albertvillanova commented 2 years ago

Thanks @mariosasko