THUDM / GLM

GLM (General Language Model)

The pretraining corpus of GLM-Large-Chinese #59

Closed: cklsoft closed this issue 1 year ago

cklsoft commented 1 year ago

Hi,

  1. What pretraining corpus were the released GLM-Large-Chinese/GLM-10B-Chinese models trained on? Is it Wiki+BookCorpus, as stated in the README, or WuDao, Baike, and Zhihu, as listed in config/ds_block_large_chinese.sh?
  2. Also, how large is the corpus used to train GLM-Large-Chinese and GLM-10B-Chinese? Thanks.
duzx16 commented 1 year ago

Sorry for the mistake in the README. Both Chinese models are pre-trained on WuDaoCorpus (1.1 TB), Baidu Baike (87 GB), and Zhihu (131 GB).
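
For a rough sense of scale, the three corpora together come to about 1.3 TB of raw text. A minimal sketch of that arithmetic, using the sizes from the reply above (decimal units, 1 TB = 1000 GB, are an assumption since the reply does not specify):

```python
# Corpus sizes reported above for the Chinese GLM pre-training data.
# Assumes decimal units (1 TB = 1000 GB); the reply does not specify
# binary vs. decimal, so treat the total as approximate.
corpora_gb = {
    "WuDaoCorpus": 1100,  # 1.1 TB
    "Baidu Baike": 87,
    "Zhihu": 131,
}

total_gb = sum(corpora_gb.values())
print(f"Total raw corpus: {total_gb} GB (~{total_gb / 1000:.2f} TB)")
# -> Total raw corpus: 1318 GB (~1.32 TB)
```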