CASIA-LM / ChineseWebText


ChineseWebText: Large-Scale High-quality Chinese Web Text Extracted with Effective Evaluation Model

This repository contains the ChineseWebText dataset and the EvalWeb tool-chain, which processes CommonCrawl data to obtain high-quality Chinese data. The ChineseWebText dataset is publicly available on Hugging Face (here).

ChineseWebText

We release the latest and largest Chinese dataset, ChineseWebText, which consists of 1.42 TB of data (see Table 1). Each text is assigned a quality score, so LLM researchers can select data according to a quality threshold of their own choosing. We also release a much cleaner subset of 600 GB of Chinese text with quality exceeding 90%.

EvalWeb

Introduction

We introduce EvalWeb, a new complete tool-chain (see Figure 1) that extracts high-quality Chinese texts from raw web data. Crawled web data first passes through a preparation module, which extracts the monolingual Chinese data. A preprocessing module then filters the data further with manually crafted rules, including data length, sensitive words, and the proportion of Chinese characters. Finally, a BERT-based evaluation model assesses the quality of the filtered data. In this way, each text receives a quality score, and an appropriate threshold can be used to extract high-quality data as required. Furthermore, considering computational cost and efficiency, we also use knowledge distillation to train a FastText classifier, which achieves similar performance with higher speed and lower computational cost.

Environment Dependencies

scikit-learn==1.3.0
transformers==4.31.0
scipy==1.11.1
numpy==1.24.3
pytorch==2.0.1
jieba==0.42.1
zhconv==1.4.3
fasttext==0.9.2

Stage 1: Data Preparation

1. Deduplication and Language Identification (LID) using CCNet Tools

{
  "hash_in_mem": 10,
  "dump": "2023-23",
  "task_parallelism": 20,
  "num_shards": 5000,
  "mine_num_processes": 20,
  "num_segments_per_shard":-1,
  "lang_whitelist": ["zh","en"],
  "lang_blacklist": [],
  "lang_threshold": 0.5,
  "keep_bucket": [],
  "pipeline": ["dedup", "lid", "keep_lang", "sp", "lm", "pp_bucket", "drop", "split_by_lang"],
  "metadata": "None",
  "execution": "local",
  "output_dir": "data",
  "mined_dir": "mined",
  "target_size": "4G",
  "min_len": 300,
  "cache_dir": "/mnt/data/ccnet_data/commoncrawl"
}
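
As a rough illustration of how the language and length settings above interact, the following sketch applies them to a single document record. It is not part of CCNet or of this repository: the function `keep_document` is hypothetical, and the record fields (`language`, `language_score`, `raw_content`) are assumed to follow the usual CCNet output format.

```python
import json

# Subset of the CCNet config shown above.
config = json.loads("""
{
  "lang_whitelist": ["zh", "en"],
  "lang_blacklist": [],
  "lang_threshold": 0.5,
  "min_len": 300
}
""")


def keep_document(doc: dict, cfg: dict) -> bool:
    """Return True if a document passes the language and length settings.

    Assumption: `doc` carries `language`, `language_score` and
    `raw_content` fields; this mirrors our reading of the config keys,
    not the actual CCNet implementation.
    """
    if doc["language"] in cfg["lang_blacklist"]:
        return False
    # An empty whitelist is treated as "accept every language".
    if cfg["lang_whitelist"] and doc["language"] not in cfg["lang_whitelist"]:
        return False
    if doc["language_score"] < cfg["lang_threshold"]:
        return False
    return len(doc["raw_content"]) >= cfg["min_len"]
```

With this config, a Chinese document of at least 300 characters and a language score of 0.5 or higher would be kept, while a French document or a low-confidence one would be dropped.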

2. Splitting data and merging into jsonl files.

python merge2jsonl.py --source /mnt/data/ccnet_clean/cc_net/data/mined_split/2023-23 --target /mnt/data/cc_cleaned/2023-23
cleared*.jsonl

Stage 2: Preprocessing

This stage extracts high-quality texts from the Chinese monolingual web data, using manually crafted rules to filter out violent, pornographic, and advertising content as well as erroneous characters. The filtering rules are as follows:

Extract the text content from the jsonl files produced by the data preparation stage.

To improve language model training, documents are filtered out if their average line length is fewer than 10 characters or their total text length is less than 200 characters, since such short texts often lack meaningful context and semantic relevance.
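
A minimal sketch of this length rule is shown here. The function name and the choice to ignore blank lines are assumptions made for illustration; the thresholds come from the description above, and the actual logic lives in preprocess.py.

```python
def passes_length_filter(text: str,
                         min_avg_line_len: int = 10,
                         min_total_len: int = 200) -> bool:
    """Keep a document only if it is long enough overall and per line.

    Assumption: blank lines are ignored and lengths are counted in
    characters after stripping surrounding whitespace.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    total_len = sum(len(line) for line in lines)
    avg_line_len = total_len / len(lines)
    return avg_line_len >= min_avg_line_len and total_len >= min_total_len
```

A document made of many very short lines fails the average-line-length check even when its total length is above 200 characters.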

To create a high-quality simplified Chinese dataset suitable for training large language models, traditional Chinese characters are eliminated and texts with less than 30% Chinese characters are removed.
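
The proportion check could look roughly like the sketch below. It is an illustration only: it counts characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and ignores whitespace, both of which are assumptions, and it does not cover the handling of traditional Chinese characters.

```python
def chinese_char_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Chinese characters.

    Assumption: "Chinese character" means a code point in the basic
    CJK Unified Ideographs block (U+4E00..U+9FFF).
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    chinese = sum(1 for ch in chars if "\u4e00" <= ch <= "\u9fff")
    return chinese / len(chars)


def passes_character_filter(text: str, min_ratio: float = 0.3) -> bool:
    """Keep texts whose Chinese character ratio is at least `min_ratio`."""
    return chinese_char_ratio(text) >= min_ratio
```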

To prevent large language models from generating toxic content, each text is checked for occurrences of harmful words from a predefined list. Any text with more than 0.5 occurrences of such words per line is classified as toxic and removed from the training dataset.
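
The per-line rate can be sketched as follows. The helper names are hypothetical, the word list is a placeholder rather than the list used in this repo, and matching by plain substring count is an assumption.

```python
def sensitive_words_per_line(text: str, sensitive_words: list[str]) -> float:
    """Average number of sensitive-word occurrences per non-empty line."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    # Skip empty entries: str.count("") would otherwise inflate the hits.
    words = [w for w in sensitive_words if w]
    hits = sum(line.count(w) for line in lines for w in words)
    return hits / len(lines)


def is_toxic(text: str, sensitive_words: list[str],
             max_per_line: float = 0.5) -> bool:
    """Flag a text whose sensitive-word rate exceeds `max_per_line`."""
    return sensitive_words_per_line(text, sensitive_words) > max_per_line
```

For example, one hit spread over three lines gives a rate of about 0.33 and is kept, while two hits over two lines gives 1.0 and is removed.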

To enhance training efficiency and model performance, each data entry is then analyzed at 13-gram granularity, and entries in which over 50% of the character sequences are repetitive are filtered out.
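
One possible reading of this rule is sketched below: slide a 13-character window over the text and measure the share of windows whose content appears more than once. The exact definition of "repetitive" used in preprocess.py may differ, so treat the function names and the ratio definition as assumptions.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 13) -> float:
    """Share of character n-grams whose content occurs more than once."""
    if len(text) < n:
        return 0.0
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def is_internally_duplicated(text: str, n: int = 13,
                             max_ratio: float = 0.5) -> bool:
    """Flag an entry when more than `max_ratio` of its n-grams repeat."""
    return repeated_ngram_ratio(text, n) > max_ratio
```

A text that repeats the same 20-character unit several times scores 1.0 under this definition, while a text with no repeated 13-character window scores 0.0.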

Here is an example command to run the preprocessing stage:

python preprocess.py --dates 2023-06 2023-14

The "dates" parameter passed in corresponds to the folder names of the snapshots generated during the preparation stage.

After running this command, you will get six subfolders under each date's folder: "text_extraction", "length", "Character", "sensitive", "duplication" and "remain". The "text_extraction" folder contains the results of extracting text from each piece of data. The "length", "Character", "sensitive", and "duplication" folders correspond to the four filtering operations and store the noise data that each one filtered out. The "remain" folder stores the data that remains after the preprocessing stage, which will subsequently be scored by our evaluation model.

Stage 3: Quality Evaluation

In the preprocessing stage, we used handcrafted rules to remove explicitly noisy texts from the dataset. However, the remaining data still contains a considerable amount of low-quality text that cannot be filtered out with handcrafted rules. To extract higher-quality data from it, in this stage we further design evaluation models.

Stage 3.1: BERTEval

1. BERTEval Training Data Composition

2. BERTEval Training and Inference

Stage 3.2: FastText

1. FastText Training Data Composition

2. FastText Training and Inference

We provide our FastText training data examples and training script in folder "fasttext". You can download our trained FastText model here and replace the existing file located at "./fasttext/output/model.bin".

cd fasttext
python main.py --mode train --train_file ./data/train.txt --test_file ./data/test.txt

To understand the process of constructing the "train.txt" and "test.txt" files, please refer to the "./data/build_data.py".
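
For orientation, fastText's supervised mode expects one example per line, starting with a `__label__<name>` prefix followed by space-separated tokens. The sketch below formats one such line; it is not the logic of build_data.py. The label names `clean` and `dirty` are hypothetical, and the tokens are assumed to be already segmented (for example with jieba, which is listed in the dependencies).

```python
def to_fasttext_line(tokens: list[str], is_high_quality: bool) -> str:
    """Format one pre-segmented text as a fastText supervised example.

    Assumption: the label names are illustrative only; check
    build_data.py for the labels actually used in this repo.
    """
    label = "__label__clean" if is_high_quality else "__label__dirty"
    text = " ".join(tok.strip() for tok in tokens if tok.strip())
    return f"{label} {text}"


example = to_fasttext_line(["今天", "天气", "很", "好"], True)
```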

The trained model "model.bin" will be stored in the "output" folder.

Once you have the data remaining after the preprocessing stage (it should be stored in a path like "./2023-06/remain"), you can use our FastText model to score all of it:

python main.py --mode test --dates 2023-06 2023-14

This step assigns a FastText score to each data entry and stores the results in a directory such as "./2023-06/remain/fasttext". You can then use these scores to filter and extract high-quality data with a threshold (set to 0.5 by default).
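
Applying the threshold to scored jsonl records might look like the sketch below. The field name `fasttext_score` and the choice to keep entries whose score equals the threshold are assumptions; check the files written by main.py for the actual key.

```python
import json


def filter_by_score(lines, threshold: float = 0.5,
                    score_key: str = "fasttext_score") -> list[dict]:
    """Keep jsonl records whose score is at least `threshold`.

    Assumption: each non-empty line is a JSON object and the score is
    stored under `score_key`; records without a score are dropped.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get(score_key, 0.0) >= threshold:
            kept.append(record)
    return kept


sample = [
    '{"text": "a", "fasttext_score": 0.9}',
    '{"text": "b", "fasttext_score": 0.2}',
    '',
    '{"text": "c"}',
]
high_quality = filter_by_score(sample)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly so that large files are streamed rather than loaded at once.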

Citation

Please cite the paper if you use the data or code in this repo.

@misc{chen2023chinesewebtext,
      title={ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model}, 
      author={Jianghao Chen and Pu Jian and Tengxiao Xi and Dongyi Yi and Qianlong Du and Chenglin Ding and Guibo Zhu and Chengqing Zong and Jinqiao Wang and Jiajun Zhang},
      year={2023},
      eprint={2311.01149},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}