Hi, thanks for the great work! I have some questions about how you preprocess the Wikipedia dataset. Judging from the script link you provide, I am not sure whether the LDDL package is used when preprocessing the pre-training data.
Your model is based on the NVIDIA/BERT model, and according to their README.md, they run create_datasets_from_start.sh to download Wikipedia with the LDDL downloader (download_wikipedia), and then use run_pretraining.sh to preprocess the data with the LDDL preprocessor (preprocess_bert_pretrain) and load balancer (balance_dask_output). The preprocessed data, in the form of "balanced parquet shards" according to the LDDL documentation, are then fed to lddl.torch.get_bert_pretrain_data_loader for BERT pre-training.
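If I read the LDDL documentation correctly, the consuming side then looks roughly like the sketch below. The paths and keyword arguments here are placeholders based on my reading of the LDDL README rather than your actual code, so please correct me if the real call differs:

```python
# Rough sketch of how I understand the balanced parquet shards are consumed.
# The shard directory, vocab path, and data_loader_kwargs values are placeholders
# I made up; only get_bert_pretrain_data_loader itself comes from the LDDL package.
import lddl.torch

train_dataloader = lddl.torch.get_bert_pretrain_data_loader(
    'data/pretrain/parquet',        # directory of balanced parquet shards (placeholder path)
    local_rank=0,                   # rank of the current GPU process
    vocab_file='vocab/vocab.txt',   # placeholder path to the BERT vocab file
    data_loader_kwargs={
        'batch_size': 64,           # example per-device batch size
        'num_workers': 4,
        'pin_memory': True,
    },
)

for batch in train_dataloader:
    pass  # each batch feeds one masked-LM pre-training step
```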
The data download and preprocessing script data_download.sh that you provide is, as far as I understand, the old version of the NVIDIA/BERT data pipeline and is no longer used by them (I checked the commit history to confirm this). Running data_download.sh calls the old version of create_datasets_from_start.sh and produces hdf5 files without using the LDDL package at all. However, I notice that in your src/run_pretraining.py, line 261, you do use lddl.torch.get_bert_pretrain_data_loader. This leaves me a little confused about which preprocessing path I should follow to reproduce your results: should I preprocess the data with the LDDL package, or use data_download.sh to create the hdf5 files?

Additionally, many scripts referenced in data_download.sh (download_wikipedia.sh, the old version of create_datasets_from_start.sh, etc.) no longer exist in the current version of NVIDIA/BERT. It would be very helpful if you could provide the scripts you used, or specify the commit you referenced!
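To make the mismatch behind my first question concrete: my understanding is that the old data_download.sh path produces hdf5 shards that the previous NVIDIA/BERT run_pretraining.py read directly with h5py, roughly as sketched below, whereas get_bert_pretrain_data_loader expects the LDDL parquet shards instead. The file name and dataset keys are only my recollection of the old hdf5 format, so treat them as assumptions:

```python
# Sketch of reading one hdf5 shard from the old pipeline. The shard name and the
# dataset keys are assumptions based on the old NVIDIA/BERT hdf5 format as I
# remember it, not taken from your code.
import h5py

with h5py.File('books_wiki_en_corpus_training_0.hdf5', 'r') as f:  # hypothetical shard name
    input_ids = f['input_ids'][:]
    input_mask = f['input_mask'][:]
    segment_ids = f['segment_ids'][:]
    masked_lm_positions = f['masked_lm_positions'][:]
    masked_lm_ids = f['masked_lm_ids'][:]
    next_sentence_labels = f['next_sentence_labels'][:]
```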
Thanks for taking the time to read my questions. I appreciate your kind help!