Hi, thanks for the great work! I have some questions about how you preprocess the Wikipedia dataset. Judging from the script link you provide, I am not sure whether the LDDL package is used when preprocessing the pre-training data.
Your model is based on the NVIDIA/BERT model, and according to their README.md, they run create_datasets_from_start.sh to download Wikipedia with the LDDL downloader (download_wikipedia), and then use run_pretraining.sh to preprocess the data with the LDDL preprocessor (preprocess_bert_pretrain) and load balancer (balance_dask_output). The preprocessed data, in the form of "balanced parquet shards" according to the LDDL documentation, are then fed to lddl.torch.get_bert_pretrain_data_loader for BERT pre-training.
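If I read the LDDL documentation correctly, the consuming side then looks roughly like the sketch below. The paths and keyword arguments here are placeholders based on my reading of the LDDL README rather than your actual code, so please correct me if the real call differs:

```python
# Rough sketch of how I understand the balanced parquet shards are consumed.
# The shard directory, vocab path, and data_loader_kwargs values are placeholders
# I made up; only get_bert_pretrain_data_loader itself comes from the LDDL package.
import lddl.torch

train_dataloader = lddl.torch.get_bert_pretrain_data_loader(
    'data/pretrain/parquet',        # directory of balanced parquet shards (placeholder path)
    local_rank=0,                   # rank of the current GPU process
    vocab_file='vocab/vocab.txt',   # placeholder path to the BERT vocab file
    data_loader_kwargs={
        'batch_size': 64,           # example per-device batch size
        'num_workers': 4,
        'pin_memory': True,
    },
)

for batch in train_dataloader:
    pass  # each batch feeds one masked-LM pre-training step
```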
The data download and preprocessing script data_download.sh that you provide is, as far as I understand, the old version of the NVIDIA/BERT data pipeline and is no longer used by them (I checked the commit history to confirm this). Running data_download.sh calls the old version of create_datasets_from_start.sh and produces hdf5 files without using the LDDL package at all. However, I notice that in your src/run_pretraining.py, line 261, you do use lddl.torch.get_bert_pretrain_data_loader. This leaves me a little confused about which preprocessing path I should follow to reproduce your results: should I preprocess the data with the LDDL package, or use data_download.sh to create the hdf5 files?

Additionally, many scripts referenced in data_download.sh (download_wikipedia.sh, the old version of create_datasets_from_start.sh, etc.) no longer exist in the current version of NVIDIA/BERT. It would be very helpful if you could provide the scripts you used, or specify the commit you referenced!
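To make the mismatch behind my first question concrete: my understanding is that the old data_download.sh path produces hdf5 shards that the previous NVIDIA/BERT run_pretraining.py read directly with h5py, roughly as sketched below, whereas get_bert_pretrain_data_loader expects the LDDL parquet shards instead. The file name and dataset keys are only my recollection of the old hdf5 format, so treat them as assumptions:

```python
# Sketch of reading one hdf5 shard from the old pipeline. The shard name and the
# dataset keys are assumptions based on the old NVIDIA/BERT hdf5 format as I
# remember it, not taken from your code.
import h5py

with h5py.File('books_wiki_en_corpus_training_0.hdf5', 'r') as f:  # hypothetical shard name
    input_ids = f['input_ids'][:]
    input_mask = f['input_mask'][:]
    segment_ids = f['segment_ids'][:]
    masked_lm_positions = f['masked_lm_positions'][:]
    masked_lm_ids = f['masked_lm_ids'][:]
    next_sentence_labels = f['next_sentence_labels'][:]
```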
Thanks for taking the time to read my questions. I appreciate your kind help!