Same question here: should we merge all the JSON files into one?
The script I used to merge the wiki_xx files into one:
#!/bin/bash
# Args: $1 = path to preprocess_data.py, $2 = vocab file, $3 = wikiextractor output dir, $4 = output dir
SCRIPT=$1      # e.g. tools/preprocess_data.py
VOCAB=$2       # e.g. bert-vocab.txt
WIKI_DIR=$3
OUTDIR=$4

mkdir -p "$OUTDIR"
rm -f "$OUTDIR/wiki_all.json"
touch "$OUTDIR/wiki_all.json"

# Concatenate every extracted wiki_xx file (one JSON object per line) into a single file.
find "$WIKI_DIR" -type f -print0 |
while IFS= read -r -d '' line; do
    filename=$(basename "$line")
    subfilename=$(basename "$(dirname "$line")")
    prefix="${subfilename}_${filename}"
    echo "Processing $prefix, $filename, $line"
    cat "$line" >> "$OUTDIR/wiki_all.json"
done

# Build the .bin/.idx pair from the merged file.
python "$SCRIPT" \
    --input "$OUTDIR/wiki_all.json" \
    --output-prefix "$OUTDIR/my-bert" \
    --vocab "$VOCAB" \
    --dataset-impl mmap \
    --tokenizer-type BertWordPieceLowerCase \
    --split-sentences
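I invoke it roughly like this (the script name and paths are just placeholders for whatever you use locally):
bash merge_and_preprocess.sh tools/preprocess_data.py bert-vocab.txt data/wiki data/out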
Yeah - it turns out that you can/should merge into a single file and process that.
@haim-barad Have you ever used a single-file-generated .bin for large-scale training? I am concerned that a single file will lead to an I/O bottleneck.
We have not had any issues with large single files on Lustre or NFS file systems. If you want to use multiple files, you can break your input data into multiple files and use dataset blending (example here).
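For reference, a blended input is passed to --data-path as a space-separated list of weight/prefix pairs; the weights and prefixes below are only illustrative:
DATA_PATH="0.3 data/out/wiki_part1_text_sentence 0.7 data/out/wiki_part2_text_sentence"
# then pass it to pretrain_bert.py as usual: --data-path $DATA_PATH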
I am trying to train BERT with Megatron-LM on the wiki dataset. I followed the instructions to download and pre-process the latest wiki dump and extract the text with WikiExtractor.py, as described here.
For wikiextractor, the command looks like:
python -m wikiextractor.WikiExtractor data/enwiki-latest-pages-articles.xml.bz2 -o data/wiki --json
This creates JSON files organized into folders, where each folder contains wiki_<00-99> files. Each file has multiple JSON objects, one object per line.
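A line looks roughly like this (field names as emitted by wikiextractor --json; the values here are made up):
{"id": "1234", "url": "https://en.wikipedia.org/wiki?curid=1234", "title": "Example article", "text": "First sentence of the article. Second sentence. ..."}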
This generates a total of 16046 such JSON files. To pre-process them with the preprocess_data.py script, I had to write a bash script that runs over all of these files, producing ~16k pairs of .idx and .bin files.
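The loop was roughly of the following shape (output naming is illustrative; with --split-sentences, preprocess_data.py writes <prefix>_text_sentence.bin/.idx for each input):
find data/wiki -type f -name 'wiki_*' | while IFS= read -r f; do
    # build a unique output prefix from the folder and file names
    prefix=$(basename "$(dirname "$f")")_$(basename "$f")
    python tools/preprocess_data.py \
        --input "$f" \
        --output-prefix "data/out/$prefix" \
        --vocab bert-vocab.txt \
        --dataset-impl mmap \
        --tokenizer-type BertWordPieceLowerCase \
        --split-sentences
done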
With all this pre-processing done, I now want to train BERT with pretrain_bert.py as outlined here. But it looks like the DATA_PATH variable only takes the prefix of a single .idx/.bin pair. How do I use all of the 16k+ pairs of .idx/.bin files for BERT pretraining?
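In the pretrain script I am currently setting something like this (the prefix name is illustrative):
DATA_PATH=data/out/my-bert_text_sentence
# passed to pretrain_bert.py as: --data-path $DATA_PATH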