NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

In BERT pretraining how to specify DATA_PATH to take multiple files #117

Closed · armundle closed this 2 months ago

armundle commented 3 years ago

I am trying to train Megatron-LM BERT on the Wikipedia dataset. I followed the instructions to download and pre-process the latest wiki dump and extract the text with WikiExtractor.py as described here.

For wikiextractor, the command looks like:

python -m wikiextractor.WikiExtractor data/enwiki-latest-pages-articles.xml.bz2 -o data/wiki --json

This creates JSON files in a folder structure like:

.
├── AA
├── AB
├── AC
├── AD
├── AE
├── ...
├── GD
└── GE

where each folder contains JSON files named wiki_00 through wiki_99:

├── wiki_00
├── wiki_01
├── ...
├── wiki_98
└── wiki_99

Each file contains multiple JSON objects, one per line, e.g.

{"id": "620", "revid": "15996738", "url": "https://en.wikipedia.org/wiki?curid=620", "title": "Animal Farm", "text": "Animal Farm is an allegorical novella by George Orwell, first published in ... but ran in Brazilian and Burmese newspapers."}
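
As a quick sanity check (a sketch, using the paths from above; since there is one JSON object per line, the total line count equals the document count):

find data/wiki -type f -name 'wiki_*' | wc -l                          # number of shard files
find data/wiki -type f -name 'wiki_*' -print0 | xargs -0 cat | wc -l   # total documents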

This generates a total of 16046 such JSON files. In order to pre-process them with the preprocess_data.py script, I had to create a bash script that processes all of these files and creates ~16k pairs of .idx and .bin files:

# Usage: preprocess_all.sh <preprocess_data.py> <vocab-file> <wiki-dir> <out-dir>
SCRIPT=$1
VOCAB=$2
WIKI_DIR=$3
OUTDIR=$4

find "$WIKI_DIR" -type f -print0 |
        while IFS= read -r -d '' line; do
                filename=$(basename "$line")                  # e.g. wiki_00
                subdirname=$(basename "$(dirname "$line")")   # e.g. AA
                prefix="${subdirname}_${filename}"            # e.g. AA_wiki_00
                echo "Processing $prefix"
                python "$SCRIPT" --input "$line" \
                        --output-prefix "${OUTDIR}/megatron-bert-${prefix}" \
                        --vocab "$VOCAB" \
                        --dataset-impl mmap \
                        --tokenizer-type BertWordPieceLowerCase \
                        --split-sentences
        done
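
For reference, a hypothetical invocation (the script name and directory layout are placeholders):

bash preprocess_all.sh tools/preprocess_data.py bert-vocab.txt data/wiki preproc_out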

This creates ~16k .idx/.bin pairs, e.g.

megatron-bert-AA_wiki_00_text_sentence.bin,
megatron-bert-AA_wiki_00_text_sentence.idx,
...
megatron-bert-GE_wiki_47_text_sentence.bin,
megatron-bert-GE_wiki_47_text_sentence.idx

With all this pre-processing done, I now want to train BERT with pretrain_bert.py as outlined here. But it looks like the DATA_PATH variable only takes a single .idx/.bin pair prefix.

How do I use all of the 16k+ pairs of .idx/.bin for BERT pretraining?

Lyken17 commented 3 years ago

Same question here. Should we merge all the JSON files into one?

Here is the script I used to merge the wiki_xx files into one:

# Usage: merge_and_preprocess.sh <preprocess_data.py> <vocab-file> <wiki-dir> <out-dir>
SCRIPT=$1
VOCAB=$2
WIKI_DIR=$3
OUTDIR=$4

mkdir -p "$OUTDIR"
rm -f "$OUTDIR/wiki_all.json"    # -f: don't fail if the file doesn't exist yet
touch "$OUTDIR/wiki_all.json"

# Concatenate every extracted shard into a single loose-JSON file.
find "$WIKI_DIR" -type f -print0 |
    while IFS= read -r -d '' line; do
            filename=$(basename "$line")                  # e.g. wiki_00
            subdirname=$(basename "$(dirname "$line")")   # e.g. AA
            echo "Processing ${subdirname}_${filename}"
            cat "$line" >> "$OUTDIR/wiki_all.json"
    done

python "$SCRIPT" \
       --input "$OUTDIR/wiki_all.json" \
       --output-prefix "$OUTDIR/my-bert" \
       --vocab "$VOCAB" \
       --dataset-impl mmap \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences
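
This can be run the same way (again, the script name is a placeholder):

bash merge_and_preprocess.sh tools/preprocess_data.py bert-vocab.txt data/wiki preproc_out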

haim-barad commented 3 years ago

Yeah, it turns out that you can (and should) merge everything into a single file and process that.

Lyken17 commented 3 years ago

@haim-barad Have you ever tried training at large scale from a single-file-generated .bin? I am concerned that a single file will lead to an I/O bottleneck.

shoeybi commented 3 years ago

We have not had any issues with large single files on Lustre or NFS file systems. If you want to use multiple files, you can break your input data into multiple files and use dataset blending (example here).
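
For the multi-file route, here is a sketch (not from the thread) of how one might build the blended data path over all ~16k prefixes with equal weights. Megatron's --data-path accepts a flat list of weight/prefix pairs, where each prefix is the shared filename without the .idx/.bin extension and the weights are relative:

# Sketch only: collect every preprocessed prefix with an equal weight of 1.
# The directory name "preproc_out" is a placeholder for OUTDIR above.
DATA_PATH=""
for idx in preproc_out/megatron-bert-*_text_sentence.idx; do
    prefix="${idx%.idx}"               # --data-path expects the prefix, no extension
    DATA_PATH="$DATA_PATH 1 $prefix"
done

The resulting string can then be passed to pretrain_bert.py as --data-path $DATA_PATH (unquoted, so the weight/prefix pairs stay separate arguments).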

github-actions[bot] commented 1 year ago

Marking as stale. No activity in 60 days. Remove stale label or comment or this will be closed in 7 days.
