aws-samples / aws-parallelcluster-megatron

MIT No Attribution

Binary file creation failure #1

Open amrragab8080 opened 3 years ago

amrragab8080 commented 3 years ago

While running through the data prep stages, I get an error during the binary file creation step. Command line:

python3.8 /root/megatron/tools/preprocess_data.py --input /data/wiki/mergedfile.json --output-prefix my-gpt2 --vocab /data/gpt2/gpt2-vocab.json --dataset-impl mmap --tokenizer-type GPT2BPETokenizer --merge-file /data/gpt2/gpt2-merges.txt --append-eod --workers 70

Error:

Processed 94600 documents (24993.62426965924 docs/s, 19.35331771787548 MB/s).
Processed 94700 documents (25008.826667177083 docs/s, 19.35876833851978 MB/s).
Processed 94800 documents (25024.40758830729 docs/s, 19.369831297664188 MB/s).
Processed 94900 documents (25035.71006438949 docs/s, 19.380129956422056 MB/s).
Processed 95000 documents (25037.019340941308 docs/s, 19.381847915168716 MB/s).
Processed 95100 documents (25003.31820662113 docs/s, 19.360489537557633 MB/s).
Processed 95200 documents (25014.108701169043 docs/s, 19.359326478313687 MB/s).
Processed 95300 documents (24980.31228486107 docs/s, 19.330831118094995 MB/s).
Processed 95400 documents (24992.166169148823 docs/s, 19.338341927391188 MB/s).
Processed 95500 documents (25004.716646922676 docs/s, 19.341969667809646 MB/s).
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/root/megatron/tools/preprocess_data.py", line 79, in encode
    data = json.loads(json_line)
  File "/usr/local/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 13873 (char 13872)
"""
amrragab8080 commented 3 years ago

I tried multiple wiki dumps. All of them result in the same error, just after a different number of processed documents.
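Since the traceback only reports a column offset inside one record, it can help to scan the merged file directly and list the lines that do not parse. The following is a minimal sketch, assuming the merged file holds one JSON object per line (which is how `preprocess_data.py` reads it in the traceback above). The function name `find_invalid_json_lines` and the `limit` parameter are hypothetical and not part of Megatron.

```python
import json
import sys


def find_invalid_json_lines(path, limit=10):
    """Return up to `limit` (line_number, error_message) pairs for lines that fail to parse."""
    bad = []
    # errors="replace" keeps the scan going even if interleaved writes split a multi-byte character.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                bad.append((line_number, str(exc)))
                if len(bad) >= limit:
                    break
    return bad


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path is an example): python3 check_merged.py /data/wiki/mergedfile.json
    for line_number, message in find_invalid_json_lines(sys.argv[1]):
        print(f"line {line_number}: {message}")
```

If the reported lines differ from one merge run to the next, that points at the merge step rather than at the source data.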

amrragab8080 commented 3 years ago

I was able to diagnose the problem. The following statement in your instructions is not safe to run in parallel on certain file systems: the parallel cat processes all append to the same file, which left invalid JSON in the merged file.

find /lustre/data/wiki/text/ -name wiki* | parallel -m -j 70 "cat {} >> mergedfile.json"

You can replace it with the following, without impacting performance:

find /lustre/data/wiki/text/ -name wiki* | parallel -m -j 1 "cat {} >> mergedfile.json"
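The single-writer idea behind the `-j 1` fix can also be sketched in Python, for cases where GNU parallel is not needed at all. This is only an illustration under stated assumptions: the helper name `merge_sequentially` is hypothetical, the `wiki*` pattern mirrors the `find` command above, and the output file is rewritten from scratch rather than appended to.

```python
import shutil
from pathlib import Path


def merge_sequentially(source_dir, merged_path, pattern="wiki*"):
    """Copy every file matching `pattern` under `source_dir` into `merged_path`, one at a time.

    Only one process ever writes to the output, so records from different
    source files cannot interleave. Returns the number of files merged.
    """
    sources = sorted(p for p in Path(source_dir).rglob(pattern) if p.is_file())
    # "wb" truncates any existing output, unlike the shell's ">>" append.
    with open(merged_path, "wb") as merged:
        for source in sources:
            with open(source, "rb") as handle:
                shutil.copyfileobj(handle, merged)
    return len(sources)


# Example call, using the paths from this thread:
# merge_sequentially("/lustre/data/wiki/text/", "mergedfile.json")
```

Like `cat`, this copies bytes verbatim, so it relies on every source file ending with a newline to keep one JSON object per line.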

mhuguesaws commented 3 years ago

@amrragab8080 thanks for reporting. Could you submit a PR with the fix?