e.g., to reproduce the GPT 1.5B model:

```bash
PBS_O_WORKDIR=$(pwd) NO_LLAMA=1 TOKENIZER_TYPE="gpt" NLAYERS=48 HIDDEN=1536 SEQ=2048 HEADS=24 bash train_llama_alcf.sh
```
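For reference, a rough sketch of how those overrides presumably get wired into Megatron's CLI flags inside `train_llama_alcf.sh` (the exact variable names / plumbing in the script may differ):

```bash
# hypothetical sketch of the env-var -> Megatron flag mapping used above;
# the actual wiring inside train_llama_alcf.sh may differ
MODEL_ARGS=(
  --num-layers "${NLAYERS}"           # 48
  --hidden-size "${HIDDEN}"           # 1536
  --seq-length "${SEQ}"               # 2048
  --num-attention-heads "${HEADS}"    # 24
)
```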
though, for some reason, this seems to crash only on the first attempt (?? 🤔) with:

```
[rank1]: ValueError: mmap length is greater than file size
```
[Take 1]: running the first time seems to cause it to crash with:

```
[rank1]: ValueError: mmap length is greater than file size
```

(traceback collapsed)
[Take 2]: re-running immediately after (in the same session, with the exact same command), it then seems to work fine (??):

(traceback collapsed)

🤷🏻‍♂️
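fwiw, that exact message is raised by Python's `mmap` module whenever the requested mapping length exceeds the file's size on disk — e.g. if a rank maps a dataset index / `.bin` file before whatever is writing it has finished, which would be consistent with the second attempt working once the file exists at its full size. A minimal, standalone illustration (made-up `/tmp` path, not one of the actual dataset files):

```bash
# standalone repro of the error class (hypothetical /tmp path, not a Megatron file):
# mmap raises this ValueError whenever the requested length exceeds the file size
truncate -s 10 /tmp/tiny.bin
python3 -c "
import mmap
with open('/tmp/tiny.bin', 'rb') as f:
    mmap.mmap(f.fileno(), 100, access=mmap.ACCESS_READ)
"
# -> ValueError: mmap length is greater than file size
```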
though, now that I'm thinking about it, I've also got the changes from @zhenghh04's PR (Distributed data loading #16), which might be impacting the behavior here??
```
$ git status
On branch llama-toggle
Your branch is up to date with 'origin/llama-toggle'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   ALCF/test_blendable_dataset.py
        modified:   megatron/__init__.py
        modified:   megatron/data/blendable_dataset.py
        modified:   megatron/data/gpt_dataset.py
        modified:   megatron/model/transformer.py
        modified:   megatron/utils.py
```
at any rate, since it works, and we're not planning to target the GPT architecture explicitly in the future[^isolated], this seems good enough for now??
[^isolated]: and, as far as I can tell, this issue only seems to pop up when using the `GPT2BPETokenizer` with the `BookCorpusDataset`.
Adds the ability to toggle `"${LLAMA_ARGS}"` on / off by setting `NO_LLAMA=1`, which will disable the `"${LLAMA_ARGS}"` that are built into the `run_cmd` from here.
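In case it helps to see the shape of it, the toggle amounts to something like the sketch below (illustrative only — the actual contents of `LLAMA_ARGS` and the surrounding script differ):

```bash
# illustrative sketch of the NO_LLAMA toggle; the real LLAMA_ARGS in
# train_llama_alcf.sh carry the actual Llama-specific flags
LLAMA_ARGS="--swiglu --use-rotary-position-embeddings"   # example flags only
if [[ "${NO_LLAMA:-0}" == "1" ]]; then
  LLAMA_ARGS=""   # fall back to the plain GPT architecture
fi
run_cmd="python pretrain_gpt.py ${LLAMA_ARGS} ..."   # '...' stands in for the remaining flags
```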