allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

slurm script for: configs/official/OLMo-7B.yaml #699


andymvp2018 commented 1 month ago

❓ The question

Do you know the slurm script used for configs/official/OLMo-7B.yaml? I'm looking for a multi-node slurm script.

2015aroras commented 1 month ago

I'm not sure what exact script was used, but something like https://github.com/allenai/OLMo/blob/main/scripts/lumi/mitchish70.sh may be adaptable to your purposes. That script does not set any architecture-related settings.
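Not the exact script either, but as a rough starting point, here is a minimal multi-node sketch assuming a generic slurm cluster with 8 GPUs per node and a working PyTorch environment. The node count, partition name, CPU count, and rendezvous port are placeholders to adjust for your cluster; none of this is from the official OLMo run.

```bash
#!/bin/bash
#SBATCH --job-name=olmo-7b
#SBATCH --nodes=8                 # placeholder: pick your node count
#SBATCH --ntasks-per-node=1       # one launcher per node; torchrun spawns the GPU workers
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=64        # placeholder
#SBATCH --time=48:00:00
#SBATCH --partition=gpu           # placeholder partition name

# Rendezvous endpoint for torch.distributed (first node in the allocation).
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_id="$SLURM_JOB_ID" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
  scripts/train.py configs/official/OLMo-7B.yaml
```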

andymvp2018 commented 1 month ago

Thanks @2015aroras, two questions:

  1. If I set the device train micro-batch size, will this override the global batch size?
  2. What are these?

B"$PROJECT_DIR:$PROJECT_DIR" \ -B"$FLASH_DIR:$FLASH_DIR" \ -B"$SCRATCH_DIR:$SCRATCH_DIR" \ -B /opt/cray:/opt/cray \ -B /usr/lib64/libcxi.so.1:/usr/lib64/libcxi.so.1 \ -B /usr/lib64/libjson-c.so.3:/usr/lib64/libjson-c.so.3 \ $PROJECT_DIR/containers/$OLMO_CONTAINER \

2015aroras commented 1 month ago

  1. The global batch size is the total number of training instances processed in the current step. We split a batch across our GPUs, so each device gets a smaller 'device' batch (global size / num devices). A GPU doesn't have enough memory to do the whole device batch in one forward + backward pass, so we split the device batch into multiple micro-batches and do separate forward + backward passes. After all the micro-batches are done, we do the optimizer step. Overall, the micro-batch size is just about avoiding memory issues and getting good performance; it should not affect training results. You'll want the micro-batch size to be a divisor of the device batch size (see the worked example after this list).

  2. Our slurm jobs run in singularity containers (there may be ways to use other container runtimes on your system). The -B flags bind-mount directories and files from outside the container into the container, and $PROJECT_DIR/containers/$OLMO_CONTAINER is the path to the container image itself (a short bind-mount illustration follows below).
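To make the arithmetic in point 1 concrete, here's a small sketch with made-up numbers (they are illustrative, not the values in OLMo-7B.yaml):

```bash
# Illustrative numbers only, not taken from OLMo-7B.yaml.
GLOBAL_BATCH_SIZE=2048   # sequences per optimizer step, summed across all GPUs
NUM_NODES=32
GPUS_PER_NODE=8
MICRO_BATCH_SIZE=2       # sequences per forward+backward pass on one GPU

WORLD_SIZE=$((NUM_NODES * GPUS_PER_NODE))             # 256 GPUs
DEVICE_BATCH_SIZE=$((GLOBAL_BATCH_SIZE / WORLD_SIZE)) # 2048 / 256 = 8 per GPU per step
ACCUM_STEPS=$((DEVICE_BATCH_SIZE / MICRO_BATCH_SIZE)) # 8 / 2 = 4 micro-batches per step

echo "each GPU runs $ACCUM_STEPS forward+backward passes of $MICRO_BATCH_SIZE sequences per optimizer step"
```

Changing the micro-batch size only changes how many passes happen before each optimizer step; the global batch size stays whatever the config says.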
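And to illustrate what a single -B binding does, here's a toy example (the paths and image name are made up):

```bash
# Hypothetical paths and image name, purely to show -B semantics:
# the host directory /data/my_corpus becomes visible inside the container
# at the same path, so anything running in the container can read it.
singularity exec -B /data/my_corpus:/data/my_corpus my_olmo_container.sif \
  ls /data/my_corpus
```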