andymvp2018 opened 1 month ago
I'm not sure what exact script was used, but something like https://github.com/allenai/OLMo/blob/main/scripts/lumi/mitchish70.sh may be adaptable to your purposes. That script does not set any architecture-related settings.
Thanks @2015aroras, two questions:
```shell
-B"$PROJECT_DIR:$PROJECT_DIR" \
-B"$FLASH_DIR:$FLASH_DIR" \
-B"$SCRATCH_DIR:$SCRATCH_DIR" \
-B /opt/cray:/opt/cray \
-B /usr/lib64/libcxi.so.1:/usr/lib64/libcxi.so.1 \
-B /usr/lib64/libjson-c.so.3:/usr/lib64/libjson-c.so.3 \
$PROJECT_DIR/containers/$OLMO_CONTAINER \
```
The global batch size is the number of examples processed in the current step. We split a batch across our GPUs, so each device has a smaller 'device' batch size (global size / num devices). A GPU doesn't have enough memory to do the whole device batch in 1 forward + backward pass, so we split the device batch into multiple micro batches and do separate forward + backward passes. After all the micro batches are done, we do the optimizer step. Overall, micro batch size is just about avoiding memory issues and getting good perf; it should not affect training results. You'll want the micro batch size to be a divisor of the device batch size.
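To make the relationship concrete, here is a small sketch (not OLMo's actual code) with made-up sizes and a toy scalar model, showing that accumulating gradients over micro batches gives the same result as one pass over the full device batch:

```python
# Hypothetical sizes for illustration only.
global_batch_size = 2048   # examples per optimizer step
num_devices = 64           # total GPUs across all nodes
micro_batch_size = 8       # sized to fit in one GPU's memory

device_batch_size = global_batch_size // num_devices       # 2048 / 64 = 32
assert device_batch_size % micro_batch_size == 0, \
    "micro batch size should divide the device batch size"
grad_accum_steps = device_batch_size // micro_batch_size   # 32 / 8 = 4

# Toy model: scalar w with squared-error loss (w - y)^2 per example.
w = 0.0
device_batch = [float(y) for y in range(device_batch_size)]

def grad(w, y):
    # d/dw of (w - y)^2
    return 2.0 * (w - y)

# Gradient from one pass over the whole device batch...
full_grad = sum(grad(w, y) for y in device_batch) / device_batch_size

# ...matches the scaled sum of per-micro-batch gradients.
accum = 0.0
for i in range(grad_accum_steps):
    micro = device_batch[i * micro_batch_size:(i + 1) * micro_batch_size]
    micro_grad = sum(grad(w, y) for y in micro) / micro_batch_size
    accum += micro_grad / grad_accum_steps   # scale so the totals match
assert abs(accum - full_grad) < 1e-9
```

This is why micro batching changes memory use and throughput but not the training results.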
Our slurm jobs run in singularity containers (maybe there are ways to use other types of containers on your system). The `-B` flags mount directories from outside the container into the container. `$PROJECT_DIR/containers/$OLMO_CONTAINER` is the location of the container image.
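Putting the pieces together, a minimal multi-node sbatch sketch might look like the following. This is an assumption-heavy outline modeled on the LUMI script, not a verified OLMo-7B launch script: the node/GPU counts and all paths are hypothetical placeholders you'd replace with your cluster's values.

```shell
#!/bin/bash
#SBATCH --job-name=olmo-7b
#SBATCH --nodes=4                # hypothetical node count
#SBATCH --ntasks-per-node=8      # one task per GPU (hypothetical)
#SBATCH --gpus-per-node=8        # hypothetical GPU count
#SBATCH --time=48:00:00

# Hypothetical paths; replace with your cluster's locations.
PROJECT_DIR=/path/to/project
OLMO_CONTAINER=olmo.sif

srun singularity exec \
    -B"$PROJECT_DIR:$PROJECT_DIR" \
    "$PROJECT_DIR/containers/$OLMO_CONTAINER" \
    python scripts/train.py configs/official/OLMo-7B.yaml
```

You'd also need the bind mounts and distributed-launch environment variables that mitchish70.sh sets for its cluster, adjusted to your own system.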
❓ The question
Do you know the slurm script used for configs/official/OLMo-7B.yaml? I'm looking for a multi-node slurm script.