Open xbzjsj opened 5 months ago
Will running the existing code on a single GPU cause an overflow?
Hi! You can adapt the script to use 1 GPU like so:
Replace something that looks like this:
(trap 'kill 0' SIGINT; \
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
&
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 1 --rank 1 \
&
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 2 --rank 2 \
&
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 3 --rank 3 \
&
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 4 --rank 4 \
&
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 5 --rank 5 \
&
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 6 --rank 6 \
&
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 7 --rank 7 \
& \
wait)
With this:
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 0 --rank 0
And don't forget to change world-size and pipeline-group-size to match the single process (1 each).
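To make the change above concrete, here is a minimal dry-run sketch. It assumes the single-GPU setup needs `--world-size 1` and `--pipeline-group-size 1`, which the thread implies but does not spell out. The ARGS string is shortened for illustration, and the `SINGLE_ARGS` and `CMD` names are hypothetical. The command is only printed, not executed, so it can be checked first.

```shell
# Shortened ARGS for illustration; keep your own full argument list.
ARGS="--model-type opt --fp16 --world-size 2 --pipeline-group-size 2 --data-group-size 1"

# Assumption: a single process means both group sizes become 1.
SINGLE_ARGS=$(echo "${ARGS}" | sed -E \
  -e 's/--world-size [0-9]+/--world-size 1/' \
  -e 's/--pipeline-group-size [0-9]+/--pipeline-group-size 1/')

# Build the single-process command (rank 0 on GPU 0) and print it for review.
CMD="python dist_inference_runner.py ${SINGLE_ARGS} --cuda-id 0 --rank 0"
echo "${CMD}"
```

Once the printed command looks right, it can be run directly. Whether the model fits in the memory of one GPU is a separate question that this sketch does not answer.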
It still gives some errors related to NCCL.
file=./c4_train/c4_train.jsonl
echo "start running ${file}"
# ARGS="--model-name /lustre/fsw/nvresearch/ldm/diffusion/checkpoint/opt-175b-new \
ARGS="--model-name /root/shared/opt-30b-new \
--model-type opt \
--seed 42 \
--fp16 \
--num-layers 24 \
--max-layers 48 \
--budget 22800 \
--num-iters 2000 \
--dist-url tcp://127.0.0.1:9032 \
--token-micro-batch-size 1 \
--world-size 2 --pipeline-group-size 2 --data-group-size 1 \
--pp-mode pipe_sync_sample_mask_token_pipe \
--infer-data ${file}"
(trap 'kill 0' SIGINT; \
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
&
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 1 --rank 1 \
&
# python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 2 --rank 2 \
# &
# python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 3 --rank 3 \
# &
# python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 4 --rank 4 \
# &
# python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 5 --rank 5 \
# &
# python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 6 --rank 6 \
# &
# python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 7 --rank 7 \
# & \
wait)
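When trimming the launch block by hand, the number of launched ranks and the `--world-size` value can drift apart. The sketch below is one hedged way to keep them in sync by deriving both from a single variable. `NUM_GPUS` and `BASE_ARGS` are hypothetical names, the argument list is shortened, and setting `--pipeline-group-size` equal to the GPU count is an assumption based on the script above. It is a dry run that only prints the commands. `NCCL_DEBUG=INFO` is a standard NCCL environment variable that makes NCCL print more detail, which may help narrow down the errors.

```shell
NUM_GPUS=2   # hypothetical variable: the single source of truth for the rank count
BASE_ARGS="--model-type opt --fp16 --data-group-size 1"   # shortened for illustration

# Assumption: with data-group-size 1, world-size and pipeline-group-size
# both equal the number of launched processes.
ARGS="${BASE_ARGS} --world-size ${NUM_GPUS} --pipeline-group-size ${NUM_GPUS}"

# Print one launch line per rank instead of running it (dry run).
rank=0
while [ "${rank}" -lt "${NUM_GPUS}" ]; do
  echo "NCCL_DEBUG=INFO python dist_inference_runner.py ${ARGS} --cuda-id ${rank} --rank ${rank} &"
  rank=$((rank + 1))
done
echo "wait"
```

Setting `NUM_GPUS=1` yields a single launch line with both sizes set to 1.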
@2455DD @ariellubonja, could you give me some ideas? I'm still encountering some NCCL-related errors.
Hi @rhmaaa! Can you try re-creating the Docker container from the Dockerfile? It sounds like a missing-library type of error.
How do I run on a single GPU? The code runs with 8 GPUs and uses distributed training. I can't find a single-GPU interface (no sh file for one GPU, and no single-GPU command line).