Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.

An error occurred: KeyError – 't5479' / #588

Closed · wprazuch closed this issue 3 months ago

wprazuch commented 3 months ago

🐛 Bug

An error is raised when running the Thunder inductor compile option (thunder_inductor_cat_cudnn) for lit-gpt models:

  1. dolly-v2-3b; 2 nodes; 8 GPUs per node; FSDP zero2 & zero3:

    raise Exception(f"Unexpected error occurred: {result.stderr}")
    An error occurred: KeyError – 't5905'
    Exception: Unexpected error occurred: /usr/local/lib/python3.10/dist-packages/lightning/fabric/utilities/throughput.py:299: mods argument is not needed anymore, you can stop passing it
    [rank0]: KeyError: 't5905'
  2. phi-2; 1 & 2 nodes; 8 GPUs per node; FSDP zero2 & zero3:

 0: [rank0]:   File "thunder.backward_fn_259", line 461, in backward_fn
 0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/executors/nvfuserex_impl.py", line 402, in __call__
 0: [rank0]:     fd = self.get_fd(to_descriptors(args))
 0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/executors/nvfuserex_impl.py", line 512, in get_fd
 0: [rank0]:     return create_fd(bsyms, input_descriptors, sorted_unique_inputs, sorted_unique_outputs)
 0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/executors/nvfuserex_impl.py", line 274, in create_fd
 0: [rank0]:     translate_bound_symbol(bsym)
 0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/executors/nvfuserex_impl.py", line 264, in translate_bound_symbol
 0: [rank0]:     nvresults = translator(*bsym.args, **bsym.kwargs, fd=fd, lc_to_nv_map=lc_to_nv_map)
 0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/executors/nvfuserex_impl.py", line 1790, in mul
 0: [rank0]:     nvb = getnv(b, fd, lc_to_nv_map)
 0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/executors/nvfuserex_impl.py", line 116, in getnv
 0: [rank0]:     return lc_to_nv_map[x]
 0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/utils.py", line 919, in __getitem__
 0: [rank0]:     return self._dict[key_]
 0: [rank0]: KeyError: 't5479'
  3. phi-2; 1 node, 1 GPU per node:

    "    raise Exception(f""Unexpected error occurred: {result.stderr}"")
    An error occurred: KeyError \xe2\x80\x93 \'t5213\'
    Exception: Unexpected error occurred: /usr/local/lib/python3.10/dist-packages/lightning/fabric/utilities/throughput.py:299: mods argument is not needed anymore, you can stop passing it
    KeyError: \'t5213\'"
  4. phi-2; DDP; 1 & 2 nodes; 8 GPUs per node:

 0: [rank0]:   File "thunder.backward_fn_259", line 461, in backward_fn
 0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/executors/nvfuserex_impl.py", line 402, in __call__
 0: [rank0]:     fd = self.get_fd(to_descriptors(args))
 0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/executors/nvfuserex_impl.py", line 512, in get_fd
 0: [rank0]:     return create_fd(bsyms, input_descriptors, sorted_unique_inputs, sorted_unique_outputs)
 0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/executors/nvfuserex_impl.py", line 274, in create_fd
 0: [rank0]:     translate_bound_symbol(bsym)
 0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/executors/nvfuserex_impl.py", line 264, in translate_bound_symbol
 0: [rank0]:     nvresults = translator(*bsym.args, **bsym.kwargs, fd=fd, lc_to_nv_map=lc_to_nv_map)
 0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/executors/nvfuserex_impl.py", line 1790, in mul
 0: [rank0]:     nvb = getnv(b, fd, lc_to_nv_map)
 0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/executors/nvfuserex_impl.py", line 116, in getnv
 0: [rank0]:     return lc_to_nv_map[x]
 0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/utils.py", line 919, in __getitem__
 0: [rank0]:     return self._dict[key_]
 0: [rank0]: KeyError: 't5213'

To Reproduce

Steps to reproduce the behavior (tested on a Slurm cluster; you can contact me on Slack for more details):

Create a file script.sh:

#!/bin/bash
#SBATCH -A YOUR_DETAILS
#SBATCH -p batch
#SBATCH -J YOUR_DETAILS
#SBATCH -N 2
#SBATCH --ntasks-per-node 8
#SBATCH --time 0:29:00
#SBATCH --mail-type=FAIL
#SBATCH --exclusive

IMAGE="INTERNAL_IMAGE:pjnl-20240607"

TRAINING_CMD="python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
 --model_name phi-2 \
 --micro_batch_size 1 \
 --distributed_mode fsdp \
 --shard_mode zero3 \
 --compile thunder_inductor_cat_cudnn 
"

And so on.
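As a rough sketch only of how the rest of the script might launch TRAINING_CMD (this assumes the cluster uses the Slurm pyxis plugin for container launches and reuses the IMAGE variable defined above; the actual launch line in the original script may differ):

OUTPUT_DIR="${PWD}/output"
mkdir -p "${OUTPUT_DIR}"

# Launch the benchmark inside the container, one task per GPU as requested by --ntasks-per-node 8.
srun --container-image="${IMAGE}" \
     --container-mounts="${OUTPUT_DIR}:/output" \
     bash -c "${TRAINING_CMD}"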

After you are logged into the Slurm cluster, run:

sbatch script.sh

For the 1-node, 1-GPU case, it is enough to run:

mkdir -p output
docker run --pull=always --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864  -v $PWD/output:/output -it INTERNAL_IMAGE:pjnl-20240607

and then, inside the container, run:

python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name phi-2 --compile thunder_inductor_cat_cudnn --micro_batch_size 1

Expected behavior

We should be able to run the training.

Environment

As in the docker image, tested on H100.

Additional Info

It should be possible to reproduce the error by using torchrun on a multi-GPU device.

torchrun --nproc-per-node=8 /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name phi-2 --compile thunder_inductor_cat_cudnn --distributed_mode fsdp --shard_mode zero2 

cc @apaz-cli @carmocca @crcrpar

IvanYashchuk commented 3 months ago

Thank you, Wojciech, for submitting this issue! The root cause is the same as in https://github.com/Lightning-AI/lightning-thunder/issues/292, and the problem is fixed by https://github.com/Lightning-AI/lightning-thunder/commit/860273231b64da7d9c5b242a62183c637823e3aa. The problem occurs because the container you're using is from the 7th of June; using the latest container should resolve it.
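As a side note, one quick way to check whether that fix commit is present in a given container is a sketch like the following, assuming the container ships the git checkout at /opt/pytorch/lightning-thunder (as the tracebacks above suggest):

# merge-base --is-ancestor exits 0 when the given commit is an ancestor of HEAD.
git -C /opt/pytorch/lightning-thunder merge-base --is-ancestor \
    860273231b64da7d9c5b242a62183c637823e3aa HEAD \
    && echo "fix present" || echo "fix missing"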

IvanYashchuk commented 3 months ago

The error is also raised in the single-GPU runs, so I'll remove the distributed label.

wprazuch commented 3 months ago

Thanks, Ivan! We will re-test the configurations with the new container next week; I will also check one of the above configurations with the new container today as a sanity check. I will then close this issue if the error is not present.

By the way, I noted the labels you assigned to the issue; I will keep them in mind for the future. Sorry for not adding them this time.

IvanYashchuk commented 3 months ago

Don't worry about the labels! I don't even know if it's possible to add them without "Collaborator" status.

wprazuch commented 3 months ago

@IvanYashchuk I had some time to re-test it; the issue is now solved with the latest container version. Thanks for the support! All the best, Voytec