mpatel31415 opened 3 months ago
I reproduced the issue manually on a cluster - here you can find full logs: slurm-930652.txt
@vedaanta it sounds like we don't have enough memory to run these models, so we can't expect this to work.
But we shouldn't be crashing, and for some reason it seems we only crash when the cuDNN executor is enabled. Could you follow up here?
Hi all! I wrote recently that the issue was fixed, but I had checked that only for one model (Gemma-7b). The error is still present (checked on INTERNAL_IMAGE:pjnl-20240830 for Mistral-7B-v0.2, longchat-13b-16k and vicuna-7b-v1.5-16k):
Here is the newest reproduction information:
Please use: 2 nodes, each with 8 GPUs. Image "INTERNAL_IMAGE:pjnl-20240830"
Training script:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name Mistral-7B-v0.2 \
--distributed_mode fsdp \
--shard_mode zero2 \
--compile thunder \
--checkpoint_activations True \
--low_precision_mode fp8-delayed-te \
--micro_batch_size 1
Environment:
system.device_product_name DGXH100
system.gpu_driver_version 535.129.03
libraries.cuda 12.6.1.005
libraries.pip.lightning 2.4.0.dev20240728
libraries.pip.lightning-thunder 0.2.0.dev0
libraries.pip.lightning-utilities 0.11.6
libraries.pip.litgpt 0.4.11
libraries.pip.nvfuser 0.2.10+git58dfdc1
libraries.pip.pytorch-lightning 2.4.0
libraries.pip.torch 2.5.0a0+git578b8d7
libraries.pip.torchmetrics 1.4.1
libraries.pip.torchvision 0.19.0a0+d23a6e1
In the most recent run the issue was present in only 3 cases across 2 models ('CodeLlama-34b-hf', 'falcon-40b'). I rechecked 2 of those cases (one per model) with the newest image and the issue was no longer present, so hopefully it will be gone in the next run.
Hi! This issue was recently present in 7 cases, all of them using fp8. Below are reproduction instructions:
Please use:
1 node(s), each with 8 GPUs.
Image "INTERNAL_IMAGE:pjnl-20241011"
Training script:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name stablecode-completion-alpha-3b \
--distributed_mode fsdp \
--shard_mode zero3 \
--compile thunder \
--checkpoint_activations False \
--low_precision_mode fp8-delayed-te \
--micro_batch_size 1
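Since every failing case uses fp8, one quick way to isolate it is to rerun the same command with fp8 disabled. This is only a sketch: the value used to turn fp8 off here (--low_precision_mode none) is an assumption and should be confirmed against the script's --help output.

# Same model and sharding as above, fp8 disabled.
# "none" is an assumed value for --low_precision_mode; confirm with --help.
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name stablecode-completion-alpha-3b \
--distributed_mode fsdp \
--shard_mode zero3 \
--compile thunder \
--checkpoint_activations False \
--low_precision_mode none \
--micro_batch_size 1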
@vedaanta this is now the third most important error in terms of killing mixology runs, preventing us from understanding perf for those models. Can this be prioritized in the near term?
Thanks for the bumps, @mpatel31415 and @tfogal. tl;dr: it does not seem to be a cuDNN issue. Can we get more data points that would point to cuDNN being the culprit?
There are two images I am running with: the stock pjnl-20241011 image, and a variant built on top of it with the cuDNN frontend removed:
FROM INTERNAL:pjnl-20241011
RUN pip uninstall -y nvidia-cudnn-frontend
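As a quick sanity check inside the stripped container (not part of the original repro, just a sketch; the Python module name "cudnn" for the nvidia-cudnn-frontend package is an assumption worth verifying):

# Verify the frontend package is really absent in the second image.
pip show nvidia-cudnn-frontend || echo "nvidia-cudnn-frontend not installed"
python -c "import cudnn" 2>/dev/null && echo "cudnn frontend importable" || echo "cudnn frontend not importable"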
I tried out multiple models with both setups, in single-node and 2-node runs.
With 2 node runs:
- I can reproduce the segmentation fault with pjnl-20241011, with and without cudnn.
- It happens with executors other than thunder_cudnn too
- cudnn is a default executor in thunder now, so specifying thunder_cudnn as done in this bug description should be redundant.
- When dumping cudnn logs without cudnn present in the container, I get nothing (see the logging sketch after this list). This confirms that no other executor (e.g. PyTorch native) is using cudnn internally.
- When using the cudnn executor, I can see that only the sdpa operation is claimed by cudnn.
- The logs seem fine. Nothing that would point to cudnn execution stopping suddenly.
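For reference, one way such logs can be dumped (a sketch, not necessarily the exact commands used above; CUDNN_LOGINFO_DBG/CUDNN_LOGDEST_DBG are documented cuDNN backend logging variables, while the CUDNN_FRONTEND_* names are assumed from the cudnn-frontend docs):

# Enable cuDNN backend API logging (variable names may differ between cuDNN versions).
export CUDNN_LOGINFO_DBG=1
export CUDNN_LOGDEST_DBG=cudnn_backend.log
# Enable cudnn-frontend logging (assumed variable names; verify against the frontend README).
export CUDNN_FRONTEND_LOG_INFO=1
export CUDNN_FRONTEND_LOG_FILE=cudnn_frontend.log
# Then run the same benchmark command as in the repro; empty logs mean cuDNN was never called.
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py ...  # same arguments as above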
With 1 node run:
- I was never able to reproduce the seg faults
- I have tried various combinations of model names, batch sizes
- My jobs either run into cuda OOM or finish successfully.
(This single-node attempt is not so important anyway, since the 2-node repro works reliably.)
Thanks; looks like I misread back in July. This is an fp8 thing not a cudnn thing. Reassigning.
For the most recent set of issues I used this script to reproduce the error:
#!/bin/bash
#SBATCH -A YOUR_ACCOUNT
#SBATCH -p batch
#SBATCH -J YOUR_JOB_NAME
#SBATCH -N 2
#SBATCH --ntasks-per-node 8
#SBATCH --time 0:29:00
#SBATCH --mail-type=FAIL
#SBATCH --exclusive
IMAGE="INTERNAL_IMAGE:pjnl-20241011"
TRAINING_SCRIPT=$(cat << 'EOF'
set -e
NVFUSER_DISABLE=multidevice python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--max_iters 20 \
--warmup_iters 5 \
--model_name tiny-llama-1.1b \
--distributed_mode ddp \
--shard_mode None \
--compile dynamo_thunder \
--checkpoint_activations False \
--low_precision_mode fp8-delayed-te \
--micro_batch_size 20
EOF
)
srun --export=ALL --container-image=${IMAGE} bash -c "${TRAINING_SCRIPT}"
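Assuming the script above is saved as, say, repro_fp8_segfault.sbatch (the file name is arbitrary), it can be submitted with:

sbatch repro_fp8_segfault.sbatch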
🐛 Bug
For a few models (Platypus-30B with FSDP zero3, Gemma-7b with DDP, and vicuna-33b-v1.3 with FSDP zero3) we get a segmentation fault when trying to use fp8 with thunder_cudnn. When using thunder_cudnn with bf16 only, we get an OOM error instead (tested on Gemma-7b).
To Reproduce
Please use: 2 nodes, each with 8 GPUs. Image: "INTERNAL_IMAGE:pjnl-20240705"
Training script:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name Gemma-7b \
--distributed_mode ddp \
--shard_mode None \
--compile thunder_cudnn \
--checkpoint_activations False \
--low_precision_mode fp8-delayed-te \
--micro_batch_size 1
Expected behavior
We should not get a segmentation fault; at worst we should get an OOM error. This issue might be purely related to FP8, but because of the crash I'm not able to verify that in eager mode.
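One way to get more than a bare segmentation fault out of such a run (a sketch, not something from this report) is to enable Python's faulthandler and core dumps, so the crash at least leaves a Python-level traceback and a core file:

ulimit -c unlimited            # allow a core dump to be written
export CUDA_LAUNCH_BLOCKING=1  # surface asynchronous CUDA errors at the failing call
python -X faulthandler /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name Gemma-7b \
--distributed_mode ddp \
--shard_mode None \
--compile thunder_cudnn \
--checkpoint_activations False \
--low_precision_mode fp8-delayed-te \
--micro_batch_size 1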
Environment
system.device_product_name DGXH100
system.gpu_driver_version 535.129.03
libraries.cuda 12.5.1.007
libraries.pip.lightning 2.3.0.dev20240428
libraries.pip.lightning-thunder 0.2.0.dev0
libraries.pip.lightning-utilities 0.11.3.post0
libraries.pip.litgpt 0.3.1
libraries.pip.nvfuser 0.2.6+gitbad998a
libraries.pip.pytorch-lightning 2.3.2
libraries.pip.torch 2.5.0a0+git57d05f2
libraries.pip.torchmetrics 1.4.0.post0
libraries.pip.torchvision 0.19.0a0+d23a6e1
Additional context
I'm attaching the full output as a file: output_to_report.txt