mpatel31415 opened 3 months ago
I reproduced the issue manually on a cluster - here you can find full logs: slurm-930652.txt
@vedaanta it sounds like we don't have enough memory to run these models, so we can't expect this to work.
But we shouldn't be crashing, and for some reason it seems we only crash when the cuDNN executor is enabled. Could you follow up here?
Hi all! I wrote recently that the issue was fixed, but I had checked that only for one model (Gemma-7b). The error is still present (checked on INTERNAL_IMAGE:pjnl-20240830 for Mistral-7B-v0.2, longchat-13b-16k and vicuna-7b-v1.5-16k):
Here is the newest reproduction information:
Please use: 2 nodes, each with 8 GPUs. Image "INTERNAL_IMAGE:pjnl-20240830"
Training script:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name Mistral-7B-v0.2 \
--distributed_mode fsdp \
--shard_mode zero2 \
--compile thunder \
--checkpoint_activations True \
--low_precision_mode fp8-delayed-te \
--micro_batch_size 1
Environment:
system.device_product_name DGXH100
system.gpu_driver_version 535.129.03
libraries.cuda 12.6.1.005
libraries.pip.lightning 2.4.0.dev20240728
libraries.pip.lightning-thunder 0.2.0.dev0
libraries.pip.lightning-utilities 0.11.6
libraries.pip.litgpt 0.4.11
libraries.pip.nvfuser 0.2.10+git58dfdc1
libraries.pip.pytorch-lightning 2.4.0
libraries.pip.torch 2.5.0a0+git578b8d7
libraries.pip.torchmetrics 1.4.1
libraries.pip.torchvision 0.19.0a0+d23a6e1
In the most recent run the issue was present in only 3 cases across 2 models ('CodeLlama-34b-hf', 'falcon-40b'). I rechecked 2 of those cases (one per model) with the newest image and the issue was no longer present, so hopefully it will be gone in the next run.
Hi! This issue was recently present in 7 cases, all of them using fp8. Below are reproduction instructions:
Please use:
1 node(s), each with 8 GPUs.
Image "INTERNAL_IMAGE:pjnl-20241011"
Training script:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name stablecode-completion-alpha-3b \
--distributed_mode fsdp \
--shard_mode zero3 \
--compile thunder \
--checkpoint_activations False \
--low_precision_mode fp8-delayed-te \
--micro_batch_size 1
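Since every failing case uses fp8, one quick way to isolate it is to rerun the same command with fp8 disabled. This is only a sketch: the value used to turn fp8 off here (--low_precision_mode none) is an assumption and should be confirmed against the script's --help output.

# Same model and sharding as above, fp8 disabled.
# "none" is an assumed value for --low_precision_mode; confirm with --help.
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name stablecode-completion-alpha-3b \
--distributed_mode fsdp \
--shard_mode zero3 \
--compile thunder \
--checkpoint_activations False \
--low_precision_mode none \
--micro_batch_size 1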
@vedaanta this is now the third most important error in terms of killing mixology runs, preventing us from understanding perf for those models. Can this be prioritized in the near term?
Thanks for the bumps, @mpatel31415 and @tfogal. tl;dr: it does not seem to be a cuDNN issue. Can we get more data points that would point to cuDNN being the culprit?
There are two images I am running with: the stock pjnl-20241011 image, and a variant built on top of it with the cuDNN frontend removed:
FROM INTERNAL:pjnl-20241011
RUN pip uninstall -y nvidia-cudnn-frontend
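As a quick sanity check inside the stripped container (not part of the original repro, just a sketch; the Python module name "cudnn" for the nvidia-cudnn-frontend package is an assumption worth verifying):

# Verify the frontend package is really absent in the second image.
pip show nvidia-cudnn-frontend || echo "nvidia-cudnn-frontend not installed"
python -c "import cudnn" 2>/dev/null && echo "cudnn frontend importable" || echo "cudnn frontend not importable"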
I tried out multiple models with both setups, in single-node and 2-node runs.
With 2 node runs:
- I can reproduce the segmentation fault with pjnl-20241011, with and without cudnn.
- It happens with executors other than thunder_cudnn too
- cudnn is a default executor in thunder now, so specifying thunder_cudnn as done in this bug description should be redundant.
- When dumping cudnn logs without cudnn present in the container, I get nothing (see the logging sketch after this list). This confirms that no other executor (e.g. PyTorch native) is using cudnn internally.
- When using the cudnn executor, I can see that only the sdpa operation is claimed by cudnn.
- The logs seem fine. Nothing that would point to cudnn execution stopping suddenly.
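For reference, one way such logs can be dumped (a sketch, not necessarily the exact commands used above; CUDNN_LOGINFO_DBG/CUDNN_LOGDEST_DBG are documented cuDNN backend logging variables, while the CUDNN_FRONTEND_* names are assumed from the cudnn-frontend docs):

# Enable cuDNN backend API logging (variable names may differ between cuDNN versions).
export CUDNN_LOGINFO_DBG=1
export CUDNN_LOGDEST_DBG=cudnn_backend.log
# Enable cudnn-frontend logging (assumed variable names; verify against the frontend README).
export CUDNN_FRONTEND_LOG_INFO=1
export CUDNN_FRONTEND_LOG_FILE=cudnn_frontend.log
# Then run the same benchmark command as in the repro; empty logs mean cuDNN was never called.
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py ...  # same arguments as above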
With 1 node run:
- I was never able to reproduce the seg faults
- I have tried various combinations of model names, batch sizes
- My jobs either run into cuda OOM or finish successfully.
(This single-node attempt is not so important anyway, since the 2-node repro works reliably.)
Thanks; looks like I misread back in July. This is an fp8 thing not a cudnn thing. Reassigning.
For the most recent set of issues I used this script to reproduce the error:
#!/bin/bash
#SBATCH -A YOUR_ACCOUNT
#SBATCH -p batch
#SBATCH -J YOUR_JOB_NAME
#SBATCH -N 2
#SBATCH --ntasks-per-node 8
#SBATCH --time 0:29:00
#SBATCH --mail-type=FAIL
#SBATCH --exclusive
IMAGE="INTERNAL_IMAGE:pjnl-20241011"
TRAINING_SCRIPT=$(cat << 'EOF'
set -e
NVFUSER_DISABLE=multidevice python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--max_iters 20 \
--warmup_iters 5 \
--model_name tiny-llama-1.1b \
--distributed_mode ddp \
--shard_mode None \
--compile dynamo_thunder \
--checkpoint_activations False \
--low_precision_mode fp8-delayed-te \
--micro_batch_size 20
EOF
)
srun --export=ALL --container-image=${IMAGE} bash -c "${TRAINING_SCRIPT}"
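Assuming the script above is saved as, say, repro_fp8_segfault.sbatch (the file name is arbitrary), it can be submitted with:

sbatch repro_fp8_segfault.sbatch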
🐛 Bug
For a few models (Platypus-30B with FSDP zero3, Gemma-7b with DDP, and vicuna-33b-v1.3 with FSDP zero3) we get a segmentation fault when trying to use fp8 with thunder_cudnn. When using thunder_cudnn with bf16 only, we get an OOM error instead (tested on Gemma-7b).
To Reproduce
Please use: 2 nodes, each with 8 GPUs. Image: "INTERNAL_IMAGE:pjnl-20240705"
Training script:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name Gemma-7b \
--distributed_mode ddp \
--shard_mode None \
--compile thunder_cudnn \
--checkpoint_activations False \
--low_precision_mode fp8-delayed-te \
--micro_batch_size 1
Expected behavior
We should not get a segmentation fault; at worst we should get an OOM error. This issue might be purely related to FP8, but because of the crash I'm not able to verify that in eager mode.
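One way to get more than a bare segmentation fault out of such a run (a sketch, not something from this report) is to enable Python's faulthandler and core dumps, so the crash at least leaves a Python-level traceback and a core file:

ulimit -c unlimited            # allow a core dump to be written
export CUDA_LAUNCH_BLOCKING=1  # surface asynchronous CUDA errors at the failing call
python -X faulthandler /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name Gemma-7b \
--distributed_mode ddp \
--shard_mode None \
--compile thunder_cudnn \
--checkpoint_activations False \
--low_precision_mode fp8-delayed-te \
--micro_batch_size 1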
Environment
system.device_product_name DGXH100
system.gpu_driver_version 535.129.03
libraries.cuda 12.5.1.007
libraries.pip.lightning 2.3.0.dev20240428
libraries.pip.lightning-thunder 0.2.0.dev0
libraries.pip.lightning-utilities 0.11.3.post0
libraries.pip.litgpt 0.3.1
libraries.pip.nvfuser 0.2.6+gitbad998a
libraries.pip.pytorch-lightning 2.3.2
libraries.pip.torch 2.5.0a0+git57d05f2
libraries.pip.torchmetrics 1.4.0.post0
libraries.pip.torchvision 0.19.0a0+d23a6e1
Additional context
I'm attaching the full output as a file: output_to_report.txt