0x6b64 opened 1 year ago
When you train models, do you get very verbose statements about the optimizations being performed, like the following? If so, how are you tackling it?
[2023-03-17 10:18:15,712] torch._inductor.compile_fx: [INFO] Step 3: torchinductor done compiling FORWARDS graph 17
[2023-03-17 10:18:15,712] torch._dynamo.output_graph: [INFO] Step 2: done compiler function debug_wrapper
[2023-03-17 10:18:15,887] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing forward
[2023-03-17 10:18:15,953] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo done tracing forward (RETURN_VALUE)
[2023-03-17 10:18:15,954] torch._dynamo.output_graph: [INFO] Step 2: calling compiler function debug_wrapper
[2023-03-17 10:18:16,279] torch._inductor.compile_fx: [INFO] Step 3: torchinductor compiling FORWARDS graph 18
[2023-03-17 10:18:16,292] torch._inductor.graph: [INFO] Using FallbackKernel: torch.ops.aten._scaled_dot_product_flash_attention.default
[2023-03-17 10:18:16,293] torch._inductor.utils: [INFO] using triton random, expect difference from eager
[2023-03-17 10:18:16,398] torch._inductor.compile_fx: [INFO] Step 3: torchinductor done compiling FORWARDS graph 18
[2023-03-17 10:18:16,399] torch._dynamo.output_graph: [INFO] Step 2: done compiler function debug_wrapper
[2023-03-17 10:18:16,580] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing <graph break in forward>
[2023-03-17 10:18:16,620] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing forward
[2023-03-17 10:18:16,657] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing extract_features
[2023-03-17 10:18:16,694] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing extract_features_scriptable
[2023-03-17 10:18:16,722] torch._dynamo.output_graph: [INFO] Step 2: calling compiler function debug_wrapper
[2023-03-17 10:18:16,817] torch._inductor.compile_fx: [INFO] Step 3: torchinductor compiling FORWARDS graph 19
[2023-03-17 10:18:16,820] torch._inductor.graph: [INFO] Using FallbackKernel: aten.cumsum
[2023-03-17 10:18:16,831] torch._inductor.utils: [INFO] using triton random, expect difference from eager
[2023-03-17 10:18:16,912] torch._inductor.compile_fx: [INFO] Step 3: torchinductor done compiling FORWARDS graph 19
[2023-03-17 10:18:16,912] torch._dynamo.output_graph: [INFO] Step 2: done compiler function debug_wrapper
[2023-03-17 10:18:17,079] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing <graph break in extract_features_scriptable>
[2023-03-17 10:18:17,946] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo done tracing <graph break in extract_features_scriptable> (RETURN_VALUE)
[2023-03-17 10:18:17,960] torch._dynamo.output_graph: [INFO] Step 2: calling compiler function debug_wrapper
[2023-03-17 10:18:22,078] torch._inductor.compile_fx: [INFO] Step 3: torchinductor compiling FORWARDS graph 20
[2023-03-17 10:18:22,138] torch._inductor.utils: [INFO] using triton random, expect difference from eager
[2023-03-17 10:18:24,039] torch._inductor.compile_fx: [INFO] Step 3: torchinductor done compiling FORWARDS graph 20
[2023-03-17 10:18:24,040] torch._dynamo.output_graph: [INFO] Step 2: done compiler function debug_wrapper
[2023-03-17 10:18:24,350] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing <graph break in forward>
[2023-03-17 10:18:24,353] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo done tracing <graph break in forward> (RETURN_VALUE)
[2023-03-17 10:18:24,354] torch._dynamo.output_graph: [INFO] Step 2: calling compiler function debug_wrapper
[2023-03-17 10:18:24,375] torch._inductor.compile_fx: [INFO] Step 3: torchinductor compiling FORWARDS graph 21
[2023-03-17 10:18:24,416] torch._inductor.compile_fx: [INFO] Step 3: torchinductor done compiling FORWARDS graph 21
[2023-03-17 10:18:24,416] torch._dynamo.output_graph: [INFO] Step 2: done compiler function debug_wrapper
[2023-03-17 10:18:24,633] torch._inductor.compile_fx: [INFO] Step 3: torchinductor compiling BACKWARDS graph 21
[2023-03-17 10:18:24,849] torch._inductor.compile_fx: [INFO] Step 3: torchinductor done compiling BACKWARDS graph 21
[2023-03-17 10:18:24,873] torch._inductor.compile_fx: [INFO] Step 3: torchinductor compiling BACKWARDS graph 20
[2023-03-17 10:18:28,375] torch._inductor.compile_fx: [INFO] Step 3: torchinductor done compiling BACKWARDS graph 20
[2023-03-17 10:18:28,378] torch._inductor.compile_fx: [INFO] Step 3: torchinductor compiling BACKWARDS graph 19
[2023-03-17 10:18:28,735] torch._inductor.compile_fx: [INFO] Step 3: torchinductor done compiling BACKWARDS graph 19
[2023-03-17 10:18:28,739] torch._inductor.compile_fx: [INFO] Step 3: torchinductor compiling BACKWARDS graph 18
[2023-03-17 10:18:28,774] torch._inductor.graph: [INFO] Using FallbackKernel: torch.ops.aten._scaled_dot_product_flash_attention_backward.default
[2023-03-17 10:18:29,456] torch._inductor.compile_fx: [INFO] Step 3: torchinductor done compiling BACKWARDS graph 18
[2023-03-17 10:18:29,459] torch._inductor.compile_fx: [INFO] Step 3: torchinductor compiling BACKWARDS graph 17
[2023-03-17 10:18:29,494] torch._inductor.graph: [INFO] Using FallbackKernel: torch.ops.aten._scaled_dot_product_flash_attention_backward.default
[2023-03-17 10:18:29,993] torch._inductor.compile_fx: [INFO] Step 3: torchinductor done compiling BACKWARDS graph 17
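For anyone hitting the same wall of output: these messages come through the standard Python logging hierarchy (the logger names are visible in each line, e.g. `torch._inductor.compile_fx`), so one way to quiet them is to raise the level on the parent loggers. A minimal sketch:

```python
import logging

# Silence INFO-level compile chatter from TorchDynamo and TorchInductor;
# warnings and errors still come through. The logger names match the
# prefixes in the output above.
for name in ("torch._dynamo", "torch._inductor"):
    logging.getLogger(name).setLevel(logging.WARNING)
```

The 2.0 release also had a `torch._dynamo.config.log_level` setting, which was later superseded by the `TORCH_LOGS` environment variable, so the plain-logging route above is the most version-agnostic.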
🐛 Bug
Hi, I'm training `roberta_large` with DDP, using the `torch.compile` API (introduced in PyTorch 2.0) to wrap the model definition in the trainer. This error doesn't happen without the `torch.compile` wrapper, so most likely this is a bug in the Triton codegen; but given that it only happens with fairseq and not with other models like Hugging Face GPT-2 or BERT-large, it's worth auditing whether fairseq is doing something extraordinary. This is the one-line change I've made to fairseq, in the trainer's `model` property: https://github.com/facebookresearch/fairseq/blob/main/fairseq/trainer.py#L253
I've also filed a ticket on the PyTorch issue tracker: https://github.com/pytorch/pytorch/issues/93378
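Since the log above shows several graph breaks (`<graph break in forward>`, `<graph break in extract_features_scriptable>`), one way to audit what fairseq is doing is `torch._dynamo.explain`. A sketch, with a stand-in module and input (substitute the actual `roberta_large` model and a real batch); note that the call convention changed after the 2.0 release:

```python
import torch
import torch._dynamo as dynamo

# Stand-in module and input; substitute the fairseq roberta_large model
# under test and a representative batch.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
sample_batch = torch.randn(2, 8)

# In the PyTorch 2.0 release, explain(fn, *inputs) traces with Dynamo
# without compiling and returns a summary that includes graph-break
# reasons (later releases changed this to explain(fn)(*inputs), so
# check the installed version).
explanation = dynamo.explain(model, sample_batch)
print(explanation)
```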
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
I'm working with the wikitext dataset: https://huggingface.co/datasets/wikitext/tree/main
Here is the stacktrace:
Code sample
One-line code change to fairseq, linked above; a sketch of its shape follows.
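For readers who don't follow the link, the change is roughly of the following shape. This is a sketch, not the exact diff: `_wrapped_model` mirrors the attribute that fairseq's `Trainer.model` property memoizes, but the surrounding code is elided and the names here are illustrative.

```python
import torch

class Trainer:
    # ... fairseq's trainer, heavily elided ...

    @property
    def model(self):
        if self._wrapped_model is None:
            # (fairseq's usual DDP wrapping happens here)
            self._wrapped_model = self._model
            # The one-line addition: route the wrapped model through the
            # PyTorch 2.0 compiler. The real edit sits at
            # fairseq/trainer.py#L253.
            self._wrapped_model = torch.compile(self._wrapped_model)
        return self._wrapped_model
```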
Expected behavior
Expect training to start.
Environment
- How you installed fairseq (pip, source): self-installed with `pip install --editable ./`
- Python version: 3.9.13
- CUDA version: 11.7
- OS: Ubuntu 20.04.5 LTS (Focal Fossa)
- GPU models and configuration: NVIDIA A100-SXM4-40GB

Additional context