I am running a benchmarking script for the Google T5 Small Text Translation model using both eager and torch.compile modes. However, the compile mode is performing worse than eager mode on a c8g.4xlarge instance (AMI: ami-0d486650b94f4c69b, region: us-east-1), which is unexpected given that compiled mode should typically offer better performance.
import argparse

from transformers import T5Tokenizer, T5Model
import torch
from torch.profiler import profile, record_function, ProfilerActivity
import torch._inductor.config as config

config.cpp.weight_prepack = True
config.freezing = True


def test_inference(mode, num_iter):
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5Model.from_pretrained("t5-small")

    input_ids = tokenizer(
        "Studies have been shown that owning a dog is good for you", return_tensors="pt"
    ).input_ids  # Batch size 1
    decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1

    if mode == 'compile':
        model = torch.compile(model)

    with torch.no_grad():
        for _ in range(50):
            outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

        with profile(activities=[ProfilerActivity.CPU]) as prof:
            with record_function("model_inference"):
                for _ in range(num_iter):
                    outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

    print(prof.key_averages().table(sort_by="self_cpu_time_total"))


def main() -> None:
    global m, args
    parser = argparse.ArgumentParser(__doc__)
    parser.add_argument(
        "-m",
        "--mode",
        choices=["eager", "compile"],
        default="eager",
        help="Which test to run.",
    )
    parser.add_argument(
        "-n",
        "--number",
        type=int,
        default=100,
        help="how many iterations to run.",
    )
    args = parser.parse_args()
    test_inference(args.mode, args.number)


if __name__ == "__main__":
    main()
Based on the blog post "Accelerated PyTorch inference with torch.compile on AWS Graviton processors", I can't reproduce its results:
Self CPU time total: 30.509ms (eager mode)
Self CPU time total: 12.226s (compile mode)
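For reference, here is a minimal wall-clock timing sketch I would use to cross-check the profiler numbers (my assumption: same model, inputs, and 50-iteration warm-up as the script above, but timed with time.perf_counter), so that torch.compile's one-time compilation cost is excluded from both modes:

import time
import torch
from transformers import T5Tokenizer, T5Model

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5Model.from_pretrained("t5-small")
input_ids = tokenizer(
    "Studies have been shown that owning a dog is good for you", return_tensors="pt"
).input_ids
decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids


def bench(m, num_iter=100, warmup=50):
    # Warm-up absorbs one-time costs (torch.compile tracing/codegen, caches)
    # before the timed loop starts.
    with torch.no_grad():
        for _ in range(warmup):
            m(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
        start = time.perf_counter()
        for _ in range(num_iter):
            m(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
    return (time.perf_counter() - start) / num_iter


print(f"eager:   {bench(model) * 1e3:.3f} ms/iter")
print(f"compile: {bench(torch.compile(model)) * 1e3:.3f} ms/iter")

I kept the per-iteration average so the two modes can be compared directly with the numbers above.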