aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Conversion of YOLOv8x torchscript to torch_neuron failing #983

Open Harish-Sundaravel opened 2 months ago

Harish-Sundaravel commented 2 months ago

I’m encountering issues when trying to convert my YOLOv8x model from TorchScript to torch_neuron on Kaggle. Here are the details:

  1. YOLOv8x Model (Single Class):

    • Trained model file: '.pt'
    • Conversion to torchscript: Successful
    • Conversion from torchscript to torch_neuron: Completed in 15 minutes, consuming approximately 19GB of RAM.
  2. YOLOv8x Model (Two Classes):

    • Trained model file: '.pt'
    • Conversion to torchscript: Successful
    • Conversion from torchscript to torch_neuron: Takes about 1.5 hours and causes RAM usage to spike to approximately 205GB, ultimately failing.

Code Used for Conversion:

  1. Converting .pt file to torchscript:

         from ultralytics import YOLO

         model = YOLO('my_yolov8x.pt')
         # Also tried half precision at this point (torch.float32 to torch.float16):
         # model.model.half()
         model.export(format='torchscript', imgsz=1024)  # creates my_yolov8x.torchscript

  2. Converting torchscript to torch_neuron:

         import torch
         import torch_neuron

         model = torch.jit.load("/kaggle/input/my_yolov8x.torchscript")
         model = model.float().eval()

         example_input = torch.rand(1, 3, 1024, 1024)
         neuron_model = torch_neuron.trace(model, example_input)
         neuron_model.save('my_yolov8x.neuron')

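Problem: When converting the model trained with two classes, the trace takes about 1.5 hours, RAM usage climbs to approximately 205GB, and the conversion fails. The single-class model, converted with the same code, finishes in 15 minutes and uses approximately 19GB.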
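To make the time and memory figures easier to reproduce for both models, here is a minimal sketch of a measurement wrapper. It uses only the Python standard library on Linux. The helper name `measure` and the returned dictionary keys are hypothetical, and reporting child-process memory separately rests on the assumption that the compiler may run in a subprocess, which I have not verified.

```python
import resource
import sys
import time


def _peak_gib(who):
    """Peak resident set size for `who`, converted to GiB."""
    peak = resource.getrusage(who).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return peak * scale / 1024 ** 3


def measure(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, stats).

    The peaks are lifetime values for this process and for its
    waited-for children, so run each conversion in a fresh process.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    stats = {
        "seconds": time.perf_counter() - start,
        "peak_self_gib": _peak_gib(resource.RUSAGE_SELF),
        "peak_children_gib": _peak_gib(resource.RUSAGE_CHILDREN),
    }
    return result, stats


# Hypothetical usage with the conversion step from this report:
# neuron_model, stats = measure(torch_neuron.trace, model, example_input)
# print(stats)
```

Running this once per model in a fresh process should give comparable numbers for the single-class and two-class conversions.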

If anyone has insights or solutions to address this issue, your help would be greatly appreciated.

Thank you in advance!