Pytorch 2.4.0 Ultralytics/YOLO - does not work with OpenCL backend

Skillnoob commented 4 weeks ago

I tried to run Ultralytics using the most recent release.

Minimal example to reproduce:

from ultralytics import YOLO
import pytorch_ocl

model = YOLO('yolov8n.pt')

model.val(data='coco8.yaml', batch=1)

This line needs to be modified to return torch.device('ocl:0'), otherwise Ultralytics will complain about passing a wrong device or only run on the CPU.

My GPU: Radeon RX 7900 GRE

Full log:

Ultralytics YOLOv8.2.74 🚀 Python-3.11.9 torch-2.4.0+cpu CPU (AMD Ryzen 7 7800X3D 8-Core Processor)
Accessing device #0:gfx1100 on AMD Accelerated Parallel Processing
C:\Users\Skillnoob_\AppData\Roaming\Python\Python311\site-packages\ultralytics\utils\torch_utils.py:245: UserWarning: The operator 'aten::mm.out' is not currently supported on the ocl backend. Please open an issue at for requesting support https://github.com/artyom-beilis/pytorch_dlprim/issues (Triggered internally at C:\Users\artik\Projects\build_env\pytorch_dlprim\src\tensor_ops.cpp:336.)
  fusedconv.weight.copy_(torch.mm(w_bn, w_conv).view(fusedconv.weight.shape))
C:\Users\Skillnoob_\AppData\Roaming\Python\Python311\site-packages\ultralytics\utils\torch_utils.py:250: UserWarning: The operator 'aten::mm.out' is not currently supported on the ocl backend. Please open an issue at for requesting support https://github.com/artyom-beilis/pytorch_dlprim/issues (Triggered internally at C:\Users\artik\Projects\build_env\pytorch_dlprim\src\tensor_ops.cpp:336.)
  fusedconv.bias.copy_(torch.mm(w_bn, b_conv.reshape(-1, 1)).reshape(-1) + b_bn)

Process finished with exit code -1073741819 (0xC0000005)
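
For reference, the exit code in the log is the signed 32-bit rendering of the value shown in parentheses. 0xC0000005 is the Windows access-violation status, which suggests the process crashed in native code rather than raising a Python exception. A minimal sketch of the conversion (the helper name `to_unsigned32` is hypothetical):

```python
def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


code = -1073741819  # exit code reported in the log above
print(hex(to_unsigned32(code)))  # 0xc0000005
```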

Skillnoob commented 4 weeks ago

After modifying the code to this:

from ultralytics import YOLO
import pytorch_ocl

def fuse(self, *args, **kwargs):
    return self

model = YOLO('yolov8n.pt')

model.model.fuse = fuse.__get__(model.model, type(model.model))

model.val(data='coco8.yaml', batch=1)
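
The override in this snippet relies on the descriptor protocol: calling `__get__` on a plain function binds it to one instance, so only that object's `fuse` is replaced and the class itself is left untouched. A minimal sketch with a stand-in class (`Net` and `skip_fuse` are hypothetical names, not Ultralytics code):

```python
class Net:
    """Stand-in for a model class that has a fuse() method."""

    def fuse(self):
        return "fused"


def skip_fuse(self, *args, **kwargs):
    # No-op replacement: accept any arguments and return the object unchanged.
    return self


net = Net()
# Bind the function to this one instance; other Net objects keep the original.
net.fuse = skip_fuse.__get__(net, type(net))

result = net.fuse(verbose=False)
```

Here `result` is the same object as `net`, while a fresh `Net()` still uses the class-level `fuse`.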

I get the following error log:


Ultralytics YOLOv8.2.74 🚀 Python-3.11.9 torch-2.4.0+cpu CPU (AMD Ryzen 7 7800X3D 8-Core Processor)
Accessing device #0:gfx1100 on AMD Accelerated Parallel Processing

Dataset 'coco8.yaml' images not found ⚠️, missing path 'C:\Users\makei\Desktop\fdyxv\datasets\coco8\images\val'
Downloading https://ultralytics.com/assets/coco8.zip to 'C:\Users\makei\Desktop\fdyxv\datasets\coco8.zip'...
100%|██████████| 433k/433k [00:00<00:00, 4.57MB/s]
Unzipping C:\Users\makei\Desktop\fdyxv\datasets\coco8.zip to C:\Users\makei\Desktop\fdyxv\datasets\coco8...: 100%|██████████| 25/25 [00:00<00:00, 3124.48file/s]
val: Scanning C:\Users\makei\Desktop\fdyxv\datasets\coco8\labels\val... 4 images, 0 backgrounds, 0 corrupt: 100%|██████████| 4/4 [00:00<00:00, 444.34it/s]
Dataset download success ✅ (1.3s), saved to C:\Users\makei\Desktop\fdyxv\datasets

val: New cache created: C:\Users\makei\Desktop\fdyxv\datasets\coco8\labels\val.cache
C:\Users\makei\miniconda3\envs\ultralytics\Lib\site-packages\torch\nn\functional.py:796: UserWarning: The operator 'aten::max_pool2d_with_indices.out' is not currently supported on the ocl backend. Please open an issue at for requesting support https://github.com/artyom-beilis/pytorch_dlprim/issues (Triggered internally at C:\Users\artik\Projects\build_env\pytorch_dlprim\src\tensor_ops.cpp:336.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
C:\Users\makei\miniconda3\envs\ultralytics\Lib\site-packages\torch\nn\functional.py:4050: UserWarning: The operator 'aten::upsample_nearest2d.out' is not currently supported on the ocl backend. Please open an issue at for requesting support https://github.com/artyom-beilis/pytorch_dlprim/issues (Triggered internally at C:\Users\artik\Projects\build_env\pytorch_dlprim\src\tensor_ops.cpp:336.)
  return torch._C._nn.upsample_nearest2d(input, output_size, scale_factors)
C:\Users\makei\AppData\Roaming\Python\Python311\site-packages\ultralytics\utils\tal.py:303: UserWarning: The operator 'aten::arange.start_out' is not currently supported on the ocl backend. Please open an issue at for requesting support https://github.com/artyom-beilis/pytorch_dlprim/issues (Triggered internally at C:\Users\artik\Projects\build_env\pytorch_dlprim\src\tensor_ops.cpp:336.)
  sx = torch.arange(end=w, device=device, dtype=dtype) + grid_cell_offset  # shift x
Traceback (most recent call last):
  File "C:\Users\makei\Desktop\opencl testing\main.py", line 17, in <module>
    main()
  File "C:\Users\makei\Desktop\opencl testing\main.py", line 13, in main
    model.val(data='coco8.yaml', batch=1)
  File "C:\Users\makei\AppData\Roaming\Python\Python311\site-packages\ultralytics\engine\model.py", line 644, in val
    validator(model=self.model)
  File "C:\Users\makei\miniconda3\envs\ultralytics\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\makei\AppData\Roaming\Python\Python311\site-packages\ultralytics\engine\validator.py", line 157, in __call__
    model.warmup(imgsz=(1 if pt else self.args.batch, 3, imgsz, imgsz))  # warmup
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\makei\AppData\Roaming\Python\Python311\site-packages\ultralytics\nn\autobackend.py", line 639, in warmup
    self.forward(im)  # warmup
    ^^^^^^^^^^^^^^^^
  File "C:\Users\makei\AppData\Roaming\Python\Python311\site-packages\ultralytics\nn\autobackend.py", line 456, in forward
    y = self.model(im, augment=augment, visualize=visualize, embed=embed)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\makei\miniconda3\envs\ultralytics\Lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\makei\miniconda3\envs\ultralytics\Lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\makei\AppData\Roaming\Python\Python311\site-packages\ultralytics\nn\tasks.py", line 102, in forward
    return self.predict(x, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\makei\AppData\Roaming\Python\Python311\site-packages\ultralytics\nn\tasks.py", line 120, in predict
    return self._predict_once(x, profile, visualize, embed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\makei\AppData\Roaming\Python\Python311\site-packages\ultralytics\nn\tasks.py", line 141, in _predict_once
    x = m(x)  # run
        ^^^^
  File "C:\Users\makei\miniconda3\envs\ultralytics\Lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\makei\miniconda3\envs\ultralytics\Lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\makei\AppData\Roaming\Python\Python311\site-packages\ultralytics\nn\modules\head.py", line 60, in forward
    y = self._inference(x)
        ^^^^^^^^^^^^^^^^^^
  File "C:\Users\makei\AppData\Roaming\Python\Python311\site-packages\ultralytics\nn\modules\head.py", line 93, in _inference
    self.anchors, self.strides = (x.transpose(0, 1) for x in make_anchors(x, self.stride, 0.5))
                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\makei\AppData\Roaming\Python\Python311\site-packages\ultralytics\utils\tal.py", line 303, in make_anchors
    sx = torch.arange(end=w, device=device, dtype=dtype) + grid_cell_offset  # shift x
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Buffer is not valid for unallocated defvice

artyom-beilis commented 4 weeks ago

I tried to run Ultralytics using the most recent release.

OK, YOLO was one of the next things on my to-do list: validating that it works.

The operator 'aten::mm.out' is not currently supported on the ocl backend.

OK, this is going to be easy to fix. I'm surprised that there is yet another GEMM operator.

aten::max_pool2d_with_indices.out

This one is a little more complicated: my internal implementation does not use indices, but I assume it shouldn't be complex.

 File "C:\Users\makei\AppData\Roaming\Python\Python311\site-packages\ultralytics\utils\tal.py", line 303, in make_anchors
    sx = torch.arange(end=w, device=device, dtype=dtype) + grid_cell_offset  # shift x
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Buffer is not valid for unallocated defvice

Ok... This is something I need to check.

artyom-beilis commented 3 weeks ago

Just to update... Validating YOLO is important but hard.

I have now discovered that you can torch.cat tensors of different types, and the result type is not really documented... I didn't expect to concatenate long and float tensors.
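
As an illustration of the mixed-type concatenation described here: on the stock CPU backend, `torch.cat` appears to follow PyTorch's general type-promotion rules, so a long tensor and a float tensor yield a float result. This is a sketch under that assumption (and assuming the default float dtype is float32), not a statement about the ocl backend:

```python
import torch

ints = torch.tensor([1, 2, 3])     # dtype torch.int64 (long)
floats = torch.tensor([0.5, 1.5])  # dtype torch.float32 by default

# Concatenating long and float tensors promotes to the wider common type.
mixed = torch.cat([ints, floats])

# promote_types reports the dtype the promotion rules give for the pair.
promoted = torch.promote_types(ints.dtype, floats.dtype)
print(mixed.dtype, promoted)
```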

Many other things that generated failure have been fixed but there is still a lot to do...

artyom-beilis commented 3 weeks ago

OK... it is an interesting challenge.

I finally sorted out the issue with serialization and deserialization (torch.load/torch.save) by adding the missing allocator registration, so the fuse workaround is no longer needed.
Fixed some issues with the view operator, cat, resize_, and some other trivial but annoying things.
I managed to run training and validation, and it looks OK.

To make it run I needed to fix a few small things in their code (they ignore half = False, and some other stuff), but in general I managed to complete the run (see the fixes in the diff below).

BUT lots of operators are falling back to the CPU... (see the list below). Some are easy and can be implemented with simple broadcast/reduce/pointwise operators, but some are a little trickier, and for some I don't even have an idea what they do.

Here is the list

Who wants to give a hand implementing them?

'aten::addmv.out'
'aten::all.all_out'
'aten::amax.out'
'aten::amin.out'
'aten::atan.out'
'aten::bitwise_not.out'
'aten::bitwise_or.Tensor_out'
'aten::gt.Tensor_out'
'aten::_index_put_impl_'
'aten::index.Tensor_out'
'aten::le.Tensor_out'
'aten::linalg_vector_norm.out'
'aten::log_sigmoid_forward'
'aten::lt.Tensor_out'
'aten::masked_fill_.Scalar'
'aten::max.dim_max'
'aten::maximum.out'
'aten::max_pool2d_with_indices.out'
'aten::minimum.out'
'aten::mm.out'
'aten::nonzero'
'aten::pow.Tensor_Scalar_out'
'aten::prod.int_out'
'aten::scatter_add.out'
'aten::scatter.value_out'
'aten::sort.values_stable'
'aten::topk.values'
'aten::unfold'
'aten::_unique2'
'aten::upsample_nearest2d_backward.grad_input'
'aten::upsample_nearest2d.out'
'aten::where.self'
'torchvision::nms'
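
A list like this can be collected from a run's output by matching the fallback warning text seen earlier in the thread. A minimal sketch, assuming the warning wording stays as it appears in the logs; the function name and the abbreviated sample log are illustrative only:

```python
import re

# Matches the fallback warning wording seen in the logs in this thread.
UNSUPPORTED_OP = re.compile(
    r"The operator '([^']+)' is not currently supported on the ocl backend"
)


def unsupported_ops(log_text: str) -> list[str]:
    """Return the unique operator names reported as unsupported, sorted."""
    return sorted(set(UNSUPPORTED_OP.findall(log_text)))


# Abbreviated warning lines in the same shape as the logs above.
sample_log = """
UserWarning: The operator 'aten::mm.out' is not currently supported on the ocl backend.
UserWarning: The operator 'aten::mm.out' is not currently supported on the ocl backend.
UserWarning: The operator 'aten::max_pool2d_with_indices.out' is not currently supported on the ocl backend.
"""

print(unsupported_ops(sample_log))
```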

These are the changes in the Ultralytics code:

--- venv/pt_rocm/lib/python3.10/site-packages/ultralytics/utils/torch_utils.py     2024-08-18 23:57:25.125804942 +0300
+++ venv/pt_cpu_2.4/lib/python3.10/site-packages/ultralytics/utils/torch_utils.py  2024-08-19 22:16:54.377978652 +0300
@@ -156,7 +156,8 @@
         device = device.replace(remove, "")  # to string, 'cuda:0' -> '0' and '(0, 1)' -> '0,1'
     cpu = device == "cpu"
     mps = device in {"mps", "mps:0"}  # Apple Metal Performance Shaders (MPS)
-    if cpu or mps:
+    ocl = device.find('ocl')==0
+    if cpu or mps or ocl:
         os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # force torch.cuda.is_available() = False
     elif device:  # non-cpu device requested
         if device == "cuda":
--- venv/pt_rocm/lib/python3.10/site-packages/ultralytics/engine/validator.py      2024-08-18 23:57:25.118804644 +0300
+++ venv/pt_cpu_2.4/lib/python3.10/site-packages/ultralytics/engine/validator.py   2024-08-19 22:21:07.850314250 +0300
@@ -112,7 +112,7 @@
         if self.training:
             self.device = trainer.device
             self.data = trainer.data
-            self.args.half = self.device.type != "cpu"  # force FP16 val during training
+            self.args.half = self.args.half and self.device.type != "cpu"  # force FP16 val during training
             model = trainer.ema.ema or trainer.model
             model = model.half() if self.args.half else model.float()
             # self.model = model
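
The two changes in the diff can be read as small predicates. The sketch below restates them outside Ultralytics; the function names are hypothetical, and `startswith` is used as an equivalent of the patch's `device.find('ocl')==0`:

```python
def is_ocl_device(device: str) -> bool:
    # Same result as device.find('ocl') == 0: true only when 'ocl' is a prefix.
    return device.startswith("ocl")


def resolve_half(requested_half: bool, device_type: str) -> bool:
    # Patched behaviour: FP16 only if the user asked for it AND the device is
    # not the CPU. The original line ignored the user's half=False setting.
    return requested_half and device_type != "cpu"


print(is_ocl_device("ocl:0"), resolve_half(False, "ocl"))
```

With the original line, `half` would be forced on for any non-CPU device, including `ocl`, even when `half=False` was requested.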

tangjinchuan commented 3 weeks ago

Not so fast my friend, more to go:

lerp.Scalar_out, native_dropout, gather.out, index_select, upsample_bilinear2d.out

artyom-beilis commented 3 weeks ago

Not so fast my friend, more to go...

Indeed, lots of operators... By the way, mm.out is on its way.

artyom-beilis commented 2 weeks ago

Updates: the following are done: mm, bmm, amax, amin, native_dropout, arange, resize_. Fixes in some other operators: allow softmax/logsoftmax to work on multiple dimensions (performance issue with gelu...

More to go...

Skillnoob commented 2 weeks ago

I've created a fork of Ultralytics here, which adds more proper pytorch_ocl support to Ultralytics. The PR, found here, is still a draft, since support on the pytorch_ocl side is not yet fully finished and the current release is incompatible. Here is how validation would be run now:

from ultralytics import YOLO

model = YOLO('yolov8n.pt')

model.val(data='coco8.yaml', batch=1, device="ocl")

The device can be either ocl, which defaults to ocl:0, or the regular ocl:<device number>.
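
The defaulting rule described here can be sketched as a small normalizer. This is not the fork's code; the function name and the error handling are assumptions made for illustration:

```python
def normalize_ocl_device(device: str) -> str:
    """Map 'ocl' to 'ocl:0' and pass 'ocl:<n>' through (hypothetical helper)."""
    device = device.strip().lower()
    if device == "ocl":
        return "ocl:0"  # bare 'ocl' defaults to the first device
    prefix, sep, index = device.partition(":")
    if prefix == "ocl" and sep and index.isdigit():
        return f"ocl:{int(index)}"
    raise ValueError(f"not an ocl device string: {device!r}")


print(normalize_ocl_device("ocl"), normalize_ocl_device("ocl:1"))
```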

artyom-beilis / pytorch_dlprim

Pytorch 2.4.0 Ultralytics/YOLO - does not work with OpenCL backend #84