Slow performance when using YOLO

joansc commented 4 months ago

Hello!

I have started doing some early tests with this amazing and promising project. However, I'm not sure why I'm getting slow performance when trying to implement YOLO pose tracking. For reference, I'm using the same ExampleReceive.toe and the same TopChopDatIO.tox, with one slight change: the input is a video of people passing by, as you can see here:

Then, I'm using a yolov8n-pose.engine with a resolution of 640x320. Here's the code for the yolo.py script:

import keyboard # optional, used to quit the loop
import numpy as np
import torch
import touchpy as tp
from ultralytics import YOLO

#model = YOLO("yolov8n-pose.pt")  # load an official model
#model.export(format="engine", imgsz=(320,640))
#exit()

# tp.init_logging(level=tp.LogLevel.INFO, console=True, file=True)

torch.cuda.set_device(0)

class ExampleRunComp:
    def __init__(self):
        #self.running = True # used to gracefully exit the loop
        #self.device = torch.device('cuda')
        self.model = YOLO('yolov8n-pose.engine')

    @staticmethod
    def on_layout_change(comp, this):
        print('layout changed:')
        print('in tops:', comp.in_tops.names)
        print('out tops:', comp.out_tops.names)
        print('in chops:', comp.in_chops.names)
        print('out chops:', comp.out_chops.names)
        print('in dats:', comp.in_dats.names)
        print('out dats:', comp.out_dats.names)
        print('pars:', comp.par.names)
        # comp.out_tops[1].set_cuda_flags(tp.CudaFlags.BGRA | tp.CudaFlags.HWC)
        comp.out_tops['topOut2'].set_cuda_flags(tp.CudaFlags.BGR)

        # comp.par['Openwindow'].val = True
        return

    @staticmethod
    def on_frame(comp, this):

        if (keyboard.is_pressed('q') and keyboard.is_pressed('ctrl')):
            comp.stop() # stop running the comp
            return

        webcam_tensor = comp.out_tops['topOut2'].as_tensor() 

        comp.start_next_frame()
        results = this.model(webcam_tensor.unsqueeze(0),stream=True, device=0, max_det=5)
        result = next(results)

        if result is not None:
            annotatedArray = result.plot(boxes=False, labels=False)
            tensor = torch.from_numpy(annotatedArray).cuda()
            comp.in_tops['topIn1'].from_tensor(tensor, flags=tp.CudaFlags.BGR) 

    def runComp(self, tox_path):
        # create a comp object and specify a path to a tox file
        #comp = tp.Comp(tox_path)
        comp = tp.Comp(tox_path, flags=tp.CompFlags.INTERNAL_TIME_AUTO)

        comp.set_on_layout_change_callback(self.on_layout_change, self)
        comp.set_on_frame_callback(self.on_frame, self)

        comp.start() # start the comp, blocks with CompFlags.InternalTimeAuto and CompFlags.InternalTimeSemiAuto

        comp.unload() # should be called to properly unload the comp (especially if Python exits immediately after this)
        pass

# create an instance of a class that runs the comp
example = ExampleRunComp()

# run the comp
example.runComp('TopChopDatIO.tox')

When I run the script it seems everything is working fine:

However, when I check syphonout1 in ExampleReceive, the stream seems slow, as you can see in the next video. In your presentation demo it was running pretty fast, which is why this doesn't make sense to me. Also, judging from the prints on the console, the processing seems fast.

https://github.com/IntentDev/touchpy/assets/17720862/9b2e711c-ddef-408d-8936-de193a70455b

Any idea what it could be? I'm on a Windows 11 PC, using TD 2023.11600 and an RTX 4090.

Thanks in advance,

Joan

UnveilStudio commented 4 months ago

You should avoid the plot function, and use stream=True, stream_buffer=True. This should avoid memory leaks, which is what is causing the slowdown.

Then make an environment with cu12, TRT 10 and ultralytics 8.2, and build a new engine. I run inference at 60 fps on full HD :-)

You can write to me on Discord and I'll share my project. Or just check the named attributes of the ultralytics result object.

joansc commented 4 months ago

Hey, thanks for your response! I do have an environment with torch 2.31, cu121, TRT 10, ultralytics 8.2... I added stream_buffer=True but it's still the same. I know it should run fast because Idzard did the demo live, and he was using the heaviest pose model at 640x640, tracking more than 5 people, and using the plot function.

UnveilStudio commented 4 months ago

Yeah, but the plot function should be avoided. Are you sure YOLO is getting the right device? What happens if you pass device=0?

Simone Franco


keithlostracco commented 4 months ago

I don't see any reason in your script for such bad performance on the Spout receive in TD, other than the API usage error TRT is reporting about the shape being wrong. Also, I wouldn't trust those processing-time numbers coming from YOLO; I would time the whole callback to verify that the issue is happening there.
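
A minimal sketch of timing the whole callback, using only the standard library. The `timed` decorator and its `every` parameter are hypothetical helpers for illustration and are not part of TouchPy or ultralytics. It measures wall-clock time around the wrapped function and prints a running mean.

```python
import time
from functools import wraps


def timed(label, every=60):
    """Wrap a callback and print its mean wall-clock time every `every` calls."""
    def decorator(fn):
        state = {"total": 0.0, "count": 0}

        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # CUDA work is asynchronous: to include GPU time, call
                # torch.cuda.synchronize() at the end of the wrapped function.
                state["total"] += time.perf_counter() - start
                state["count"] += 1
                if state["count"] % every == 0:
                    mean_ms = 1000.0 * state["total"] / state["count"]
                    print(f"{label}: {mean_ms:.2f} ms mean over {state['count']} calls")

        wrapper.stats = state  # exposed so the numbers can be inspected later
        return wrapper
    return decorator
```

In the script above this could be applied by placing `@timed("on_frame")` directly under `@staticmethod` on `on_frame`, so the measurement covers the tensor fetch, inference, plot and upload together rather than only what YOLO reports.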

joansc commented 4 months ago

So if I time only the plot function, I get around 24 ms of processing. So I guess, as @UnveilStudio said, it should be avoided. What I don't get is why you didn't have this problem in your demos, @keithlostracco, as you were using a bigger resolution, tracking more people, and using the heaviest model. I'm on a PC tower with an i9-14 and a 4090, so I would rule out the PC specs as the cause.
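
To put the 24 ms figure in context, here is a small arithmetic sketch. It assumes the plot call runs serially inside `on_frame`, as it does in the script above; the helper names are made up for illustration.

```python
def frame_budget_ms(fps):
    """Time available per frame at a target frame rate."""
    return 1000.0 / fps


def max_fps(stage_ms):
    """Upper bound on frame rate if one serial stage takes `stage_ms`."""
    return 1000.0 / stage_ms


plot_ms = 24.0  # measured time of the plot call reported above

print(f"60 fps budget: {frame_budget_ms(60):.1f} ms")
print(f"ceiling with a {plot_ms:.0f} ms plot call: {max_fps(plot_ms):.1f} fps")
```

So the plot call alone exceeds the roughly 16.7 ms available per frame at 60 fps and caps the callback at about 41.7 fps before inference and copies are counted.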

keithlostracco commented 4 months ago

I don't remember the plot function being that slow, seems strange.

In either case you could avoid it altogether if you just get the joint data out as a numpy array, copy it to an in_chop, and use the data to draw your own joints and skeleton with instancing in the Touch component. It will be way faster than the OpenCV functions and CPU mat that YOLO is using.
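
A rough sketch of the first half of that suggestion: reshaping per-person joint data into a channels-by-samples array. The function name and the channel layout (x, y and optional confidence per person, one sample per joint) are assumptions made for illustration. The TouchPy call that writes an array into an in_chop is not shown in this thread, so it is left out here.

```python
import numpy as np


def keypoints_to_channels(xy, conf=None):
    """Turn (people, joints, 2) keypoints into a (channels, joints) float32 array.

    Channel order per person is x, y, then confidence if `conf` is given.
    """
    xy = np.asarray(xy, dtype=np.float32)
    people, joints, _ = xy.shape
    parts = [xy]
    if conf is not None:
        # conf has shape (people, joints); add a trailing axis to stack it
        parts.append(np.asarray(conf, dtype=np.float32)[..., None])
    data = np.concatenate(parts, axis=-1)  # (people, joints, k)
    k = data.shape[-1]
    # (people, joints, k) -> (people, k, joints) -> (people * k, joints)
    return np.ascontiguousarray(data.transpose(0, 2, 1).reshape(people * k, joints))


# With ultralytics pose results the inputs would typically come from
#   result.keypoints.xy.cpu().numpy()    # (people, 17, 2)
#   result.keypoints.conf.cpu().numpy()  # (people, 17)
```

Only a few hundred floats cross from GPU to CPU per frame this way, instead of a full annotated image going through the CPU and back.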

In this case I don't think there is an issue with TouchPy so I'm going to transfer this thread to a discussion (see tab at the top of the page).
