WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
GNU General Public License v3.0

Memory leak #783

Open awarebayes opened 2 years ago

awarebayes commented 2 years ago

This code contains a memory leak.

I tried watching memory usage with nvidia-smi, and with each pass of the loop the code below consumes more and more memory without ever freeing it.

Just try running keypoint inference on a video.

I am running the following code:

import torch
import cv2
from torchvision import transforms
import numpy as np
from utils.datasets import letterbox
from utils.general import non_max_suppression_kpt
from utils.plots import output_to_keypoint, plot_skeleton_kpts
import gc

# %%
device = torch.device("cuda:0")
weights = torch.load('yolov7-w6-pose.pt', map_location=device)
model = weights['model']
_ = model.float().eval()

if torch.cuda.is_available():
    model.half().to(device)

cap = cv2.VideoCapture('/videos/test_video.mp4')

if not cap.isOpened():
  print("Error opening video stream or file")

# Read until video is completed
while cap.isOpened():
  # Capture frame-by-frame
  ret, image = cap.read()
  if ret:

    image = letterbox(image, 960, stride=64, auto=True)[0]
    image = transforms.ToTensor()(image).unsqueeze(0)  # add batch dimension: 3xHxW -> 1x3xHxW

    if torch.cuda.is_available():
        image = image.half().to(device)   
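    # note: inference here runs with gradients enabled; per the fix suggested later in this thread, that is what makes memory grow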
    output, _ = model(image)
    output = non_max_suppression_kpt(output, 0.25, 0.65, nc=model.yaml['nc'], nkpt=model.yaml['nkpt'], kpt_label=True)

    with torch.no_grad():
        output = output_to_keypoint(output)
    nimg = image[0].permute(1, 2, 0) * 255
    nimg = nimg.cpu().numpy().astype(np.uint8)
    for idx in range(output.shape[0]):
        plot_skeleton_kpts(nimg, output[idx, 7:].T, 3)

    # Display the resulting frame
    cv2.imshow('Frame', nimg)
    gc.collect()

    # Press Q on keyboard to exit
    if cv2.waitKey(25) & 0xFF == ord('q'):
      break

  # Break the loop
  # Break the loop
  else:
    break

cap.release()
cv2.destroyAllWindows()
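
A quick way to confirm the growth from inside the loop (a sketch, assuming a CUDA device; memory_allocated and memory_reserved are standard torch.cuda counters) is to print the allocator stats once per frame:

print(f"allocated: {torch.cuda.memory_allocated(device) / 1e6:.1f} MB, "
      f"reserved: {torch.cuda.memory_reserved(device) / 1e6:.1f} MB")

If the leak is real, both numbers climb frame after frame even though no tensors from previous frames are needed anymore.
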
awarebayes commented 2 years ago

Testing it with the master branch.

austinulfers commented 2 years ago

Any news on this? Wondering if this has to do with the CUDA memory errors people have been seeing; https://github.com/WongKinYiu/yolov7/issues/865 is one example.

StefanCiobanu1989 commented 2 years ago

Change:

# Read until video is completed
while cap.isOpened():
  # Capture frame-by-frame

to

# Read until video is completed
with torch.no_grad():
    while cap.isOpened():
        # Capture frame-by-frame

and try removing the "with torch.no_grad():" from

with torch.no_grad():
    output = output_to_keypoint(output)

I used to have my 2080 Ti's memory usage maxed out, and now it doesn't go above 4 GB while inferring. Hope this helps you.
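
Put together, the loop structure becomes (a condensed sketch of the change described above):

with torch.no_grad():
    while cap.isOpened():
        ret, image = cap.read()
        if not ret:
            break
        # ...preprocess, model(image), non_max_suppression_kpt, and plotting exactly as before...

With the whole loop inside a single no_grad context, no autograd graph is built for any frame, so nothing accumulates between frames.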

The code that leads to the leak can be found in general.py line 628.

def non_max_suppression(prediction, conf_thres=0.1, iou_thres=0.45, classes=None, agnostic=False, multi_label=False,
                        labels=()):
    """Runs Non-Maximum Suppression (NMS) on inference results

    Returns:
         list of detections, on (n,6) tensor per image [xyxy, conf, cls]
    """

    nc = prediction.shape[2] - 5  # number of classes
    xc = prediction[..., 4] > conf_thres  # candidates

    # Settings
    min_wh, max_wh = 2, 4096  # (pixels) minimum and maximum box width and height
    max_det = 300  # maximum number of detections per image
    max_nms = 30000  # maximum number of boxes into torchvision.ops.nms()
    time_limit = 10.0  # seconds to quit after
    redundant = True  # require redundant detections
    multi_label &= nc > 1  # multiple labels per box (adds 0.5ms/img)
    merge = False  # use merge-NMS

    t = time.time()

    # The line below leads to the memory leak
    output = [torch.zeros((0, 6), device=prediction.device)] * prediction.shape[0]
    for xi, x in enumerate(prediction):  # image index, image inference
        # Apply constraints
        # x[((x[..., 2:4] < min_wh) | (x[..., 2:4] > max_wh)).any(1), 4] = 0  # width-height
        x = x[xc[xi]]  # confidence

        # Cat apriori labels if autolabelling
        if labels and len(labels[xi]):
            l = labels[xi]
            v = torch.zeros((len(l), nc + 5), device=x.device)
            v[:, :4] = l[:, 1:5]  # box
            v[:, 4] = 1.0  # conf
            v[range(len(l)), l[:, 0].long() + 5] = 1.0  # cls
            x = torch.cat((x, v), 0)

        # If none remain process next image
        if not x.shape[0]:
            continue

        # Compute conf
        if nc == 1:
            x[:, 5:] = x[:, 4:5]  # for models with one class, cls_loss is 0 and cls_conf is always 0.5,
                                  # so there is no need to multiply
        else:
            x[:, 5:] *= x[:, 4:5]  # conf = obj_conf * cls_conf

        # Box (center x, center y, width, height) to (x1, y1, x2, y2)
        box = xywh2xyxy(x[:, :4])

        # Detections matrix nx6 (xyxy, conf, cls)
        if multi_label:
            i, j = (x[:, 5:] > conf_thres).nonzero(as_tuple=False).T
            x = torch.cat((box[i], x[i, j + 5, None], j[:, None].float()), 1)
        else:  # best class only
            conf, j = x[:, 5:].max(1, keepdim=True)
            x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > conf_thres]

        # Filter by class
        if classes is not None:
            x = x[(x[:, 5:6] == torch.tensor(classes, device=x.device)).any(1)]

        # Apply finite constraint
        # if not torch.isfinite(x).all():
        #     x = x[torch.isfinite(x).all(1)]

        # Check shape
        n = x.shape[0]  # number of boxes
        if not n:  # no boxes
            continue
        elif n > max_nms:  # excess boxes
            x = x[x[:, 4].argsort(descending=True)[:max_nms]]  # sort by confidence

        # Batched NMS
        c = x[:, 5:6] * (0 if agnostic else max_wh)  # classes
        boxes, scores = x[:, :4] + c, x[:, 4]  # boxes (offset by class), scores
        i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
        if i.shape[0] > max_det:  # limit detections
            i = i[:max_det]
        if merge and (1 < n < 3E3):  # Merge NMS (boxes merged using weighted mean)
            # update boxes as boxes(i,4) = weights(i,n) * boxes(n,4)
            iou = box_iou(boxes[i], boxes) > iou_thres  # iou matrix
            weights = iou * scores[None]  # box weights
            x[i, :4] = torch.mm(weights, x[:, :4]).float() / weights.sum(1, keepdim=True)  # merged boxes
            if redundant:
                i = i[iou.sum(1) > 1]  # require redundancy

        output[xi] = x[i]
        if (time.time() - t) > time_limit:
            print(f'WARNING: NMS time limit {time_limit}s exceeded')
            break  # time limit exceeded

    return output
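
For what it's worth, one plausible mechanism (an assumption, not something confirmed in this thread): when model(image) runs with gradients enabled, every tensor derived from prediction inside this function carries a grad_fn, so the autograd graph, and with it the model's intermediate activations, stays referenced for as long as the outputs are alive. A minimal standalone sketch of that behavior:

import torch

lin = torch.nn.Linear(8, 8)
x = torch.randn(1, 8)

y = lin(x)
print(y.grad_fn)  # an AddmmBackward0 node: the graph and its saved buffers stay alive

with torch.no_grad():
    z = lin(x)
print(z.grad_fn)  # None: intermediates can be freed immediately

This would be consistent with the no_grad fix above making the memory growth disappear.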
lhphanto commented 1 year ago

Thanks @StefanCiobanu1989 for the solution! May I know why output = [torch.zeros((0, 6), device=prediction.device)] * prediction.shape[0] leads to a memory leak? Currently, I easily get OOM during training and am curious whether this is related.

TiagoGouvea commented 1 year ago

I'm having memory leaks with 10 images, 640 pixels wide, on a 16 GB M1 computer.

python train.py --weights yolob7.py --data "data/custom.yaml" --workers 4 --batch-size 4 --img 4096 --cfg cfg/training/yolov7.yaml --name yolov7 --hyp data/hyp.scratch.p5.yaml

So it starts to process, creates the init.pt file, and after a few seconds...

[1]    22575 killed     python3 train.py --weights yolob7.py --data "data/custom.yaml" --workers 4  4
/Users/tiagogouvea/anaconda3/envs/py310/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 41 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I saw the solution proposed by @StefanCiobanu1989 but, sorry, I can't figure out how to apply it. I can't find the while cap.isOpened(): code, and I changed the "with torch.no_grad():" part but am still getting the error.

Merwanski commented 1 year ago

The "with torch.no_grad():" statement is used in PyTorch to temporarily disable gradient calculation. This is particularly useful when you're performing inference, and it leads to faster and more memory-efficient computation.

Try the following block of code:

import cv2
import torch

# Load your trained model
model = ...  # Load your PyTorch model

# Set the model to evaluation mode
model.eval()

# Open the video capture
video_path = 'path_to_your_video.mp4'
cap = cv2.VideoCapture(video_path)

with torch.no_grad():
    while cap.isOpened():
        ret, frame = cap.read()

        if not ret:
            break

        # Preprocess the frame if needed        
        # Convert the frame to a tensor (assuming you have a suitable function for this)
        frame_tensor = ...  # Convert the frame to a PyTorch tensor

        # Perform inference using the model
        output = model(frame_tensor)

        # Process the output if needed       
        # Display or save the processed frame     
        # Press 'q' to exit
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

# Release the video capture and close the windows
cap.release()
cv2.destroyAllWindows()
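
For the pose model in this thread specifically, the two placeholders could be filled in with the same steps as the repro at the top of the issue (a sketch; preprocess is a hypothetical helper, and the 960/64 letterbox values simply mirror the original snippet):

from torchvision import transforms
from utils.datasets import letterbox

def preprocess(frame, device, half=True):
    # Letterbox to a stride-aligned size, then HWC uint8 -> 1x3xHxW float in [0, 1]
    img = letterbox(frame, 960, stride=64, auto=True)[0]
    tensor = transforms.ToTensor()(img).unsqueeze(0)
    return (tensor.half() if half else tensor).to(device)

frame_tensor = preprocess(frame, torch.device('cuda:0'))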