facebookresearch / maskrcnn-benchmark

Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch.
MIT License

[Feature request] Make IoU computation more memory efficient #18

Open steve-goley opened 5 years ago

steve-goley commented 5 years ago

❓ Questions and Help

I'm experiencing high GPU memory usage. I made my own COCO-format dataset and started training with three separate models: e2e_faster_rcnn_R_50_FPN_1x.yaml, e2e_faster_rcnn_R_101_FPN_1x.yaml, and e2e_faster_rcnn_X_101_32x8d_FPN_1x.yaml. I changed the number of GPUs to 1 and ran the single-GPU training command.

I get well into training, hundreds or thousands of iterations in, and then receive a CUDA OOM error. The reported memory usage is around 7GB, though nvidia-smi reports about 9.7GB for the ResNeXt model.

I'm running on a 1080Ti with 11GB of memory, so it should be able to handle this load. It seems as though there are periodic peaks in memory usage.

The error message for R_50_FPN looks like this:

2018-10-25 15:19:50,428 maskrcnn_benchmark.trainer INFO: eta: 4:07:54  iter: 2180  loss: 1.3242 (1.3455)  loss_classifier: 0.4725 (0.5510)  loss_box_reg: 0.2073 (0.1726)  loss_objectness: 0.3853 (0.4184)  loss_rpn_box_reg: 0.2073 (0.2034)  time: 0.1660 (0.1694)  data: 0.0022 (0.0023)  lr: 0.001000  max mem: 6457
Traceback (most recent call last):
  File "/home/sgoley/git/etegent/maskrcnn-benchmark/tools/train_net.py", line 170, in <module>
    main()
  File "/home/sgoley/git/etegent/maskrcnn-benchmark/tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "/home/sgoley/git/etegent/maskrcnn-benchmark/tools/train_net.py", line 73, in train
    arguments,
  File "/home/sgoley/git/etegent/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 67, in do_train
    loss_dict = model(images, targets)
  File "/home/sgoley/miniconda3/envs/pytorchv1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sgoley/git/etegent/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/home/sgoley/miniconda3/envs/pytorchv1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sgoley/git/etegent/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 94, in forward
    return self._forward_train(anchors, objectness, rpn_box_regression, targets)
  File "/home/sgoley/git/etegent/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 113, in _forward_train
    anchors, objectness, rpn_box_regression, targets
  File "/home/sgoley/git/etegent/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/loss.py", line 91, in __call__
    labels, regression_targets = self.prepare_targets(anchors, targets)
  File "/home/sgoley/git/etegent/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/loss.py", line 55, in prepare_targets
    anchors_per_image, targets_per_image
  File "/home/sgoley/git/etegent/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/loss.py", line 37, in match_targets_to_anchors
    match_quality_matrix = boxlist_iou(target, anchor)
  File "/home/sgoley/git/etegent/maskrcnn-benchmark/maskrcnn_benchmark/structures/boxlist_ops.py", line 87, in boxlist_iou
    iou = inter / (area1[:, None] + area2 - inter)
RuntimeError: CUDA error: out of memory

Note, I've trained with this dataset on Detectron.pytorch. Any suggestions?

Based on where the error occurs, is it possible that one of my images contains too many targets (potentially hundreds) and the iou calculation blows up?

Steve

fmassa commented 5 years ago

Hi,

There is a difference in how we interpret the IMS_PER_BATCH parameter in our codebase: in Detectron, it's per GPU. In our implementation, it's a global batch size, which gets divided over the number of GPUs that you are using. So in your case, you are probably training with a batch size of 16 on a single GPU.

So to fix your memory issues, you'll need to adapt IMS_PER_BATCH, as well as the number of iterations, the learning-rate schedule, and the base learning rate, following the Detectron scaling rules.

The reason we changed the meaning of IMS_PER_BATCH compared to Detectron was precisely to simplify experimentation: all of the parameters I mentioned are fixed given a global batch size, but they need to be adjusted whenever the global batch size changes, which previously happened every time you changed the number of GPUs.
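
For concreteness, a minimal sketch of that adjustment in Python, assuming the default 1x schedule values (16 images over 8 GPUs, BASE_LR 0.02, MAX_ITER 90000, STEPS (60000, 80000)); the helper is illustrative only, so check your config for the actual base values:

def scale_schedule(new_ims_per_batch,
                   base_ims_per_batch=16,
                   base_lr=0.02,
                   max_iter=90000,
                   steps=(60000, 80000)):
    # Linear scaling rule: divide the LR and multiply the iteration counts
    # by the same factor when the global batch size shrinks.
    factor = base_ims_per_batch / new_ims_per_batch
    return {
        "SOLVER.IMS_PER_BATCH": new_ims_per_batch,
        "SOLVER.BASE_LR": base_lr / factor,
        "SOLVER.MAX_ITER": int(max_iter * factor),
        "SOLVER.STEPS": tuple(int(s * factor) for s in steps),
    }

print(scale_schedule(2))
# {'SOLVER.IMS_PER_BATCH': 2, 'SOLVER.BASE_LR': 0.0025,
#  'SOLVER.MAX_ITER': 720000, 'SOLVER.STEPS': (480000, 640000)}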

Let me know if this is clear

fmassa commented 5 years ago

But that makes me think that it is a good idea to add a note about this in the README. Would you be willing to do it? Thanks!

fmassa commented 5 years ago

@steve-goley I've improved the README in #35 with more details on how to perform experiments with single GPU. Let me know if you still face this problem.

steve-goley commented 5 years ago

Thanks @fmassa for the followup.

I'll keep investigating this. Thanks for the clarification on the IMS_PER_BATCH parameter; I was also confused about it. I believe I changed it to 1 but still ran into the memory error. I'm able to train for a while with stable memory usage, only to hit an OOM error hundreds or thousands of iterations in.

I'm trying, or planning to try, a couple of workarounds.

My problem set has some cases of extremely dense GT boxes, more than 500 in a single image. My hypothesis is that this is causing the issue. Does that make sense?

fmassa commented 5 years ago

Our implementation performs bounding box assignment on the GPU, so having an extremely large number of GT boxes might be one of the reasons.

I did some quick computations: for a batch size of 1 with the default FPN parameters, there are 242991 anchors. This means that for 500 GT boxes, the IoU matrix alone occupies ~460MB of memory, and the intermediate buffers needed to compute it push the total much higher, probably on the order of 4GB.
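
A quick back-of-the-envelope check of those numbers (float32, 4 bytes per element; the buffer count is a rough assumption mirroring the lt/rb/wh intermediates, not an exact accounting):

num_anchors = 242991   # default FPN anchor count for one image
num_gt = 500
bytes_per_float = 4

iou_mb = num_anchors * num_gt * bytes_per_float / 1024 ** 2
print(f"IoU matrix alone: {iou_mb:.0f} MB")   # ~463 MB

# lt, rb and wh are each [N, M, 2], and inter plus the final iou are [N, M],
# so peak usage is roughly (3 * 2 + 2) times the size of an [N, M] float tensor:
peak_gb = (3 * 2 + 2) * num_anchors * num_gt * bytes_per_float / 1024 ** 3
print(f"rough peak: {peak_gb:.1f} GB")        # ~3.6 GB, on the order of 4 GB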

So, in those cases, I think there are a few options:

1. perform this part of the computation on the CPU for the images that have a huge number of GT boxes;
2. write a dedicated kernel for the IoU computation that avoids the large temporary buffers;
3. batch the computation up, so that only a chunk of the IoU matrix is materialized at a time.

I'd start with the first solution, so calling .cpu() everywhere in this part of the code and converting to CUDA just before returning the value. But make sure to get the right CUDA device before converting.

Let me know what you think

steve-goley commented 5 years ago

@fmassa Thanks for your diligence! It sounds like that is indeed my issue. I can't say I completely understand your second alternative. I'll see what speed hit I take from moving the computation to the CPU, or perhaps do so only conditionally. If the hit is too drastic, I'll chunk it up.

fmassa commented 5 years ago

Let me know if you have trouble implementing the batched-up version; I could give you a hand with that.

About point 2, I was mentioning writing a dedicated CUDA kernel for computing the IoU matrix. This would avoid the temporary buffers from the current implementation, and would probably be faster as well.

I'll think about implementing it

steve-goley commented 5 years ago

@fmassa I did some brief debugging and found that about 200 GT boxes used about 1.5GB of GPU RAM, roughly in line with your calculations. I'm now using the CPU for that block conditionally, when M*N is very large (>20,000,000).

Looking at the code, there might be a more memory-efficient Python implementation as well: using in-place operations would save an MxNx2 allocation. That would raise the threshold, but it still has its limits.
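
For illustration, a minimal sketch of that in-place idea (an assumption, not the repo's current code): reuse the rb buffer for wh, so only two [N,M,2] tensors are allocated instead of three.

import torch

def iou_inplace(box1, box2, area1, area2):
    # box1: [N, 4] GT boxes, box2: [M, 4] anchors, areas precomputed as in boxlist_iou
    lt = torch.max(box1[:, None, :2], box2[:, :2])   # [N, M, 2]
    rb = torch.min(box1[:, None, 2:], box2[:, 2:])   # [N, M, 2]
    rb.sub_(lt).add_(1).clamp_(min=0)                # rb now holds wh, no extra buffer
    inter = rb[:, :, 0] * rb[:, :, 1]                # [N, M]
    return inter.div_(area1[:, None] + area2 - inter)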

Sorry, I was slow on the kernel uptake. I thought you meant casting it as a convolutional kernel, which seemed odd. A more memory-efficient kernel would be great for my current use case, e.g. overhead imagery.

There are other workarounds (cropping), so it likely shouldn't be the highest item on your list. I'm using 800x800 images, but junkyards and parking lots can pack in a lot of GT targets.

Feel free to close the issue and maybe open it as an enhancement?

fmassa commented 5 years ago

I've changed the title of the issue, let's keep it open.

I think we can use some in-place operations there, and they will bring some savings, but I'm not sure by how much. Chunking is a reasonable compromise as well, I think.

Goorman commented 5 years ago

The IoU matrix is very often extremely sparse, especially if you immediately remove box matches with IoU below a predefined threshold (which might be 0.05, 0.3, or something else).

Would it be a good idea to return the IoU computation result as a sparse matrix (or at least add that as an option)?
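
For illustration, one rough sketch of that idea: compute the dense IoU in chunks, threshold each chunk, and assemble a sparse COO tensor. The function name, threshold, and chunk size are assumptions, not code from this repo.

import torch

def thresholded_sparse_iou(gt, anchors, thresh=0.05, chunk=64):
    # gt: [N, 4], anchors: [M, 4]; returns a sparse [N, M] IoU matrix
    area2 = (anchors[:, 2] - anchors[:, 0] + 1) * (anchors[:, 3] - anchors[:, 1] + 1)
    indices, values = [], []
    for i in range(0, gt.size(0), chunk):
        g = gt[i:i + chunk]
        lt = torch.max(g[:, None, :2], anchors[:, :2])
        rb = torch.min(g[:, None, 2:], anchors[:, 2:])
        wh = (rb - lt + 1).clamp(min=0)
        inter = wh[:, :, 0] * wh[:, :, 1]
        area1 = (g[:, 2] - g[:, 0] + 1) * (g[:, 3] - g[:, 1] + 1)
        iou = inter / (area1[:, None] + area2 - inter)
        mask = iou >= thresh
        ij = mask.nonzero()          # [k, 2] (row, col) pairs within this chunk
        ij[:, 0] += i                # shift rows to global GT indices
        indices.append(ij.t())
        values.append(iou[mask])
    return torch.sparse_coo_tensor(torch.cat(indices, dim=1), torch.cat(values),
                                   size=(gt.size(0), anchors.size(0)))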

fmassa commented 5 years ago

There is currently limited support for sparse matrices in PyTorch, so it might not be ideal for now. We currently need a max over a dimension for this to work, and that function is not yet supported for sparse tensors.

But once we have better support for sparse reductions in PyTorch (sum is in the works), it might be a good idea to implement optimized C++/CUDA kernels that return sparse matrices. That might be non-trivial to do, though.

zimenglan-sysu-512 commented 5 years ago

hi @fmassa, can you do me a favor and explain how to get the right CUDA device before converting back to cuda() in these lines? thanks.

fmassa commented 5 years ago

You can do

device = bbox1.device
bbox1 = bbox1.cpu()
...
iou = iou.to(device)

zimenglan-sysu-512 commented 5 years ago

thanks @fmassa.

Following your instructions, I changed the code as below:

# implementation from https://github.com/kuangliu/torchcv/blob/master/torchcv/utils/box.py
# with slight modifications
def boxlist_iou(boxlist1, boxlist2):
    """Compute the intersection over union of two set of boxes.
    The box order must be (xmin, ymin, xmax, ymax).

    Arguments:
      box1: (BoxList) bounding boxes, sized [N,4].
      box2: (BoxList) bounding boxes, sized [M,4].

    Returns:
      (tensor) iou, sized [N,M].

    Reference:
      https://github.com/chainer/chainercv/blob/master/chainercv/utils/bbox/bbox_iou.py
    """
    if boxlist1.size != boxlist2.size:
        raise RuntimeError(
                "boxlists should have same image size, got {}, {}".format(boxlist1, boxlist2))

    N = len(boxlist1)
    M = len(boxlist2)

    area1 = boxlist1.area()
    area2 = boxlist2.area()

    box1, box2 = boxlist1.bbox, boxlist2.bbox

    # see https://github.com/facebookresearch/maskrcnn-benchmark/issues/18
    # https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/maskrcnn_benchmark/structures/boxlist_ops.py#L79-L88
    # I'd start with the first solution, so calling .cpu() everywhere 
    # in this part of the code and converting to CUDA just before returning the value. 
    # But make sure to get the right CUDA device before converting.
    # Here fix the number of gt boxes, and use cpu mode to compute IoU, 
    # then convert to gpu mode w.r.t the device
    if N >= 16:  # you can change this threshold
        device = box1.device
        box1 = box1.cpu()    # ground-truth boxes
        box2 = box2.cpu()    # anchors (predictions)
        area1 = area1.cpu()
        area2 = area2.cpu()

        lt = torch.max(box1[:, None, :2], box2[:, :2])  # [N,M,2]
        rb = torch.min(box1[:, None, 2:], box2[:, 2:])  # [N,M,2]

        TO_REMOVE = 1

        wh = (rb - lt + TO_REMOVE).clamp(min=0)  # [N,M,2]
        inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

        iou = inter / (area1[:, None] + area2 - inter)
        iou = iou.to(device)
        return iou

    lt = torch.max(box1[:, None, :2], box2[:, :2])  # [N,M,2]
    rb = torch.min(box1[:, None, 2:], box2[:, 2:])  # [N,M,2]

    TO_REMOVE = 1

    wh = (rb - lt + TO_REMOVE).clamp(min=0)  # [N,M,2]
    inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

    iou = inter / (area1[:, None] + area2 - inter)
    return iou

But when I run the experiment, I hit this problem:

2018-11-14 17:28:05,558 maskrcnn_benchmark.trainer INFO: eta: 18:16:34  iter: 440  loss: 0.5291 (0.6694)  loss_classifier: 0.2842 (0.3785)  loss_box_reg: 0.1833 (0.1970)  loss_objectness: 0.0234 (0.0662)  loss_rpn_box_reg: 0.0200 (0.0277)  time: 0.3697 (0.4072)  data: 0.0049 (0.0071)  lr: 0.009187  max mem: 2813
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 243, in reduce_storage
RuntimeError: unable to open shared memory object </torch_16296_125707304> in read-write mode
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 149, in _serve
    send(conn, destination_pid)
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 176, in send_handle
    with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
  File "/usr/lib/python3.6/socket.py", line 460, in fromfd
    nfd = dup(fd)
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 60, in do_train
    for iteration, (images, targets, _) in enumerate(data_loader, start_iter):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
    idx, batch = self._get_batch()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 610, in _get_batch
    return self.data_queue.get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 204, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 155, in recvfds
    raise EOFError
EOFError

Setting NUM_WORKERS to 0 avoids the problem, but it slows training down.

Do you have any suggestions for solving it?

LaoYang1994 commented 5 years ago

You can do

device = bbox1.device
bbox1 = bbox1.cpu()
...
iou = iou.to(device)

I don't think this is a good choice. I computed IoU in four ways: a NumPy version, a torch (CPU) version, a Cython version, and a GPU version. The GPU version is indeed the fastest, but it costs a lot of memory. The NumPy version is close to the torch version, but both are much slower than the Cython version, so I suggest using the Cython version (see Detectron.pytorch).

fmassa commented 5 years ago

You can also use the @torch.jit.script to save memory, or leverage the custom CUDA kernels from https://github.com/facebookresearch/maskrcnn-benchmark/pull/379

LaoYang1994 commented 5 years ago

You can also use the @torch.jit.script to save memory, or leverage the custom CUDA kernels from #379

Thanks. But I'm not familiar with torch.jit.script. Is directly wrapping the function OK?

fmassa commented 5 years ago

Try something like this instead. Note that you'll need to unwrap the boxlist in a separate function, so that you can pass the Tensors directly to this one:

import math
import torch


@torch.jit.script
def boxes_iou(box1: torch.Tensor, box2: torch.Tensor):
    N = box1.size(0)
    M = box2.size(0)
    b1x1 = box1[:, 0].unsqueeze(1)  # [N,1]
    b1y1 = box1[:, 1].unsqueeze(1)
    b1x2 = box1[:, 2].unsqueeze(1)
    b1y2 = box1[:, 3].unsqueeze(1)
    b2x1 = box2[:, 0].unsqueeze(0)  # [1,M]
    b2y1 = box2[:, 1].unsqueeze(0)
    b2x2 = box2[:, 2].unsqueeze(0)
    b2y2 = box2[:, 3].unsqueeze(0)
    ltx = torch.max(b1x1, b2x1)  # [N,M]
    lty = torch.max(b1y1, b2y1)
    rbx = torch.min(b1x2, b2x2)
    rby = torch.min(b1y2, b2y2)
    TO_REMOVE = 1
    w = (rbx - ltx + TO_REMOVE).clamp(min=0, max=math.inf)  # [N,M]
    h = (rby - lty + TO_REMOVE).clamp(min=0, max=math.inf)  # [N,M]
    inter = w * h  # [N,M]

    area1 = (b1x2 - b1x1 + TO_REMOVE) * (b1y2 - b1y1 + TO_REMOVE)  # [N,1]
    area2 = (b2x2 - b2x1 + TO_REMOVE) * (b2y2 - b2y1 + TO_REMOVE)  # [1,M]
    iou = inter / (area1 + area2 - inter)
    return iou

yxchng commented 5 years ago

@fmassa Why does torch.jit.script save memory? Why is it not used in the master code when it seems like a very good improvement? Is there any downside?

fmassa commented 5 years ago

@yxchng no downsides. It's not in master because it makes things slightly less readable.

It saves memory because it doesn't materialize the intermediate results into large tensors.

ethanweber commented 4 years ago

I was running into the same issue for both this repo and detectron2. I ended up solving it with chunking. Here is some code that I modified:

import torch
from detectron2.structures import Boxes  # imports added so the snippet is self-contained

def pairwise_iou(boxes1: Boxes, boxes2: Boxes) -> torch.Tensor:
    """
    Given two lists of boxes of size N and M,
    compute the IoU (intersection over union)
    between __all__ N x M pairs of boxes.
    The box order must be (xmin, ymin, xmax, ymax).

    Args:
        boxes1,boxes2 (Boxes): two `Boxes`. Contains N & M boxes, respectively.

    Returns:
        Tensor: IoU, sized [N,M].
    """
    area2 = boxes2.area()

    boxes1_tensor, boxes2_tensor = boxes1.tensor, boxes2.tensor

    lt = torch.max(boxes1_tensor[:, None, :2], boxes2_tensor[:, :2])  # [N,M,2]
    rb = torch.min(boxes1_tensor[:, None, 2:], boxes2_tensor[:, 2:])  # [N,M,2]

    N = int(len(boxes1))
    M = int(len(boxes2))
    iou = torch.zeros([N,M]).to(boxes1.device)

    for i in range(0, N, 20):
        area1 = boxes1[i:min(i+20, N)].area()

        wh = (rb[i:min(i+20, N), :] - lt[i:min(i+20, N), :]).clamp(min=0)  # [<=20,M,2]
        inter = wh[:, :, 0] * wh[:, :, 1]  # [<=20,M]

        # handle empty boxes
        iou[i:min(i+20, N), :] = torch.where(
            inter > 0,
            inter / (area1[:, None] + area2 - inter),
            torch.zeros(1, dtype=inter.dtype, device=inter.device),
        )
    return iou

The original code can be found at https://github.com/facebookresearch/detectron2/blob/master/detectron2/structures/boxes.py#L235. It's very similar to maskrcnn_benchmark, and can be adapted to it. I broke it into chunks of size 20. Now at least I can train on my custom dataset with a lot of instances per image.

yonkshi commented 4 years ago

@ethanweber Interesting solution. Did you get a chance to compare it against the torch.jit solution? I'm currently using the CPU method, and it's awfully slow. I was considering the JIT method, but yours seems even better.

ethanweber commented 4 years ago

I didn't compare it with torch.jit, as I got some errors from my PyTorch version when trying it. With the CPU method, the estimated training time went from a few hours to days. It's back to normal (a few hours) with this method, because it still uses the GPU, just without allocating a huge amount of memory at once.