Fangyi-Chen / SQR

MIT License

More GPU memory cost when adding SQR to DN-DETR #6

Open powermano opened 1 year ago

powermano commented 1 year ago

Do you also have the problem of using a lot of GPU memory when training DN-DETR?

I asked you related questions on Zhihu; thank you very much for your patient answers.

I have fixed the issue with reference point updating when using Iterative Bounding Box Refinement: the reference point needs to be preserved for each query, since it varies.
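
To make concrete what I mean, here is a minimal sketch (illustrative names only, not the actual detrex/SQR code): instead of overwriting a single reference tensor, the reference produced at every stage is kept, so each query group can later be decoded against the reference of the stage it came from.

    import torch

    def inverse_sigmoid(x, eps=1e-5):
        x = x.clamp(min=eps, max=1 - eps)
        return torch.log(x / (1 - x))

    def decode_with_refinement(decoder_layers, queries, reference, bbox_head):
        # Iterative box refinement that keeps every intermediate reference.
        # decoder_layers are callables taking (queries, reference); all names
        # here are hypothetical stand-ins.
        references = [reference]                 # one entry per stage, never overwritten
        for layer in decoder_layers:
            queries = layer(queries, references[-1])
            delta = bbox_head(queries)           # predicted box offset
            new_ref = (inverse_sigmoid(references[-1]) + delta).sigmoid()
            references.append(new_ref.detach())  # detach: each stage refines independently
        return queries, references

    # toy usage only; real layers would be decoder layers with cross attention
    layers = [lambda q, r: q + 0.1 * torch.tanh(q) for _ in range(6)]
    _, refs = decode_with_refinement(layers, torch.randn(2, 300, 256),
                                     torch.rand(2, 300, 4), torch.nn.Linear(256, 4))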

Fangyi-Chen commented 1 year ago

We also observe a (predictable) GPU memory increase during the training of any SQR-based method, because an increased number of queries flows through multiple decoding layers and their activations have to be kept for the backward pass.
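
As a rough back-of-the-envelope illustration (the tensor sizes below are made-up defaults, the group counts assume a Fibonacci-like recollection schedule over six decoder stages, and only the query tensors themselves are counted, not attention or FFN activations):

    # Rough estimate only: how the query activations grow when earlier-stage
    # queries are recollected into later stages. Sizes and schedule are assumptions.
    num_queries, hidden_dim, bytes_per_elem = 300, 256, 4   # made-up defaults
    group_counts = [1, 2, 3, 5, 8, 13]                      # query groups per decoder stage

    plain = len(group_counts) * num_queries * hidden_dim * bytes_per_elem
    recollected = sum(group_counts) * num_queries * hidden_dim * bytes_per_elem
    print(f"plain decoder:     {len(group_counts)} query sets, ~{plain / 2**20:.1f} MiB")
    print(f"with recollection: {sum(group_counts)} query sets, ~{recollected / 2**20:.1f} MiB")

Under these assumed numbers the decoder keeps about 32 query sets instead of 6, so the growth is predictable from the recollection schedule.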

We did consider how to reduce the negative side effect brought by SQR, i.e., the additional training time. Since we used A100-80GB GPUs, GPU memory was not a concern for us. Different implementations can lead to very different GPU memory overhead. Our implementation is very simple and easy to understand -- basically only a few lines of code -- but it is not the most efficient one. We would be glad to receive any advice on a faster implementation of SQR!

I also noticed that Group DETR and H-DETR should have similar operations for handling 'groups of queries'. I will take a look at their implementations and see whether they are faster. If they are, I will try to update the implementation in this repo.

Finally, do not hesitate to ask if you have any further questions. Thanks!

powermano commented 1 year ago

SQR does consume some additional GPU memory, but not too much; the extra cost was mostly due to problems in my own code. I have already fixed them, and the training GPU memory is now within an acceptable range.

On my own dataset, AP increased by 2.5 points compared to DN-DETR; very good work. The DINO trick (look forward twice) can be combined with SQR and brings a further improvement.
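
For reference, a minimal sketch of what "look forward twice" changes in the refinement loop (illustrative names, not DINO's or this repo's actual implementation): the reference fed to the next layer is still detached, but the box reported for layer i's loss is built from the undetached reference of layer i-1, so each layer's box head also receives gradient from the next layer's loss.

    import torch

    def inverse_sigmoid(x, eps=1e-5):
        x = x.clamp(min=eps, max=1 - eps)
        return torch.log(x / (1 - x))

    def refine_look_forward_twice(decoder_layers, queries, reference, bbox_head):
        # Sketch of DINO-style "look forward twice"; all names are stand-ins.
        ref_detached = reference     # conditions the next layer (no gradient)
        ref_undetached = reference   # builds the boxes reported for the loss
        outputs = []
        for layer in decoder_layers:
            queries = layer(queries, ref_detached)
            delta = bbox_head(queries)
            # box for this layer's loss: built on the undetached reference of the
            # previous layer, so the previous layer is "looked at" twice
            outputs.append((inverse_sigmoid(ref_undetached) + delta).sigmoid())
            # reference that continues the refinement: detached, as in plain refinement
            ref_undetached = (inverse_sigmoid(ref_detached) + delta).sigmoid()
            ref_detached = ref_undetached.detach()
        return queries, outputs

    # toy usage only, to exercise the control flow
    layers = [lambda q, r: q + 0.1 * torch.tanh(q) for _ in range(3)]
    _, boxes = refine_look_forward_twice(layers, torch.randn(2, 5, 16),
                                         torch.rand(2, 5, 4), torch.nn.Linear(16, 4))

When combining this with SQR, the same two-reference bookkeeping would presumably have to be kept per recollected query group.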

SYH9905 commented 1 year ago

Hi, how should look forward twice be combined with SQR? Models like DN-DETR and DAB-DETR both carry x, y, w, h parameters, and during the refine operation a box may be jointly influenced by several of the following layers. How can this problem be solved?

SYH9905 commented 1 year ago

Previously, when doing look forward twice with DAB-DETR, each layer was isolated via detach().

SYH9905 commented 1 year ago

Hi, will the code for SQR-DAB-DETR be released? I would like to know how to combine SQR with the refine operation when DAB-DETR already uses look forward twice.

powermano commented 1 year ago

Following is my sqr_dn-detr code, based on the https://github.com/IDEA-Research/detrex repo:

        hidden_states, reference_boxes = self.transformer(
            features,
            img_masks,
            input_box_query,
            pos_embed,
            target=input_label_query,
            attn_mask=[attn_mask, None],  # None mask for cross attention
        )

        if self.training:
            outputs_classes, outputs_coords = [], []  # collected per query group
            for qid in range(hidden_states.shape[0]):
                # version 2: use the corresponding reference to update the new reference point.
                # Map each query group (qid) to the index (lvl) of the reference boxes it is
                # decoded against; the qid ranges below have sizes 1, 2, 3, 5, 8, 13.
                if qid < 1:
                    lvl = 0
                elif qid >= 1 and qid < 3:
                    lvl = qid - 1
                elif qid >= 3 and qid < 6:
                    lvl = qid - 2
                elif qid >= 6 and qid < 11:
                    lvl = qid - 4
                elif qid >= 11 and qid < 19:
                    lvl = qid - 7
                elif qid >= 19 and qid < 32:
                    lvl = qid - 12
                else:
                    assert False

                reference = reference_boxes[lvl]

                # Calculate output coordinates and classes.
                reference = inverse_sigmoid(reference)
                anchor_box_offsets = self.bbox_embed(hidden_states[qid])
                outputs_coord = (reference + anchor_box_offsets).sigmoid()
                outputs_class = self.class_embed(hidden_states[qid])  #(layers, bs, num_q+dn_group*max_gt_per_img, 1)

                outputs_coords.append(outputs_coord)
                outputs_classes.append(outputs_class)

            outputs_class = torch.stack(outputs_classes)
            outputs_coord = torch.stack(outputs_coords)
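
For readability, the hard-coded qid to lvl chain can also be written as a small lookup table; this is just a restructuring sketch with the same numbers (the ranges of size 1, 2, 3, 5, 8, 13 are taken directly from the code above):

    # Same qid -> lvl mapping as the if/elif chain above, collected in one table.
    # (first_qid, last_qid_exclusive, offset): lvl = qid - offset
    QID_RANGES = [
        (0, 1, 0),
        (1, 3, 1),
        (3, 6, 2),
        (6, 11, 4),
        (11, 19, 7),
        (19, 32, 12),
    ]

    def qid_to_lvl(qid: int) -> int:
        for lo, hi, offset in QID_RANGES:
            if lo <= qid < hi:
                return qid - offset
        raise ValueError(f"unexpected qid: {qid}")

With this helper, the loop body above reduces to lvl = qid_to_lvl(qid).
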
SYH9905 commented 1 year ago

Thank you for your help and answers.