IDEA-Research / DINO

[ICLR 2023] Official implementation of the paper "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection"

Attempt at implementing the LFT, CDN, and MQS modules. #12

Closed Li-Qingyun closed 2 years ago

Li-Qingyun commented 2 years ago

Thanks to the authors for their awesome work! Although the source code is not available yet, I can't wait to attempt to reproduce the modules.

I'm submitting this issue to ask whether the LFT module can be implemented by modifying the following single line of code.

The original code in dn-dab-deformable-detr is:

intermediate_reference_points.append(reference_points)

The modified code is:

intermediate_reference_points.append(new_reference_points)
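
For reference, here is a self-contained sketch of the refinement loop that line lives in (my own simplification; names follow dn-dab-deformable-detr, and the attention layers are abstracted away as precomputed per-layer outputs):

import torch

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(min=eps, max=1 - eps)
    return torch.log(x / (1 - x))

def refine_boxes(reference_points, layer_outputs, bbox_heads):
    """Toy decoder refinement loop illustrating "look forward twice" (LFT).

    reference_points: (num_queries, 4) boxes in sigmoid space
    layer_outputs:    per-layer decoder outputs, each (num_queries, d_model)
    bbox_heads:       per-layer box heads, e.g. nn.Linear(d_model, 4)
    """
    intermediate_reference_points = []
    for output, head in zip(layer_outputs, bbox_heads):
        delta = head(output)
        new_reference_points = (delta + inverse_sigmoid(reference_points)).sigmoid()
        # as in Deformable DETR, the gradient is still blocked between layers ...
        reference_points = new_reference_points.detach()
        # ... but the refined (non-detached) boxes are recorded for this layer's
        # prediction, so layer i's loss also updates layer i's box head
        intermediate_reference_points.append(new_reference_points)  # LFT: was reference_points
    return torch.stack(intermediate_reference_points)
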
SlongLiu commented 2 years ago

Yes, you are right. It can be implemented with the modification.

Li-Qingyun commented 2 years ago

> Yes, you are right. It can be implemented with the modification.

Thanks for your reply! Additionally, for the CDN module, I'm still confused about how to decide whether a query is positive or negative.

Li-Qingyun commented 2 years ago

In the original DN:

The generated noise is no larger than $\lambda$, as DN-DETR wants the model to reconstruct the ground truth (GT) from moderately noised queries.

The corresponding code is:

# noise on the box
if box_noise_scale > 0:
    diff = torch.zeros_like(known_bbox_expand)
    diff[:, :2] = known_bbox_expand[:, 2:] / 2  # centers may shift by at most half the box size
    diff[:, 2:] = known_bbox_expand[:, 2:]      # width/height may change by at most their own size
    known_bbox_expand += torch.mul((torch.rand_like(known_bbox_expand) * 2 - 1.0),
                                   diff).cuda() * box_noise_scale
    known_bbox_expand = known_bbox_expand.clamp(min=0.0, max=1.0)

In the implementation, noise offsets of different values are added to the four coordinates of the GT boxes.

The degree of noise on $x, y, w, h$ varies, but each normalized offset is bounded by `box_noise_scale`. The implementation is concise and unambiguous.
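
To make the bound concrete, a quick sanity check with toy numbers of my own (the box and $\lambda$ below are made up):

import torch

box = torch.tensor([0.5, 0.5, 0.2, 0.4])  # (cx, cy, w, h), toy values
box_noise_scale = 0.4                      # lambda

diff = torch.zeros_like(box)
diff[:2] = box[2:] / 2   # centers move by at most half the box size
diff[2:] = box[2:]       # width/height change by at most their own size
offset = (torch.rand_like(box) * 2 - 1.0) * diff * box_noise_scale

# |offset| <= lambda * (w/2, h/2, w, h) = (0.04, 0.08, 0.08, 0.16)
assert torch.all(offset.abs() <= diff * box_noise_scale)
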

In the CDN:

Positive queries within the inner square have a noise scale smaller than $\lambda_1$ and are expected to reconstruct their corresponding ground-truth boxes. Negative queries between the inner and outer squares have a noise scale larger than $\lambda_1$ and smaller than $\lambda_2$. They are expected to predict "no object".

Does this mean that the random offsets added to the four coordinates of a GT box for generating a negative example are all larger than $\lambda_1$ and smaller than $\lambda_2$? [Plan A]

Or, put another way: if all offsets added to $x, y, w, h$ are less than $\lambda_1$, the query is positive; otherwise, if any of the four offsets for a GT box is greater than $\lambda_1$, it is judged to be a negative query. [Plan B]

In the above plans, following DN, I generate random noise offsets at scale $\lambda_2$ and set various conditions to decide whether a query should be positive or negative, which makes it difficult to keep the numbers of positive and negative queries equal. **[Plan A] and [Plan B]**

Or maybe I should generate the two kinds of noise separately: for positive queries, the noise offsets are random values between $-\lambda_1$ and $\lambda_1$; for negative ones, they lie between $-\lambda_2$ and $-\lambda_1$ or between $\lambda_1$ and $\lambda_2$. [Plan C]
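
A compact way to sample such "annulus" offsets for [Plan C] (my own sketch; lam1 and lam2 stand for $\lambda_1$ and $\lambda_2$) is to draw a magnitude in $[\lambda_1, \lambda_2]$ and an independent random sign per coordinate:

import torch

def sample_negative_offsets(shape, lam1, lam2):
    """Per-coordinate offsets with magnitude in [lam1, lam2] and random sign,
    i.e. noise between the inner and outer squares of CDN (a sketch)."""
    magnitude = lam1 + (lam2 - lam1) * torch.rand(shape)
    sign = torch.randint(0, 2, shape) * 2.0 - 1.0
    return sign * magnitude

neg = sample_negative_offsets((8, 4), lam1=0.4, lam2=1.0)
assert torch.all((neg.abs() >= 0.4) & (neg.abs() <= 1.0))
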

My implementation of [Plan A] is:

# box_noise_scale: Sequence[float] holding (lambda_1, lambda_2)
box_noise_scale = sorted(box_noise_scale)
if box_noise_scale[1] > 0:
    midlake_island_ratio = box_noise_scale[0] / box_noise_scale[1]  # lambda_1 / lambda_2
    diff = torch.zeros_like(known_bbox_expand)
    diff[:, :2] = known_bbox_expand[:, 2:] / 2
    diff[:, 2:] = known_bbox_expand[:, 2:]
    rand_scale = torch.rand_like(known_bbox_expand) * 2 - 1.0
    known_bbox_expand += torch.mul(rand_scale, diff).cuda() * box_noise_scale[1]
    known_bbox_expand = known_bbox_expand.clamp(min=0.0, max=1.0)
    # change the gt category of the negative sample to no obj.
    # (compare magnitudes, since rand_scale lies in [-1, 1])
    negative_indice = torch.nonzero(
        torch.all(rand_scale.abs() >= midlake_island_ratio, dim=1)).view(-1)
    negative_label = torch.zeros_like(negative_indice)
    known_labels.scatter_(0, negative_indice, negative_label)

My implementation of [Plan B] differs from [Plan A] by only one line:

# box_noise_scale: Sequence[float] holding (lambda_1, lambda_2)
box_noise_scale = sorted(box_noise_scale)
if box_noise_scale[1] > 0:
    midlake_island_ratio = box_noise_scale[0] / box_noise_scale[1]  # lambda_1 / lambda_2
    diff = torch.zeros_like(known_bbox_expand)
    diff[:, :2] = known_bbox_expand[:, 2:] / 2
    diff[:, 2:] = known_bbox_expand[:, 2:]
    rand_scale = torch.rand_like(known_bbox_expand) * 2 - 1.0
    known_bbox_expand += torch.mul(rand_scale, diff).cuda() * box_noise_scale[1]
    known_bbox_expand = known_bbox_expand.clamp(min=0.0, max=1.0)
    # change the gt category of the negative sample to no obj.
    negative_indice = torch.nonzero(
        torch.any(rand_scale.abs() >= midlake_island_ratio, dim=1)).view(-1)  # The modification
    negative_label = torch.zeros_like(negative_indice)
    known_labels.scatter_(0, negative_indice, negative_label)
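
As a side note on why the positive/negative split is hard to control under [Plan A] and [Plan B]: with each normalized offset uniform on $[-1, 1]$ and threshold $r = \lambda_1 / \lambda_2$, the chance that a query becomes negative is roughly $(1 - r)^4$ under [Plan A] and $1 - r^4$ under [Plan B], so the split depends on $r$ instead of being fixed at 1:1. A quick Monte Carlo check (my own snippet):

import torch

r = 0.5  # lambda_1 / lambda_2
u = torch.rand(1_000_000, 4) * 2 - 1.0  # normalized offsets in [-1, 1]

p_neg_a = torch.all(u.abs() >= r, dim=1).float().mean()  # ~ (1 - r)^4 = 0.0625
p_neg_b = torch.any(u.abs() >= r, dim=1).float().mean()  # ~ 1 - r^4  = 0.9375
print(p_neg_a.item(), p_neg_b.item())
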

My implementation of [Plan C] is:

......

# add noise
known_indice = known_indice.repeat(scalar * 2, 1).view(-1)
known_labels = labels.repeat(scalar * 2, 1).view(-1)
known_bid = batch_idx.repeat(scalar * 2, 1).view(-1)
known_bboxs = boxes.repeat(scalar * 2, 1)
known_labels_expaned = known_labels.clone()
known_bbox_expand = known_bboxs.clone()

# noise on the label
if label_noise_scale > 0:
    p = torch.rand_like(known_labels_expaned.float())
    chosen_indice = torch.nonzero(p < (label_noise_scale)).view(-1)  # usually half of bbox noise
    new_label = torch.randint_like(chosen_indice, 0, num_classes)  # randomly put a new one here
    known_labels_expaned.scatter_(0, chosen_indice, new_label)
# change the gt category of the negative sample to no obj.
# layout per group: the first num_targets rows are positive, the next num_targets negative
chosen_negative_indice = torch.ones_like(known_labels)
num_targets = int(len(chosen_negative_indice) / (scalar * 2))
for i in range(scalar):
    chosen_negative_indice[2 * i * num_targets: (2 * i + 1) * num_targets] = 0
chosen_negative_indice = torch.nonzero(chosen_negative_indice).view(-1)
negative_labels = torch.zeros_like(chosen_negative_indice)
known_labels.scatter_(0, chosen_negative_indice, negative_labels)

# noise on the box
box_noise_scale = sorted(box_noise_scale)
if box_noise_scale[0] > 0:
    diff = torch.zeros_like(known_bbox_expand)
    diff[:, :2] = known_bbox_expand[:, 2:] / 2
    diff[:, 2:] = known_bbox_expand[:, 2:]
    rand_scale = torch.rand_like(known_bbox_expand) * 2 - 1.0
    scale = torch.ones_like(known_bbox_expand)
    bias = torch.zeros_like(known_bbox_expand)
    for i in range(4):
        # negatives: offset = rand * (lambda_2 - lambda_1) +/- lambda_1,
        # i.e. the offset magnitude lies in [lambda_1, lambda_2]
        temp = (box_noise_scale[1] - box_noise_scale[0]) / box_noise_scale[0]
        scale[:, i].scatter_(
            0, chosen_negative_indice, torch.ones_like(chosen_negative_indice.float()) * temp)
        bias[:, i].scatter_(0, chosen_negative_indice, torch.ones_like(chosen_negative_indice.float()))
        for j in range(scalar):
            # flip the bias sign for negative queries whose raw offset is negative
            minus_indice = (2 * j + 1) * num_targets + torch.nonzero(
                rand_scale[(2 * j + 1) * num_targets: 2 * (j + 1) * num_targets, i] < 0).view(-1)
            bias[:, i].scatter_(0, minus_indice, -torch.ones_like(minus_indice.float()))
    scale *= box_noise_scale[0]  # positives keep offsets within [-lambda_1, lambda_1]
    bias *= box_noise_scale[0]
    rand_scale = rand_scale * scale + bias

    known_bbox_expand += torch.mul(rand_scale, diff).cuda()
    known_bbox_expand = known_bbox_expand.clamp(min=0.0, max=1.0)

......
single_pad = int(max(known_num)) * 2  # modification
......
# map in order
map_known_indice = torch.tensor([]).to('cuda')
if len(known_num):
    map_known_indice = torch.cat([torch.tensor(range(num * 2)) for num in known_num])  # [1,2, 1,2,3] # modification
    map_known_indice = torch.cat([map_known_indice + single_pad * i for i in range(scalar)]).long()
if len(known_bid):
    input_query_label[(known_bid.long(), map_known_indice)] = input_label_embed
    input_query_bbox[(known_bid.long(), map_known_indice)] = input_bbox_embed
......
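
(The single_pad and index-mapping changes above are needed because each GT box now yields two queries per group, one positive and one negative, so every group's slot count doubles.)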

The above three plans all have certain problems. For [Plan A] and [Plan B], the group scheme of CDN (if an image has $n$ GT boxes, a CDN group will have $2n$ queries, with each GT box generating one positive and one negative query) is hard to achieve. For [Plan C], my implementation is not concise, and the decision domains of positive and negative queries are not contiguous (when the offsets of all four coordinates are less than $\lambda_1$, the query is positive; when they are all greater than $\lambda_1$, it is negative; the cases where some offsets are greater than $\lambda_1$ and some are less are ignored, even though such anchors should probably also be rejected).

Therefore, I'm curious how you implemented the CDN module. More details or pseudocode would be much appreciated.

Thank you very much! :grinning:

FengLi-ust commented 2 years ago

Hey, sorry for the late reply. Our code is available now; you can refer to it for more details.

Li-Qingyun commented 2 years ago

> Hey, sorry for the late reply. Our code is available now; you can refer to it for more details.

OK, thank you!

Li-Qingyun commented 2 years ago

@FengLi-ust @SlongLiu I have finished reading the code. Your implementation is concise. Thanks for your replies and innovative work!
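
For anyone else reading this issue, my rough takeaway from models/dino/dn_components.py (simplified and paraphrased from memory, not a verbatim copy; see the released code for the authoritative version): positives draw per-coordinate noise magnitudes in $[0, 1)$ and negatives in $[1, 2)$, each with an independent random sign, so effectively $\lambda_2 = 2\lambda_1$ and every group holds exactly one positive and one negative query per GT box.

import torch

def cdn_box_noise(known_bboxs, negative_idx, box_noise_scale=1.0):
    """Sketch of the released CDN box-noising scheme, as I read it
    (simplified from models/dino/dn_components.py; not a verbatim copy).

    known_bboxs:  (N, 4) GT boxes (cx, cy, w, h), positives and negatives stacked
    negative_idx: row indices that should become negative queries
    """
    diff = torch.zeros_like(known_bboxs)
    diff[:, :2] = known_bboxs[:, 2:] / 2
    diff[:, 2:] = known_bboxs[:, 2:] / 2
    rand_sign = torch.randint_like(known_bboxs, 0, 2) * 2.0 - 1.0  # random sign per coordinate
    rand_part = torch.rand_like(known_bboxs)   # positives: magnitude in [0, 1)
    rand_part[negative_idx] += 1.0             # negatives: shifted into [1, 2)
    noised = known_bboxs + rand_sign * rand_part * diff * box_noise_scale
    return noised.clamp(min=0.0, max=1.0)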