IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0

Understanding of num_classes in code #174

Closed: weicheng113 closed this issue 1 year ago

weicheng113 commented 1 year ago

Dear authors, thanks for sharing these high-performance models.

I am reading through the DINO model code and have a few questions. Could you please help me?

  1. Is num_classes = actual_num_classes + 1 (background class) in detrex? https://github.com/IDEA-Research/detrex/blob/697f5e9dafab6ea1769ec2ea1e0b65351273aa32/projects/dino/modeling/dino.py#L101

The reason I am asking is that the background class would also need a label, as I can see in the DINO repo:

https://github.com/IDEA-Research/DINO/blob/66d7173cc4167934381a898b07c08507bdd96b63/models/dino/dino.py#L81 self.label_enc = nn.Embedding(dn_labelbook_size + 1, hidden_dim)

  2. src_logits.shape is [batch_size, n_dn_queries=900, num_classes] and target_classes.shape is [batch_size, n_dn_queries=900]. Does num_classes include the background class, meaning the last logit value is for the background class? Do the label IDs for classes start from 1 (I can see category_id starts from 1 in the COCO dataset)? https://github.com/IDEA-Research/detrex/blob/697f5e9dafab6ea1769ec2ea1e0b65351273aa32/detrex/modeling/criterion/criterion.py#L112

https://github.com/IDEA-Research/detrex/blob/697f5e9dafab6ea1769ec2ea1e0b65351273aa32/detrex/modeling/criterion/criterion.py#L122

If num_classes already includes the background class, then the +1 in this line is not needed (though cross-entropy loss is not in use, so it does not matter)? https://github.com/IDEA-Research/detrex/blob/697f5e9dafab6ea1769ec2ea1e0b65351273aa32/detrex/modeling/criterion/criterion.py#L103

I was trying to apply the DINO model to my custom dataset. So far it trains, but the performance is not very good. I think I might be misunderstanding num_classes.

======= UPDATE =======

https://github.com/IDEA-Research/detrex/blob/697f5e9dafab6ea1769ec2ea1e0b65351273aa32/detrex/modeling/criterion/criterion.py#L141

I went through the code a second time. It looks like, for focal loss, num_classes only needs to equal actual_num_classes (without +1 for the background class). For example, take a dataset with 2 classes, 0 and 1. The logits for each prediction only need 2 numbers, e.g. [0.0145, 0.0111]. If a prediction is mapped to the background class '2', its one-hot encoding would be [0, 0, 1]. We can cut off the last digit of the one-hot so that [0.0145, 0.0111] is compared against [0, 0].

So with focal loss, we only need to set num_classes = actual_num_classes (without +1 for the background class) everywhere, including the following two locations:

https://github.com/IDEA-Research/detrex/blob/697f5e9dafab6ea1769ec2ea1e0b65351273aa32/projects/dino/modeling/dino.py#L92

https://github.com/IDEA-Research/detrex/blob/697f5e9dafab6ea1769ec2ea1e0b65351273aa32/projects/dino/modeling/dino.py#L101

The background class concept is confined to the SetCriterion class: when producing the one-hot encoding, the last digit is cut off, so background targets become all zeros (see the sketch after the link below).

https://github.com/IDEA-Research/detrex/blob/697f5e9dafab6ea1769ec2ea1e0b65351273aa32/detrex/modeling/criterion/criterion.py#L116
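Here is a minimal, hypothetical sketch of that one-hot-and-slice idea (the numbers are made up; it is not a copy of the detrex code):

```python
import torch

# Toy example: 1 image, 3 queries, actual_num_classes = 2 (labels 0 and 1).
num_classes = 2
# Query 3 is unmatched, so it gets the "background" id = num_classes.
target_classes = torch.tensor([[0, 1, 2]])

# Build the one-hot with an extra slot for the background id, then drop that
# slot, so background targets become all zeros for the focal loss.
onehot = torch.zeros(1, 3, num_classes + 1)
onehot.scatter_(2, target_classes.unsqueeze(-1), 1)
onehot = onehot[:, :, :-1]
print(onehot)
# tensor([[[1., 0.],
#          [0., 1.],
#          [0., 0.]]])
```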

Is my understanding correct?

Thanks, Cheng

HaoZhang534 commented 1 year ago
  1. self.label_enc is used for the CDN queries. We do not need a class embedding for the background because CDN queries are generated from GT objects, which all have a class. Actually, the extra class embedding is not used in the DINO repo.
  2. label_id in detrex starts from 0.
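Since the label IDs start from 0 while COCO-style category_id values usually start from 1, a small, hypothetical remapping sketch for a custom dataset could look like this (the category IDs here are placeholders):

```python
# Hypothetical example: two COCO categories with category_id 1 and 2 in the JSON.
coco_category_ids = [1, 2]

# Map (possibly non-contiguous) category_ids to contiguous 0-based label ids,
# which is what the model's classification head indexes into.
id_to_label = {cid: i for i, cid in enumerate(sorted(coco_category_ids))}
print(id_to_label)  # {1: 0, 2: 1}
```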
weicheng113 commented 1 year ago

@HaoZhang534 Thanks a lot. I think I understand now.

By the way, I applied DINO to a custom dataset with only 2 classes. I transformed the annotations into COCO JSON format, but the metrics did not work; the evaluator printed something like the output below. The model could learn, as the loss was going down as expected. I also tested the trained model and its predictions were not bad. Could you offer some advice on this? Thanks.

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
HaoZhang534 commented 1 year ago

Hi @weicheng113, can you provide more details, for example why you say "I also tested the trained model and its predictions were not bad" given that the metrics did not work?

weicheng113 commented 1 year ago

@HaoZhang534 Thanks for your time and help.

I meant the loss was decreasing as expected (the initial loss was around 60 to 70). Take the loss output below as an example.

loss=7.69, lr=0.0001, loss_class=0.00258, loss_bbox=0.0155, loss_giou=0.222...]  

I also loaded the trained model and tested it on images; it gave fairly good predictions on many of them. Therefore, I think the model was learning from the custom dataset.

I am not sure where I misconfigured the evaluator part, as it depends on the annotation file in COCO format (self._coco_api = COCO(json_file)).
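For reference, since detrex builds on detectron2, a COCO-format custom dataset is usually registered along these lines (a rough sketch; the dataset names and paths are placeholders, not from this thread):

```python
from detectron2.data.datasets import register_coco_instances

# Register train/val splits so the evaluator can locate the annotation JSON
# (this is where COCO(json_file) ultimately gets its file from).
register_coco_instances(
    "my_dataset_train", {}, "datasets/my_dataset/annotations_train.json", "datasets/my_dataset/train_images"
)
register_coco_instances(
    "my_dataset_val", {}, "datasets/my_dataset/annotations_val.json", "datasets/my_dataset/val_images"
)
```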

By the way, when I tried training the model on coco-minitrain, it gave correct metrics. I have not looked into the evaluator code yet, as it uses several classes such as COCOEvaluator and COCOeval. Below is an example of the metrics output when I trained on the coco-minitrain data.

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.496
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.683
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.541
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.308
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.530
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.677
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.385
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.630
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.708
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.510
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.755
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.886
HaoZhang534 commented 1 year ago

Hi @weicheng113, I suggest you visualize the predictions on the validation set to see whether the problem is in the model or in the evaluator.

weicheng113 commented 1 year ago

Hi @HaoZhang534, thanks a lot. I got it working by using MeanAveragePrecision from torchmetrics, which has a simpler interface. I will continue with training. When I get more metrics information, I will ask for your advice on fine-tuning.
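In case it helps anyone else, a minimal sketch of the torchmetrics-based evaluation I mean (the boxes, scores and labels below are made-up placeholders):

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# Boxes are expected in xyxy pixel coordinates; class_metrics=True also
# reports per-class AP/AR alongside the overall values.
metric = MeanAveragePrecision(class_metrics=True)

preds = [dict(
    boxes=torch.tensor([[10.0, 20.0, 110.0, 220.0]]),
    scores=torch.tensor([0.9]),
    labels=torch.tensor([0]),
)]
targets = [dict(
    boxes=torch.tensor([[12.0, 18.0, 108.0, 215.0]]),
    labels=torch.tensor([0]),
)]

metric.update(preds, targets)
print(metric.compute())  # dict with 'map', 'map_50', 'map_75', 'map_per_class', ...
```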

weicheng113 commented 1 year ago

@HaoZhang534 may I ask for your suggestions on the following?

I have a dataset of about 1,100 examples, with 20% for validation and 80% for training. The dataset has only two categories, 0 and 1.

Below is a histogram of the number of instances per image file. Most images have two objects, with a maximum of 4 objects in an image.

[histogram: number of instances per image]

Category 1 has far fewer instances than category 0, as shown below, which I think is okay for focal loss. [histogram: instance counts per category]

=========

I tried the following configuration (see the optimizer/scheduler sketch after the list).

num_queries: 40
num_dn_queries: 10 (if there are 2 instances in an image, there will be 5 DN groups: 5 * [2 * 2])
num_select: 20 (select the top-20 most confident predictions for validation)
num_epoch: 50 (with StepLR(optimizer=optimizer, step_size=20))
train_batch_size: 2
gradient_accumulation_steps: 1 (single-GPU machine only; a gradient descent step at every iteration)
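A rough sketch of what that schedule looks like in PyTorch, assuming an AdamW optimizer (my assumption, not necessarily the detrex training code):

```python
import torch

# Stand-in module just to make the sketch runnable; in practice this would be
# the DINO model being trained on the custom dataset.
model = torch.nn.Linear(10, 2)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer=optimizer, step_size=20)

for epoch in range(50):
    # ... one training epoch over the custom dataset would go here ...
    scheduler.step()  # with step_size=20, the LR decays at epochs 20 and 40
    print(epoch, scheduler.get_last_lr())
```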

After 50 epochs I got the results below, which are not as good as our YOLOv5 version:

Epoch 49: 100%|██████████| 912/912 [07:43<00:00,  1.97it/s, loss=5.56, v_num=46, lr=1e-6, loss_class=0.013, loss_bbox=0.0112, loss_giou=0.142, val_loss=2.130]
{'map': tensor(0.7773),
 'map_50': tensor(0.9670),
 'map_75': tensor(0.9138),
 'map_large': tensor(0.8001),
 'map_medium': tensor(0.4979),
 'map_per_class': tensor(-1.),
 'map_small': tensor(-1.),
 'mar_1': tensor(0.5967),
 'mar_10': tensor(0.8588),
 'mar_100': tensor(0.8595),
 'mar_100_per_class': tensor(-1.),
 'mar_large': tensor(0.8821),
 'mar_medium': tensor(0.5853),
 'mar_small': tensor(-1.)}

======= After looking at the above instance stats, I am thinking of experimenting with the following two configurations:

Configuration 1:

num_queries: 60
num_dn_queries: 16 (if there are 2 instances in an image, there will be 8 DN groups: 8 * [2 * 2])
num_select: 6 (instead of using 4, leaving a bit of room)
num_epoch: 50 (with StepLR(optimizer=optimizer, step_size=20))
train_batch_size: 2
gradient_accumulation_steps: 1 (single-GPU machine only; a gradient descent step at every iteration)

Configuration 2:

num_queries: 60
num_dn_queries: 16 (if there are 2 instances in an image, there will be 8 DN groups: 8 * [2 * 2])
num_select: 6 (instead of using 4, leaving a bit of room)
num_epoch: 50 (with StepLR(optimizer=optimizer, step_size=20))
train_batch_size: 2
gradient_accumulation_steps: 3 (single-GPU machine only; a gradient descent step every 3 iterations)

Any advice is highly appreciated.

Thanks, Cheng

HaoZhang534 commented 1 year ago

Is the mAP 0.777% or 77.7%? Given that you have at most 4 instances per image, you can use a small num_queries such as 40. num_select can be set larger, such as 40. You can also load our COCO pre-trained model and fine-tune it on your dataset.

weicheng113 commented 1 year ago

@HaoZhang534 Thanks a lot. mAP is 77.7%.

Could you explain why to use a larger num_select? I thought num_select should be close to the maximum number of instances.

Good suggestion. I will try the pre-trained model.

I found that the model did not do well on category 1 (which has fewer instances). It also produces overlapping predictions, with two similar bounding boxes predicted, which is a bit of a surprise, as CDN was created to eliminate duplicate predictions.

Thanks, Cheng

HaoZhang534 commented 1 year ago

@weicheng113 A larger num_select usually leads to higher AP. Regarding category 1, you may try some tricks to balance the proportions of the two categories. For example, you can add some copies of images containing category 1 to the training data.
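To make the num_select point concrete, here is a rough sketch of the Deformable-DETR/DINO-style top-k post-processing that num_select controls (a simplification, not the exact detrex code). Since COCO-style AP evaluates up to 100 detections per image, keeping more low-score candidates tends to add recall without hurting precision at the high-score end:

```python
import torch

def select_topk(pred_logits, pred_boxes, num_select=40):
    """pred_logits: [batch, num_queries, num_classes]; pred_boxes: [batch, num_queries, 4]."""
    prob = pred_logits.sigmoid()
    num_classes = pred_logits.shape[2]
    # Take the top num_select (query, class) pairs over the flattened scores.
    scores, topk_indexes = torch.topk(prob.flatten(1), num_select, dim=1)
    topk_query = torch.div(topk_indexes, num_classes, rounding_mode="floor")
    labels = topk_indexes % num_classes
    boxes = torch.gather(pred_boxes, 1, topk_query.unsqueeze(-1).repeat(1, 1, 4))
    return scores, labels, boxes
```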

weicheng113 commented 1 year ago

@HaoZhang534 Thanks for the explanation. For final inference, is there a rule for picking the confidence_threshold? I saw 0.5 as the default in detrex and 0.3 in the DINO repo.

HaoZhang534 commented 1 year ago

@weicheng113 Which confidence_threshold do you mean? We do not use a confidence_threshold for evaluation. We use confidence_threshold=0.3 for visualization.

weicheng113 commented 1 year ago

Got you, thanks a lot @HaoZhang534 .

weicheng113 commented 1 year ago

@HaoZhang534 I realized I can't directly load a pretrained DINO model, as there are incompatibilities in num_classes, num_queries and num_dn_queries. I can load the other weights, such as the transformer weights, but weights like class_embed and label_enc need to be retrained.
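A rough sketch of the kind of partial loading I mean, assuming a plain torch checkpoint with the weights under a "model" key (the filtering logic here is hand-rolled, not a detrex utility):

```python
import torch

def load_compatible_weights(model: torch.nn.Module, ckpt_path: str) -> None:
    """Load only the checkpoint tensors whose names and shapes match the model.

    Heads that depend on num_classes / num_dn_queries (e.g. class_embed,
    label_enc) are simply skipped and trained from scratch.
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    pretrained = ckpt.get("model", ckpt)  # assumes the weights sit under a "model" key
    model_state = model.state_dict()
    compatible = {
        k: v for k, v in pretrained.items()
        if k in model_state and v.shape == model_state[k].shape
    }
    print(f"loading {len(compatible)} of {len(model_state)} tensors")
    model.load_state_dict(compatible, strict=False)
```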

I will give it a try to see whether loading a pretrained DINO model helps. Thanks.

weicheng113 commented 1 year ago

@HaoZhang534 Thanks a lot. Just to let you know, I have tried a pretrained DINO model, dino_swin_tiny_224_22kto1k_finetune_4scale_12ep.pth. It was much quicker to train for the first few epochs, and there is a slight improvement in the final mAP; this run also used num_select=40.

I made a mistake when testing the model before: I did not load the correct trained weights. The previous model actually performs quite well, as does the model newly trained from the pretrained DINO checkpoint.

Epoch 49: 100%|██████████| 912/912 [07:48<00:00,  1.95it/s, loss=2.67, lr=1e-6, loss_class=0.00316, loss_bbox=0.00502, loss_giou=0.0413, val_loss=1.710]

{'map': tensor(0.7878),
 'map_50': tensor(0.9703),
 'map_75': tensor(0.9312),
 'map_large': tensor(0.8038),
 'map_medium': tensor(0.5119),
 'map_per_class': tensor([0.8581, 0.7176]),
 'map_small': tensor(-1.),
 'mar_1': tensor(0.5966),
 'mar_10': tensor(0.8496),
 'mar_100': tensor(0.8532),
 'mar_100_per_class': tensor([0.9112, 0.7952]),
 'mar_large': tensor(0.8685),
 'mar_medium': tensor(0.5519),
 'mar_small': tensor(-1.)}

By the way, I rewrote and refactored the contrastive denoising (CDN) part and moved it into my DataCollatorForTraining, as I feel most of the CDN work can be done in the data-loader workers to improve GPU utilization. I do have one concern about CDN: when a negative noisy bounding box is generated, it could happen to be a valid positive box for another ground-truth box, although the chance may be rare.

Thanks, Cheng

HaoZhang534 commented 1 year ago

@weicheng113 You are welcome. Your concern is reasonable. It is really a problem when objects are crowded. Maybe some improvements can be made to fix this, such as only using negative examples when objects are not crowded.
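One simple, hypothetical way to guard against this (a sketch of the idea, not something in detrex) is to drop generated negatives that overlap any ground-truth box too much:

```python
import torch
from torchvision.ops import box_iou

def filter_negative_boxes(neg_boxes: torch.Tensor, gt_boxes: torch.Tensor,
                          iou_thresh: float = 0.5) -> torch.Tensor:
    """Keep only negative noisy boxes (xyxy) whose IoU with every GT box is low."""
    if gt_boxes.numel() == 0:
        return neg_boxes
    ious = box_iou(neg_boxes, gt_boxes)          # [num_neg, num_gt]
    keep = ious.max(dim=1).values < iou_thresh   # far from every ground-truth box
    return neg_boxes[keep]
```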

ma3252788 commented 1 year ago

Excuse me, could you tell me how to load a custom dataset?

I would be very grateful for any pointers.

jiachen0212 commented 1 year ago

> Hi @HaoZhang534, thanks a lot. I got it working by using MeanAveragePrecision from torchmetrics, which has a simpler interface. I will continue with training. When I get more metrics information, I will ask for your advice on fine-tuning.

Hello, I have the same question. Can you give more detail about how you fixed this?