SysCV / qdtrack

Quasi-Dense Similarity Learning for Multiple Object Tracking, CVPR 2021 (Oral)
Apache License 2.0

How to do batch inference? #88

Closed ghost closed 2 years ago

ghost commented 3 years ago

Hi. I managed to make your code work for inference with 1 image per GPU (i.e., a batch size of 1) on our custom dataset. Here is the code I am using:

    import json
    import os

    import torch
    from mmcv import Config
    from mmcv.cnn import fuse_conv_bn
    from mmcv.parallel import MMDataParallel
    from mmcv.runner import load_checkpoint
    from mmdet.datasets import build_dataloader, build_dataset
    from qdtrack.models import build_model
    # bdd100k2coco_track is the BDD100K -> COCO conversion helper from the repo's
    # dataset-conversion tools; import it from wherever it lives in your checkout

    # assumed to be defined by the caller: img_files (ordered list of frame paths),
    # root_dir / session_name (output location), and batch_size

    # get the labels
    tao_labels = []
    with open('data/tao/annotations/tao_classes.txt') as tao_file:
        for tao_name in tao_file:
            tao_labels.append(tao_name.strip('\n'))

    # create fake bdd100k data (one dummy label per frame, so the converter
    # produces a valid annotation file for test-mode loading)
    fake_data = []
    for i in range(len(img_files)):
        fake_label_dict = {'id': '00114122', 'category': 'car',
                           'attributes': {'Occluded': True, 'Truncated': False, 'Crowd': False},
                           'box2d': {'x1': 0, 'x2': 0, 'y1': 0, 'y2': 0}}
        fake_data.append({'name': img_files[i], 'labels': [fake_label_dict],
                          'video_name': 'imgs_fed_to_detector', 'index': i})

    # convert the fake bdd100k file to a coco file
    coco = bdd100k2coco_track([fake_data], True, True)
    with open(os.path.join(root_dir, session_name, 'fake_coco.json'), 'w+') as f:
        json.dump(coco, f)
    cfg = Config.fromfile('configs/tao/qdtrack_frcnn_r101_fpn_12e_tao_ft.py')

    # set cudnn_benchmark
    torch.backends.cudnn.benchmark = True
    cfg.model.pretrained = None
    cfg.data.test.test_mode = True

    # build the model and load checkpoint
    model = build_model(cfg.model, train_cfg=None, test_cfg=cfg.get('test_cfg'))
    checkpoint = load_checkpoint(model, 'weights/qdtrack_tao_20210812_221438-b6bd07e2.pth',
                                 map_location='cuda')
    model = fuse_conv_bn(model)
    model.CLASSES = checkpoint['meta']['CLASSES']

    # point the test set at the fake coco file and wrap the model for one GPU
    model = MMDataParallel(model, device_ids=[0])
    cfg.data.test['ann_file'] = os.path.join(root_dir, session_name, 'fake_coco.json')
    cfg.data.test['img_prefix'] = os.path.join(root_dir, session_name)

    # where batch_size can be changed; currently I am keeping it at 1
    dataset = build_dataset(cfg.data.test)
    data_loader = build_dataloader(
        dataset,
        samples_per_gpu=batch_size,
        workers_per_gpu=cfg.data.workers_per_gpu,
        dist=False,
        shuffle=False)
    model.eval()

    tracks = []
    for i, data in enumerate(data_loader):
        with torch.no_grad():
            result = model(return_loss=False, rescale=True, **data)

        # get all detections (unpacked here; collect them the same way as the
        # tracks below if you need them)
        tao_dets_of_all_categories = result['bbox_results']
        category_id = 0
        for tao_dets_of_specific_category in tao_dets_of_all_categories:
            for det_of_specific_category in tao_dets_of_specific_category:
                x1, y1, x2, y2, conf = det_of_specific_category
                x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
                conf = round(float(conf), 2)
            category_id += 1

        # get all tracks
        tao_tracks_of_all_categories = result['track_results']
        category_id = 0
        for tao_track_of_specific_category in tao_tracks_of_all_categories:
            for track_of_specific_category in tao_track_of_specific_category:
                track_id, x1, y1, x2, y2, conf = track_of_specific_category
                track_id, x1, y1, x2, y2 = int(track_id), int(x1), int(y1), int(x2), int(y2)
                conf = round(float(conf), 2)
                tracks.append([track_id, x1, y1, x2, y2, conf, tao_labels[category_id], i])
            category_id += 1
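
For reference, each entry appended to tracks ends up as [track_id, x1, y1, x2, y2, conf, class_name, frame_index], where frame_index is the data loader iteration i.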

I tried to make it work with a batch size of 2, but I noticed there is no way to tell which detection/track inside the result value belongs to which image in the batch passed to the model as input.

Here is my understanding of what result = model(return_loss=False, rescale=True, **data) returns. result is a dictionary with two keys: bbox_results and track_results. For both keys, the value is a list of length 482 (the exact number of classes in the TAO dataset). So, as seen in the code above, I take this list and iterate through its members. For bbox_results, each member is a numpy array of shape num_of_detections x 5, where num_of_detections covers the whole batch (?) and 5 because each detection is represented by the bbox coordinates plus a confidence value. For track_results, the shape is num_of_tracks x 6, because there is one extra value for the track id.
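
To summarize that layout as I understand it (482 is just TAO's class count):

    # my reading of one result dict from a single forward pass
    # result['bbox_results']  -> list of 482 numpy arrays (one per class),
    #                            each of shape (num_of_detections, 5): x1, y1, x2, y2, conf
    # result['track_results'] -> list of 482 numpy arrays (one per class),
    #                            each of shape (num_of_tracks, 6): track_id, x1, y1, x2, y2, conf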

If my above understanding of the structure of results is correct, then it seems to me there is no way to assign a track or detection to a specific image in a batch. Is there a way to do that in the code sample I posted above? Thanks.

OceanPang commented 3 years ago

Hi, we only support an inference batch size of 1 for now.

If you want to do batch inference, you need to pad the input videos to the same length so they can be collated, which means you should modify this function.

I recommend you use the current API, but if you still want to add batch support, I can help if you have more specific questions.

ghost commented 3 years ago

@OceanPang I do not understand what you mean by padding the input videos. I actually have a set of images extracted from a video. I am using the data loader like this:

    data_loader = build_dataloader(
        dataset,
        samples_per_gpu=batch_size,
        workers_per_gpu=cfg.data.workers_per_gpu,
        dist=False,
        shuffle=False)

But the result I get does not let me identify the tracks and detections for each image in a batch. If you could help me understand what I should modify in your code to get the expected result, that would be great. Thanks

OceanPang commented 3 years ago

It's really cumbersome. For example, say you have 2 videos with lengths of 100 and 150 frames respectively. When doing batch inference, each GPU should always hold a consistent number of videos, so you need to pad the 100-frame video to 150 frames.
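
Roughly, the idea is something like this (a minimal sketch for illustration; pad_videos and the sentinel handling are not code from this repo):

    # pad every video (an ordered list of frame paths) with a sentinel value
    # so all videos in a batch have the same number of frames
    def pad_videos(videos, pad_value=None):
        max_len = max(len(v) for v in videos)
        return [list(v) + [pad_value] * (max_len - len(v)) for v in videos]

    # e.g. two videos of 100 and 150 frames: the first gets 50 pad entries,
    # which you would skip (or mask out) at inference time
    videos = [['vid0/%06d.jpg' % i for i in range(100)],
              ['vid1/%06d.jpg' % i for i in range(150)]]
    padded = pad_videos(videos)
    assert len(padded[0]) == len(padded[1]) == 150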