SJTU-LuHe / TransVOD

This repository contains the code for the paper "End-to-End Video Object Detection with Spatial-Temporal Transformers".
Apache License 2.0

Running videos for viewable object detection #17

Open zoelevin opened 1 year ago

zoelevin commented 1 year ago

I was wondering how we can run our own videos through the model so that we get video output with bounding boxes and labels, as shown in Figure 9 of the linked paper. I saw that mmtracking has demo scripts, but I couldn't figure out how to use TransVOD in a similar way to get the results I need. Thank you.

itbergl commented 1 year ago

This may not be a perfect solution, but you can run the evaluation script and pull the predictions out of the targets and results variables.

Inside the loop in evaluate() in engine_multi.py you can append the res variable to a list, then return that list at the end of the function:

def evaluate(model, criterion, postprocessors, data_loader, base_ds, device, output_dir):
    ...
    # Collect every frame's predictions as the evaluation loop runs.
    extracted_results = list()
    for samples, targets in metric_logger.log_every(data_loader, 10, header):
        ...
        # res already exists in the loop: it maps each image_id to the
        # post-processed outputs (scores, labels, boxes) for that frame.
        res = {target['image_id'].item(): output for target, output in zip(targets, results)}
        ...
        extracted_results.append(res)

    return stats, coco_evaluator, extracted_results

Then you would unpack the extra tuple item in main.py and use pickle to save the results, for example:
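Something along these lines (a sketch only: the argument names follow the usual Deformable-DETR-style main.py, and the output filename is arbitrary):

import pickle

# Sketch: unpack the extra return value from the modified evaluate().
test_stats, coco_evaluator, extracted_results = evaluate(
    model, criterion, postprocessors, data_loader_val, base_ds, device, args.output_dir
)

# Move tensors to CPU so the pickle can be loaded without a GPU.
extracted_results = [
    {img_id: {k: v.cpu() for k, v in out.items()} for img_id, out in res.items()}
    for res in extracted_results
]

with open('extracted_results.pkl', 'wb') as f:   # filename is arbitrary
    pickle.dump(extracted_results, f)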

To actually view the video you could use OpenCV to assemble a video from the frames, using cv2.rectangle to draw the bounding boxes. A rough sketch is below.
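This is an untested sketch of that idea. It assumes you have an ordered list of frame paths and the matching image IDs, that the boxes are already in (x1, y1, x2, y2) pixel coordinates (which is what the post-processor returns), and that the filenames, fps, and score threshold are placeholders to adjust:

import cv2
import pickle

# Load the predictions saved from main.py (filename is whatever you chose).
with open('extracted_results.pkl', 'rb') as f:
    results_per_frame = pickle.load(f)   # list of {image_id: {'scores', 'labels', 'boxes'}}

frame_paths = [...]   # ordered list of frame image paths -- fill in yourself
image_ids = [...]     # image_id for each frame, in the same order

first = cv2.imread(frame_paths[0])
height, width = first.shape[:2]
writer = cv2.VideoWriter('out.mp4', cv2.VideoWriter_fourcc(*'mp4v'), 25, (width, height))

for path, image_id in zip(frame_paths, image_ids):
    frame = cv2.imread(path)
    # Find the result dict that contains this frame's predictions.
    preds = next(r[image_id] for r in results_per_frame if image_id in r)
    for score, box in zip(preds['scores'], preds['boxes']):
        if score < 0.5:   # arbitrary confidence threshold
            continue
        x1, y1, x2, y2 = [int(v) for v in box.tolist()]
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    writer.write(frame)

writer.release()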

adilsonmedronha commented 1 year ago

Thanks @itbergl. I did as you suggested (e.g. printing the bounding boxes I got):

{654: {'scores': tensor([0.2183, 0.1489, 0.1324, 0.1290, 0.0712, 0.0674, 0.0617, 0.0446,
                         0.0436, 0.0353, 0.0341, 0.0338, 0.0338, 0.0326, 0.0317, 0.0314],
                        device='cuda:0'),
       'labels': tensor([ 7, 17, 23, 19, 27, 27, 17,  7, 17,  7, 19, 23,  5,  2, 22, 17],
                        device='cuda:0'),
       'boxes': tensor([[1.7216e+00, 4.1063e+01, 2.6564e+01, 1.0398e+02],
                        [4.5522e+02, 1.1981e+02, 5.5804e+02, 1.5068e+02],
                        [4.5522e+02, 1.1981e+02, 5.5804e+02, 1.5068e+02],
                        [4.3279e+01, 1.1582e+02, 8.4566e+01, 1.7838e+02]],
                       device='cuda:0')}}

However, the dictionary does not include any information about which frame image_id 654 refers to. How can I tell which frame each image_id (e.g. 654) corresponds to, so I can match the bounding boxes to it?

itbergl commented 1 year ago

@adilsonmedronha maybe try reading the annotation .json file. You can do something like this to get the frame information:

import json

# The path is a placeholder; point it at the annotation file used for evaluation.
with open('*.json', 'rb') as f:
    data = json.load(f)

# Map image IDs to filenames. Note: COCO-style annotation files usually store
# the filename under "file_name" rather than "name"; check your annotations.
image_id_to_filename = {img["id"]: img["name"] for img in data["images"]}

That should give you a dictionary mapping image IDs to filenames, which you can use to match each set of boxes to its frame.
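For example, combining that mapping with the extracted results from earlier (the variable names are just the ones from the snippets above):

# Print which frame each set of predictions belongs to.
for res in extracted_results:
    for image_id, preds in res.items():
        print(image_id_to_filename[image_id], preds['boxes'].shape[0], 'boxes')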