cvat-ai / cvat

Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
https://cvat.ai

Support for Multi-Object Multi-Camera Tracking #4915

Open saurabheights opened 2 years ago

saurabheights commented 2 years ago


Expected Behaviour

  1. Be able to track objects across videos from multiple cameras.
  2. Cameras may or may not have intersecting fields of view.
  3. Support for automatic annotation to simplify the process.

Current Behaviour

I need to track each object visible on 4 cameras. AFAIK, CVAT doesn't support multiple cameras. To work around this, I have created a single video by tiling the videos from the 4 cameras.
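
For reference, a minimal sketch of that tiling step with OpenCV, assuming four equally sized, frame-synchronized inputs; file names are placeholders:

```python
# A minimal sketch of the tiling workaround, assuming four equally sized,
# frame-synchronized inputs; file names are placeholders.
import cv2
import numpy as np

caps = [cv2.VideoCapture(f"cam{i}.mp4") for i in range(4)]
w = int(caps[0].get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(caps[0].get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = caps[0].get(cv2.CAP_PROP_FPS)
out = cv2.VideoWriter("tiled.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                      fps, (2 * w, 2 * h))

while True:
    frames = [cap.read() for cap in caps]
    if not all(ok for ok, _ in frames):
        break  # stop at the end of the shortest video
    imgs = [f for _, f in frames]
    grid = np.vstack([np.hstack(imgs[:2]), np.hstack(imgs[2:])])  # 2x2 grid
    out.write(grid)

for cap in caps:
    cap.release()
out.release()
```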

However, to annotate faster, I would prefer some form of automatic annotation, or at least semi-automatic with minimal supervision. I have tested an object detection model as well as SiamMask, but both come with their own problems:

  1. SiamMask adds a lot of delay when pressing the next button. [I have used a chunk size of 10, so image loading shouldn't be the issue.] A larger image size might be the problem, i.e. it increases inference time, but I will need to debug further to verify that.
  2. It doesn't generate any results. When the object moves, the tracking bounding box stays in the same place. This might be due to pressing the "Go next with a step [V]" button, which is needed because stepping frame by frame takes 5 minutes per 10 frames, and objects can be stationary for hundreds of frames.
  3. What I would have preferred is for CVAT to process the next N frames with the SiamMask tracker. If a new object enters the view in those N frames, I would update the trackers and submit a reprocessing request. (A rough client-side sketch of this idea follows below.)
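
A rough client-side sketch of the batch idea in item 3, using OpenCV's CSRT tracker (requires opencv-contrib-python) as a stand-in for SiamMask; the function and its parameters are illustrative, not CVAT API:

```python
# A rough client-side sketch of the "process the next N frames" idea, using
# OpenCV's CSRT tracker (requires opencv-contrib-python) as a stand-in for
# SiamMask. The function and its parameters are illustrative, not CVAT API.
import cv2

def track_next_n(video_path, start_frame, init_box, n=100):
    """Seed a tracker at start_frame with init_box = (x, y, w, h) in ints
    and return {frame_index: box} for up to the next n frames."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
    ok, frame = cap.read()
    if not ok:
        raise ValueError(f"cannot read frame {start_frame}")
    tracker = cv2.TrackerCSRT_create()
    tracker.init(frame, init_box)
    boxes = {}
    for f in range(start_frame + 1, start_frame + 1 + n):
        ok, frame = cap.read()
        if not ok:
            break
        ok, box = tracker.update(frame)
        if not ok:
            break  # target lost: stop here and let the annotator re-seed
        boxes[f] = box
    cap.release()
    return boxes
```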

Q. Can you provide any ideas to improve this process in general, or correct me where you think I might be doing something wrong?

Another idea is to add an object detection and tracking model that doesn't require seeds, and use it instead of SiamMask to generate automatic annotations before the manual annotation pass. However, I am not sure whether tracking annotations generated by an external model can be directly ingested into CVAT.
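
On the ingestion question, one route is to write the externally generated tracks in the "CVAT for video 1.1" XML layout and upload the file through the task's "Upload annotations" action. A hedged sketch (the label must already exist in the task; attributes are omitted):

```python
# A hedged sketch of writing tracks in the "CVAT for video 1.1" XML layout.
# `tracks` maps track id -> {frame: (xtl, ytl, xbr, ybr)}; the label name
# must already exist in the task, and attributes are omitted.
import xml.etree.ElementTree as ET

def tracks_to_cvat_xml(tracks, label, path):
    root = ET.Element("annotations")
    ET.SubElement(root, "version").text = "1.1"
    for track_id, frames in tracks.items():
        trk = ET.SubElement(root, "track", id=str(track_id), label=label)
        for frame in sorted(frames):
            xtl, ytl, xbr, ybr = frames[frame]
            ET.SubElement(trk, "box", frame=str(frame),
                          xtl=f"{xtl:.2f}", ytl=f"{ytl:.2f}",
                          xbr=f"{xbr:.2f}", ybr=f"{ybr:.2f}",
                          outside="0", occluded="0",
                          keyframe="1", z_order="0")
        # CVAT marks the end of a track with an outside="1" keyframe on the
        # frame after the last visible one (skip this if the track runs to
        # the final frame of the task)
        last = max(frames)
        xtl, ytl, xbr, ybr = frames[last]
        ET.SubElement(trk, "box", frame=str(last + 1),
                      xtl=f"{xtl:.2f}", ytl=f"{ytl:.2f}",
                      xbr=f"{xbr:.2f}", ybr=f"{ybr:.2f}",
                      outside="1", occluded="0",
                      keyframe="1", z_order="0")
    ET.ElementTree(root).write(path, xml_declaration=True, encoding="utf-8")
```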


nmanovic commented 2 years ago

@saurabheights , thanks for submitting the issue.

  1. You are probably running Faster R-CNN on CPU; try it on GPU. You can also look at other detection models in our portfolio or add your own. It is true that Faster R-CNN doesn't generate tracks; for now you need to post-process its annotations (a minimal example of such post-processing is sketched below). Another approach is to run a ReID model, which combines the same object across frames into one track. You should be able to run it in CVAT: https://github.com/opencv/cvat/tree/develop/serverless/openvino/omz/intel/person-reidentification-retail-300/nuclio
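
A minimal sketch of that post-processing: greedy IoU linking of per-frame detector boxes into tracks. The threshold and data layout are assumptions:

```python
# A minimal sketch of linking per-frame detector output into tracks by
# greedy IoU matching between consecutive frames (a simpler alternative
# to the ReID linking mentioned above).
def iou(a, b):
    # boxes are (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def link_detections(frames, iou_thr=0.5):
    """frames: list of per-frame box lists. Returns {track_id: {frame: box}}."""
    tracks, active, next_id = {}, {}, 0  # active: track_id -> last box
    for f, boxes in enumerate(frames):
        unmatched, new_active = list(boxes), {}
        for tid, last in active.items():
            if not unmatched:
                break
            best = max(unmatched, key=lambda b: iou(last, b))
            if iou(last, best) >= iou_thr:  # extend the existing track
                unmatched.remove(best)
                tracks[tid][f] = best
                new_active[tid] = best
        for box in unmatched:  # each leftover detection starts a new track
            tracks[next_id] = {f: box}
            new_active[next_id] = box
            next_id += 1
        active = new_active
    return tracks
```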

CVAT doesn't have batch processing for now. I remember that users complained about that.

  2. Again, try to run SiamMask on GPU; it will probably help. Remember that CVAT doesn't track resources, so if you have a small amount of GPU memory, loading more than one DL model on one machine will lead to problems.

Also, we have the OpenCV tracker and https://github.com/opencv/cvat/pull/4886 from @dschoerk. We will merge it soon.

  3. In the past we solved very similar problems. We did it in the following way:
    • Annotate tracks using interpolation for each video independently. In our case this works very well.
    • Now we have N1, N2, N3, N4 tracks across the videos. Run a ReID algorithm on ~N1×N2×N3×N4 combinations and filter matches by a threshold (see the sketch after this list). In general the number of tracks is small; let's say Nx ~ 100 (and that is a lot). Thus you will get 100,000,000 combinations. You can use known information to reduce the number of pairs. The ReID algorithm should reduce the number by at least 1000×, leaving ~100,000 pairs of tracks that probably match each other.
    • Generate an image that shows 5 frames from the first track and 5 frames from the second track. You will get ~100,000 such images in this case.
    • Now classify them manually. The usual speed for classifying such pairs is 1000 images per hour, so about 100 hours of work, i.e. approximately two working weeks, to classify all pairs with excellent quality. By then you will probably have matched all objects across all videos.
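
A hedged sketch of the ReID filtering step above; `embed(crop)` stands in for any ReID embedding model (e.g. the person-reidentification model linked earlier) and is a hypothetical callable, not a CVAT API:

```python
# A hedged sketch of the cross-video ReID filtering step; `embed(crop)` is a
# hypothetical callable standing in for any ReID embedding model.
import numpy as np

def track_embedding(crops, embed):
    # average the embeddings of a few sampled crops from one track
    vecs = np.stack([embed(c) for c in crops])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs.mean(axis=0)

def candidate_pairs(tracks_a, tracks_b, embed, threshold=0.7):
    """tracks_*: {track_id: list of crops}. Returns cross-video pairs whose
    cosine similarity clears the threshold, best matches first."""
    emb_a = {t: track_embedding(c, embed) for t, c in tracks_a.items()}
    emb_b = {t: track_embedding(c, embed) for t, c in tracks_b.items()}
    pairs = []
    for ta, va in emb_a.items():
        for tb, vb in emb_b.items():
            sim = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
            if sim >= threshold:
                pairs.append((ta, tb, sim))
    return sorted(pairs, key=lambda p: -p[2])
```

The surviving pairs can then be rendered side by side (e.g. np.hstack of five crops from each track) for the manual classification step.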

The number of steps is large, but this will give you the best quality.

A related article: https://arxiv.org/pdf/2003.07618.pdf

dschoerk commented 2 years ago

@saurabheights I noticed you are using v2.1.0. When using any single-object tracker you should experiment with the develop branch; until recently there was an off-by-one error in the frame number of the prediction.

https://github.com/opencv/cvat/issues/4870 (the relevant commit is linked in this issue; use any version newer than that)

> It doesn't generate any results. When the object moves, the tracking bounding box stays in the same place. This might be due to pressing the "Go next with a step [V]" button, which is needed because stepping frame by frame takes 5 minutes per 10 frames, and objects can be stationary for hundreds of frames ... What I would have preferred is for CVAT to process the next N frames with the SiamMask tracker. If a new object enters the view in those N frames, I would update the trackers and submit a reprocessing request.

I would like to see this as a new feature! Currently the single-object tracking is stateless on the nuclio side, which means the tracker state is sent to nuclio for each frame. Without having tested it, I think this is a significant computation overhead. At some point I had a tracker state of ~3 MB for the TransT implementation, but I haven't investigated further. For Siamese trackers like SiamMask, and also TransT, this state at least includes the cropped search region and the template image in some shape or form. Just an FYI: TransT is slightly slower for me than SiamMask (using an RTX 3090), but is far more accurate in my use case of pedestrian tracking.
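
A back-of-the-envelope sketch of that overhead, assuming the state is pickled and base64-encoded into a JSON request body per frame; the real wire format of CVAT's nuclio functions may differ:

```python
# A back-of-the-envelope sketch of the per-frame overhead: pickling a
# stand-in tracker state (cropped search region + template) and
# base64-encoding it into a JSON body, as a stateless per-frame request
# would. This only illustrates the cost of shipping multi-MB state on
# every frame; it is not CVAT's actual wire format.
import base64, json, pickle
import numpy as np

state = {
    "search_region": np.zeros((255, 255, 3), dtype=np.uint8),  # stand-in crop
    "template": np.zeros((127, 127, 3), dtype=np.uint8),
    "score": 0.9,
}
payload = base64.b64encode(pickle.dumps(state)).decode("ascii")
body = json.dumps({"state": payload})
print(f"per-frame request size: {len(body) / 1e6:.2f} MB")
# even this small stand-in is ~0.3 MB, i.e. roughly 10 MB/s of HTTP traffic
# at 30 fps per tracker; a ~3 MB state would be an order of magnitude more
```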

A neat benefit of Siamese trackers over object detectors is that they are typically class-agnostic.