allo- commented 4 years ago

Even with tensorflow-gpu, the program uses the CPU (one core) quite much.

Find out what are the slow parts and how good they can be optimized:

Probably quite a few numpy operations could be in-place, but in general I would not expect the numpy operations to be slow for arrays with shape (1280, 720, 4).
Passing around the frame should be a reference and not too slow by itself.
How good is the performance of the cv2 operations?
Is there a bottleneck between tensorflow and cv2/numpy?
Is there data that should be global but is stored in each iteration of the main loop?

schwab commented 4 years ago

Some performance stats.... Running this on a 32 core thread ripper, and it's been up for about 10 hours with a web stream and a simple background image. Currently the process is broken up across ~20 subprocesses. When the process is running it takes the overall system load average from a baseline of around 1.79 to just over 3.0. This is with about 400MB of RAM. I'd estimate then that overall we are using about 1.2 to 1.3 cores at 100%. With that load the delay is just under 1 second.

If we can improve the delay time and bring it down closer to .5 second, even at the cost of more cpu usage I think it would be a good tradeoff especially for those that have cpu bandwidth to spare. Perhaps there could even be a cpu utilization/delay time tradeoff config value that allows allocating more resources to improve the video delay.

allo- commented 4 years ago

I get about 0.1s per mainloop iteration on a Ryzen 8-Core CPU with 100% load on one core.

Evaluating the network seems to be 0.01-0.02s, getting the images and scheduling them seems to be fast. Probably body-pix + filters just sum up. Maybe the segmentation for the next image could run in another thread before the current one is scheduled?

In principle you could process one image per core by just grabbing the next frame when the first core is idle again

For limiting the CPU usage, we could limit the framerate, but it should probably be at least 10 for a good stream.

schwab commented 4 years ago

How are you measuring the loop iteration? In my case I just hold up a stopwatch in front of the camera and observe how far apart the video is from the real one.

allo- commented 4 years ago

timestamp = time.time()
# code to benchmark
print(time.time() - timestamp)

This may not be the best profiler, but its enough to get a rough impression what may be the slow parts.

The latency between webcam and fake webcam is another issue, but for finding the slow parts the most important part is what happens between cap.read and fakewebcam.schedule_frame.

schwab commented 4 years ago

Right, so correctly stated, my latency is about a second. Just reviewed that mainloop code between cap.read and schedule_frame. I see there is a lot going on per frame there.

allo- commented 4 years ago

Some thoughts about performance

I think copying a numpy array is probably fast. Numpy/Scipy do much in C and are in general optimized for performace I would think that the copy operation is implemented using something like memcpy in the backend.
.astype is probably slow.
Adding dimensions may be slow.
OpenCV operations seem to be much slower than numpy
Tensorflow operations are fast, but converting between tensorflow tensors and numpy arrays is probably slow
Operations like cv2.dilate(mask, np.ones((dilate_value, dilate_value), np.uint8)) instantiate a new numpy array, convert an existing one to uint8 and require converting back to numpy afterwards.
I am not sure if frame grabbing and frame scheduling may be slow. In basic tests it seemed okay, but maybe it is possible to already start processing the next frame before the current one is scheduled. I am not sure about when it should be grabbed, as this changes which frames are dropped in between.

Nerdyvedi commented 3 years ago

Using another thread for grabbing frames slightly increases the speed.

allo- commented 3 years ago

@Nerdyvedi Did you test it in some way?

I think it will have quite a bit of issues. When grabbing the next frame at the beginning of the loop you are dropping the in between frames automatically.

When grabbing in an own thread you need a buffer. When the filter thread is ready to process the next frame, the one in the buffer is already stale, so you need to constantly fill a stack and drop the older frames yourself when processing the next one from the top of the stack.

The benefit of avoiding the delay to grab a frame is probably not worth the issues with buffering and synchronization, if you don't have numbers that it will be a lot faster.

allo- commented 3 years ago

Maybe no stack is needed, but just double buffering and correct locking. I am still not convinced if this is the most important part that should be optimized at cost of complexity.

Nerdyvedi commented 3 years ago

@allo- I employed the following tricks to increase the speed, It definitely results in better performance

Using a different thread to grab frame
Replacing .astype('uint8') with .view('uint8') [:,:]
Replacing the loop on output tensor with the fixed index for idx,name in enumerate(output_tensor_names): if name == "float_segments:0": segment_logits = results[idx] with segment_logits = results[6] 6 is the idx for float_segments tensor

allo- commented 3 years ago

1) How do you handle that you can grab several frames before the first is sent to the fake cam? How do you select the next one?

2) Sounds good. The most important point that is not that clear in the code is when a frame should be copied or not. A view can speed up processing frames that are not changed,

3) I think this is needed because mobilenet and resnet are different. But this shouldn't be a bottleneck anyway.

Nerdyvedi commented 3 years ago

Using a separate thread allows frames to be read continuously from the I/O thread, all while our root thread processes the current frame. Once the root thread has finished processing its frame, it grabs the current frame from the I/O thread. But now that I think of it , I am not handling buffer size this way.
I used it only once, In the following line. mask = (mask * 255).astype(np.uint8)
It's faster than looping on the output tensors, for resnet it was 6th index, Similarly we can find it for Mobilenet. But yeah, minimal increase in performance.

allo- commented 3 years ago

1) I guess one would only need two frames. The last one and the one that is currently grabbed and some intelligent locking. 2) Looks good there. I thought about the filter loop, which is more complex. Some plugins convert back and forth and it matters if the plugin returns a copy or not. It is not that clear at which points you are allowed to modify the input. In the end this should be the same optimization, onöly copying the frame when needed, but on the other hand preventing plugins from messing with the input of another one. 3) It's a list of names after all. On the other hand, you cannot just plug in a new model anyway (e.g. without new preprocessing), so we can hardcode the indices as well.

Nerdyvedi commented 3 years ago

Should I create a PR for 2nd and 3rd point?

Nerdyvedi commented 3 years ago

@allo- Also, Would be great if you could share some ideas on how we can use threading ? Like, Some locking methods that you think can work for this problem

allo- commented 3 years ago

I am not that convinced of threading. One would like to have the most recent frame when the current one is finished (discarding all in-between frames) to minimize latency, but to have it you need to capture frames as fast as possible. Assuming the capturing is a bottleneck, you would need to read the last frame and not the currently captured one (which is still unfinished, when reading it is costly) and prevent the race condition of the frame becoming complete and shifting the buffer.

It may be easier at the cost of latency to capture one frame and stop until the current one is processed and then process the (then already quite old) frame and start capturing the next in the background thread.

In both cases there needs to be a lock, for when capturing is slower than processing, to wait for the next frame.

In my experiments, capturing with mjpeg is limited by the webcam framerate and not by reading the frame. Capturing with h264 mode may be a bit slower, but mjpeg is probably optimal for frame by frame processing anyway (and for many cams the default or only format).

And python threads have some extra gotchas with the GIL and similar issues.

So I am really skeptic if shaving off like 10ms capturing is worth it when the model evaluation takes 100ms and filters 200ms.

allo- commented 3 years ago

I created a branch for benchmarking: https://github.com/allo-/virtual_webcam_background/tree/benchmark_webcam_fps

Set the size options, the (max) fps option and the mjpeg option to test the speed of the capturing itself.

I get for 800x600 with 30 fps supported by cam 15 FPS with buffering (comment the cv2.CAP_PROP_BUFFERSIZE line) in the benchmark. This looks suspiciously like 30/2 and I wonder if qv4l2 is lying to me that the cam support 30 fps with this size. I need to test more resolutions.

allo- commented 3 years ago

When I lift the webcam cover and the room is bright enough I get the full framerate for HD easily. The cam seems to be lowering the fps when the image is not bright enough. So I would say capturing is probably no bottleneck.

It may be interesting not to measure the fps in a loop but the time for grabbing a single frame to see if it can be read instantly from a buffer (cam, linux or cv) or if it blocks for one frame, but I still think this is not a bottleneck with high priority.

It could be interesting to parallelize filters (e.g. input (blurring) and foreground (e.g color effects) filters. Especially blur is quite cpu intensive. Here I consider optimizing blur by implementing it in numpy to avoid converting to and from opencv. numpy/scipy should probably be able to do this real fast.

Nerdyvedi commented 3 years ago

@allo- Do you think we should make the changes I suggested(2nd and 3rd not threading). These definitely do improve the performance.

allo- commented 3 years ago

Yes.

Do you allow us to use your patches under MIT license? See #40 for the current license discussion. As long as this is not decided I would like to keep the core contributions under a license that allows for a license change.

allo- commented 3 years ago

@Nerdyvedi The .view patch causes an exception when using a 800x600 JPG image as real_video_device:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 800 and the array at index 1 has size 6400

Can you have a look why this does not work? I thought the [:,:] may be reshaping the tensor, but without it the problem persists. I need to check if the mask may have more layers (part segmentations) or if the tensor there is always the full segmentation mask.

janci007 commented 3 years ago

After profiling I found out the most time is spent in calculation of part_masks and heatmap_masks and their inputs scaled_part_heatmap_scores and scaled_heatmap_scores. Those are not needed for many filters, so I suggest disabling their calculation if not required by current filter setup (or at least option to disable them in config). Just commenting out the corresponding lines doubled the frame rate for me (on basic background replacement setup).

allo- commented 3 years ago

This would be a good idea. I just wonder if this shouldn't be the fastest part when there is no bottleneck. I guess opencv, tensorflow and numpy would be able to fastly compute something like tf.greater for huge arrays, but passing numpy arrays to tensorflow is probably the slow part.

Do you have a good setup for benchmarking or does it take much time for you as well? Looking at the code (currently not having the time for debugging it), I think some parts could be done in tensorflow only:

bodypix_functions.py seems to use numpy only for np.floor. This can probably be replaced easily.

I think I used the pattern

    part_masks = to_mask_tensor(scaled_part_heatmap_scores, 0.999)
    part_masks = np.array(part_masks)

to get the tensor as numpy array, but one may be able to do more calculations in tensorflow (and so possibly on the GPU when one has setup CUDA) before converting to numpy.

After converting the tensors to numpy for operations like averaging, they are used with opencv for dilation/erosion/bluring.

So a fast fix would be to disable computing unnecessary image layers. A good feature may be if plugins could request which layers they need. This should also account for different models providing different layers (in different order).

And there is a new faster mobilenet model, that is not supported yet. This will probably make segmentation much faster (and I guess not provide so many part masks).

janci007 commented 3 years ago

I just added timestamps with time.time() to every code section and printed theoretical FPS (1/(endtime-starttime)) that section produces. Will attach the code when I will be back on my pc.

allo- commented 3 years ago

https://ai.googleblog.com/2020/10/background-features-in-google-meet.html

52

janci007 commented 3 years ago

The code mentioned is in #61

allo- / virtual_webcam_background

Performance optimization #20

52