PeterL1n / BackgroundMattingV2

Real-Time High-Resolution Background Matting

Why is foreground prediction necessary? #171

Closed max810 closed 2 years ago

max810 commented 2 years ago

Hello, First of all, good job with the paper! Nicely written and explains a lot of concepts pretty well. However, I am still a little puzzled about why is predicting foreground (or foreground residual in this case) necessary in the pipeline. Consider this example from the demo: image

For the composition (the final step), why do we use the pixels from the upsampled foreground and not from the original image? They should be identical anyway, because we explicitly train the coarse foreground prediction to replicate the pixels from the original image (in the alpha mask region), per formula 2: [screenshot of formula 2]

A possible answer is mentioned in Issue #19, but it's unclear to me what the "background color spill onto partial-opacity hairs and edges" looks like and how the foreground prediction branch mitigates this issue.

I would greatly appreciate an explanation and/or just a side-by-side comparison of 2 images (original vs predicted foreground).

Thank you in advance!

PeterL1n commented 2 years ago

https://github.com/PeterL1n/RobustVideoMatting/issues/42

PeterL1n commented 2 years ago

The foreground is only equal to the source on regions where alpha = 1. But for semitransparent regions, it is not, because part of the original background will leak through. These regions are usually hair, silhouette, and motion blur.
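To make the spill concrete, here is a minimal single-pixel sketch (not from the repo; all values are illustrative) of what happens if you reuse the source pixel as the foreground when compositing a semi-transparent hair pixel onto a new background:

```python
import numpy as np

# Illustrative semi-transparent hair pixel (alpha = 0.5), values in [0, 1] RGB.
alpha = 0.5
F = np.array([0.8, 0.6, 0.4])      # true foreground color (hair)
B_old = np.array([0.0, 1.0, 0.0])  # original (green) background
B_new = np.array([0.0, 0.0, 0.0])  # new background to composite onto

# What the camera actually recorded: foreground mixed with the OLD background.
I = alpha * F + (1 - alpha) * B_old

# Using the source pixel as the "foreground": the old green background leaks in.
comp_from_source = alpha * I + (1 - alpha) * B_new

# Using the (true or predicted) foreground: no spill from the old background.
comp_from_foreground = alpha * F + (1 - alpha) * B_new

print(comp_from_source)      # retains a green tint from B_old
print(comp_from_foreground)  # clean mix of hair color and the new background
```

This is the "background color spill" referred to in Issue #19: wherever 0 < alpha < 1, the source pixel still carries a fraction of the old background color.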

max810 commented 2 years ago

But won't the foreground be trained to be the same as the original pixels for all alpha > 0, not just alpha = 1?

PeterL1n commented 2 years ago

No, it won't. The dataset provides the ground-truth foreground F and alpha a. We composite them onto a background to synthesize the source input I = aF + (1 - a)B. The model predicts foreground F' and alpha a'. The loss on F' is computed against the ground-truth F, not against the source I.
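A rough PyTorch-style sketch of this supervision setup (assumed names, not the repository's actual training code): the input is synthesized from ground-truth F and alpha, but the foreground loss compares the prediction to ground-truth F, restricted to the region where alpha > 0.

```python
import torch

def synthesize_input(F_gt, alpha_gt, B):
    """Composite ground-truth foreground onto a background: I = aF + (1 - a)B."""
    return alpha_gt * F_gt + (1 - alpha_gt) * B

def foreground_loss(F_pred, F_gt, alpha_gt):
    """L1 loss on the predicted foreground, only where ground-truth alpha > 0."""
    mask = (alpha_gt > 0).float()
    return torch.abs(mask * (F_pred - F_gt)).mean()

# Toy tensors: batch of 2 RGB images, 64x64 (shapes are illustrative).
F_gt = torch.rand(2, 3, 64, 64)
alpha_gt = torch.rand(2, 1, 64, 64)
B = torch.rand(2, 3, 64, 64)

I = synthesize_input(F_gt, alpha_gt, B)
# The model would predict F_pred and alpha_pred from (I, B); a stand-in is used here.
F_pred = torch.rand(2, 3, 64, 64)
loss = foreground_loss(F_pred, F_gt, alpha_gt)
```

Because F_gt differs from I wherever 0 < alpha < 1, the model is pushed to output the uncontaminated foreground color in those regions rather than copying the source pixel.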

max810 commented 2 years ago

Oh, so we use the properly-extracted foregrounds from the datasets and the model directly learns to remove the background in those situations you described (hair strands, motion blur, etc.). I missed that, sorry.

Thanks for the explanation!