The depths of the input views are unknown; we only input RGB source views.
We warp the features of the input views to the target viewpoint to construct a cost volume, and then regress the depth at the target viewpoint. With the estimated depth at the target view, we can sample points for rendering.
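For concreteness, here is a minimal sketch of that pipeline in PyTorch. Every name, tensor shape, and hyperparameter below (the 4x4 projection-matrix convention, the depth hypotheses, the sampling radius) is an illustrative assumption rather than the repository's actual API, and the learned 2D feature extractor plus the 3D CNN that normally regularizes the cost volume are omitted.

```python
import torch
import torch.nn.functional as F

def warp_src_to_tgt(src_feat, src_proj, tgt_proj_inv, depths):
    """Plane-sweep warp of one source feature map into the target view.

    src_feat:     (B, C, H, W) source-view features
    src_proj:     (B, 4, 4) source K @ [R|t], padded to 4x4 (assumed convention)
    tgt_proj_inv: (B, 4, 4) inverse of the target projection matrix
    depths:       (B, D) depth hypotheses, expressed in the target view
    returns:      (B, C, D, H, W) source features warped to each depth plane
    """
    B, C, H, W = src_feat.shape
    D = depths.shape[1]

    # Homogeneous pixel grid of the target image, shape (3, H*W).
    y, x = torch.meshgrid(
        torch.arange(H, dtype=src_feat.dtype, device=src_feat.device),
        torch.arange(W, dtype=src_feat.dtype, device=src_feat.device),
        indexing="ij",
    )
    pix = torch.stack([x, y, torch.ones_like(x)], dim=0).reshape(3, -1)

    # [u*d, v*d, d, 1] lifts a target pixel to 3D at depth d; the combined
    # matrix then projects it into the source image (positive depths assumed).
    proj = src_proj @ tgt_proj_inv                                   # (B, 4, 4)
    pts = pix.unsqueeze(0).unsqueeze(1) * depths.view(B, D, 1, 1)    # (B, D, 3, H*W)
    ones = torch.ones(B, D, 1, H * W, dtype=pts.dtype, device=pts.device)
    src_pix = proj.view(B, 1, 4, 4) @ torch.cat([pts, ones], dim=2)  # (B, D, 4, H*W)
    src_xy = src_pix[:, :, :2] / src_pix[:, :, 2:3].clamp(min=1e-6)

    # Normalize to [-1, 1] and bilinearly sample the source features.
    gx = 2.0 * src_xy[:, :, 0] / (W - 1) - 1.0
    gy = 2.0 * src_xy[:, :, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B, D * H, W, 2)
    warped = F.grid_sample(src_feat, grid, align_corners=True, padding_mode="zeros")
    return warped.view(B, C, D, H, W)

def regress_depth(warped_feats, depths):
    """Variance cost over views, then a soft-argmin over depth hypotheses.

    warped_feats: (V, B, C, D, H, W), stacked over V source views
    depths:       (B, D)
    returns:      (B, H, W) regressed depth at the target view
    """
    var = warped_feats.var(dim=0, unbiased=False)   # (B, C, D, H, W)
    cost = var.mean(dim=1)                          # (B, D, H, W); a 3D CNN would normally refine this
    prob = torch.softmax(-cost, dim=1)              # low variance = good photo-consistency
    return (prob * depths.view(depths.shape[0], -1, 1, 1)).sum(dim=1)

def depth_guided_samples(depth, n_samples=8, radius=0.05):
    """Sample a handful of depths in a small interval around the regressed
    depth, so each ray needs only a few points for volume rendering.
    `n_samples` and `radius` are arbitrary illustrative values."""
    offsets = torch.linspace(-radius, radius, n_samples, device=depth.device)
    return depth.unsqueeze(1) + offsets.view(1, -1, 1, 1)  # (B, n_samples, H, W)
```

The soft-argmin (a softmax expectation over the depth hypotheses) keeps the depth estimate differentiable, so the depth that guides point sampling can be trained end-to-end from the rendering loss.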
The depths of the input views are already known. But for a target view, how do you obtain the depth of the single sample point used for volume rendering?