fudan-zvg / PVG

Periodic Vibration Gaussian: Dynamic Urban Scene Reconstruction and Real-time Rendering
https://fudan-zvg.github.io/PVG/
MIT License

Info regarding velocity implementation #31

Open arstek131 opened 4 weeks ago

arstek131 commented 4 weeks ago

Hi, thank you for your nice work. I have two main questions regarding the concept of velocity in your paper and its implementation.

1) Could you elaborate on the mean when it is time-dependent? $\tilde{\mu}(t) = \mu + \frac{l}{2\pi} \cdot \sin\left( 2\pi \frac{t - \tau}{l} \right) \cdot v$ Why did you model it with sin()? What is the reason behind this choice? Could you also explain $v = \left. \frac{d\tilde{\mu}(t)}{dt} \right|_{t=\tau}$ in more detail? I understand it is the instantaneous velocity, but how is it interpreted in the code? What is its unit of measure?

2) Regarding the code implementation: in `train.py`, at each iteration you compute the velocity as `v = gaussians.get_inst_velocity` and pass it to the render function via `render_pkg = render(viewpoint_cam, gaussians, args, background, env_map=env_map, other=other, time_shift=time_shift, is_training=True)`. Once rendering is complete, you recover the rendered velocity as `feature = render_pkg['feature'] / alpha.clamp_min(EPS)` followed by `v_map = feature[1:]`.

And v_map is a torch tensor with 3 channels; I suppose each channel describes the instantaneous velocity of that point along the x, y, and z directions respectively. To what values is this v_map normalized? What is its unit of measure?

Thanks

Fumore commented 4 weeks ago

Hi, sorry for the confusion. 1. We use sin() because this kind of periodic function can model both dynamic and static content well (when $\beta$ is small, the point moves linearly and fades away, while when $\beta$ is large, it tends to stay static around $\mu$). The unit of $v$ is $m/s$ and the unit of $l$ is $s$. We parameterize $v$ in `gaussians._velocity`. There is some naming confusion around `gaussians.get_inst_velocity`: it actually returns $\bar{v}$, not the instantaneous velocity at a certain time.

2. Your understanding of v_map is right. It is normalized by the accumulated opacity alpha, which is a dimensionless quantity.
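To sanity-check the units discussed above, the time-dependent mean and its derivative at $t = \tau$ can be sketched in a few lines (a minimal standalone version; `vibrating_mean` is a hypothetical name, not a function from the repo):

```python
import torch

def vibrating_mean(mu, v, t, tau, l):
    # mu~(t) = mu + (l / 2pi) * sin(2pi (t - tau) / l) * v.
    # Near t = tau the sine is approximately linear, so the derivative
    # at t = tau is exactly v: v is the instantaneous velocity at the
    # Gaussian's life peak tau, in m/s when mu is in metres and t, l
    # are in seconds.
    phase = torch.as_tensor(2 * torch.pi * (t - tau) / l)
    return mu + (l / (2 * torch.pi)) * torch.sin(phase) * v
```

A numerical derivative of `vibrating_mean` at `t = tau` recovers `v`, which is the interpretation of $v = \left. \frac{d\tilde{\mu}(t)}{dt} \right|_{t=\tau}$ asked about above.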
arstek131 commented 4 weeks ago

Hi, thank you for the clarifications! So if I understand correctly, `gaussians.get_inst_velocity` is $\bar{v}$ (the average velocity), which the paper defines as $\bar{v} = v \cdot \exp(-\frac{\rho}{2})$, while $v$ is `gaussians._velocity`, which the paper defines as the instantaneous velocity $v = \left. \frac{d\tilde{\mu}(t)}{dt} \right|_{t=\tau}$.
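For reference, the damping quoted above can be sketched as follows (a hypothetical helper, not the repo's implementation; $\rho$ is the staticness-related parameter from the paper):

```python
import torch

def average_velocity(v, rho):
    # bar_v = v * exp(-rho / 2): larger rho (a more static Gaussian)
    # shrinks the velocity toward zero, so bar_v serves as a
    # staticness-aware scene-flow estimate rather than the raw v.
    return v * torch.exp(-rho / 2)
```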

So v_map represents the rendered average velocity and not the instantaneous one?

By debugging the code I see that `gaussians._velocity` is a tensor of shape torch.Size([2146010, 3]) (which I think stores, for each Gaussian point, the velocity in x, y, z). Now, for each frame in the scene I have the ground-truth (instantaneous) velocity of the objects available, represented as a torch tensor of shape (H, W), where each pixel holds a velocity value (basically I have the velocity map).

Do you have any suggestion about which velocity from the model I should use, and how? My goal is to supervise the predicted velocity with the ground truth one I have. If you feel more comfortable, you can PM me. Thanks!

Fumore commented 3 weeks ago

Okay, I think using the velocity map that is used in temporal smoothing is more reasonable, i.e. the map of $\bar{v}$, because we actually use $\bar{v}$ as an estimated 3D scene flow for self-supervision (temporal smoothing).
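A minimal sketch of such supervision, assuming the GT map stores per-pixel speed magnitudes as described earlier in the thread (`velocity_supervision_loss` is a hypothetical helper, not part of the repo):

```python
import torch

def velocity_supervision_loss(v_map, gt_speed, mask):
    # v_map: (3, H, W) rendered average-velocity map (bar_v per pixel).
    # gt_speed: (H, W) ground-truth speed magnitudes in m/s.
    # mask: (H, W) bool, valid pixels (e.g. object masks).
    # Compare the predicted speed (L2 norm over the xyz channels)
    # against the GT magnitude with an L1 penalty.
    pred_speed = v_map.norm(dim=0)
    return (pred_speed - gt_speed).abs()[mask].mean()
```

Comparing magnitudes sidesteps the fact that the GT map here carries no direction; with directional GT one could instead penalize the per-channel difference.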

arstek131 commented 3 weeks ago

Great, thanks for your reply.

When the velocity and other features are passed to the rasterizer, what is the pixelwise meaning of the values in the rasterized velocity image (v_map)? As far as I have understood, they don't represent velocity values in $m/s$; how should I interpret them?

Thanks

Fumore commented 3 weeks ago

Why wouldn't the v_map indicate velocity in $m/s$ (per channel)? Roughly speaking, each pixel represents the expectation of velocity along the corresponding ray (using the alpha-blending weights as the probability distribution).
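In other words, per ray the rasterizer effectively computes an alpha-blending expectation. A toy standalone version of that per-ray computation (not the actual CUDA rasterizer) could look like:

```python
import torch

def expected_ray_velocity(velocities, alphas):
    # velocities: (N, 3) velocity of each Gaussian hit along the ray (m/s),
    # front to back. alphas: (N,) opacity of each Gaussian at this pixel.
    # w_i = a_i * prod_{j<i}(1 - a_j): standard alpha-compositing weights.
    trans = torch.cat([torch.ones(1), torch.cumprod(1 - alphas, dim=0)[:-1]])
    weights = alphas * trans
    accum = weights.sum().clamp_min(1e-5)          # accumulated opacity
    blended = (weights[:, None] * velocities).sum(dim=0)
    # Dividing by the accumulated opacity turns the blended sum into an
    # expectation, so the result keeps the m/s units of the inputs.
    return blended / accum
```

This mirrors the `feature / alpha.clamp_min(EPS)` normalization quoted from `train.py` earlier in the thread: the division is what makes each pixel an expectation rather than an opacity-weighted sum.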

arstek131 commented 3 weeks ago

Ok, but how should I interpret this pixel representation? For example, is it possible to recover the velocity, in $m/s$, from the rendered v_map? If yes, how?

Fumore commented 3 weeks ago

There are several options: project the objects' velocities together with their masks onto the camera images to get a GT v_map label; use the depth map to back-project the v_map into 3D space as a point cloud; or directly supervise the PVG points.
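The depth-based option might be sketched as follows (a hypothetical helper, assuming a simple pinhole camera with intrinsics `K`; not code from the repo):

```python
import torch

def backproject_vmap(v_map, depth, K):
    # Lift the rendered v_map into 3D using the depth map, producing a
    # per-pixel scene-flow point cloud that can be compared with GT 3D
    # velocities. v_map: (3, H, W), depth: (H, W), K: (3, 3) intrinsics.
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)
    # x_cam = depth * K^{-1} [u, v, 1]^T, the usual pinhole back-projection.
    pts = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)
    flows = v_map.reshape(3, -1)                    # velocity per pixel
    return pts.T, flows.T                           # (H*W, 3) each
```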