cvlab-columbia / zero123

Zero-1-to-3: Zero-shot One Image to 3D Object (ICCV 2023)
https://zero123.cs.columbia.edu/
MIT License

R matrix from gradio_new.py #52

Closed. cwwjyh closed this issue 1 year ago.

cwwjyh commented 1 year ago

I have checked Appendix Section A of the paper regarding the camera coordinate system. In gradio_new.py, is the rotation matrix (camera_R) in w2c (world-to-camera) format? In gradio_new.py you use camera_R to obtain the camera extrinsics, but the translation matrix T of the camera is not given. So I wanted to use PyTorch3D's look_at_view_transform to compute the extrinsic matrix [R|t]:

R, T = look_at_view_transform(dist=radius, elev=90 - polar_deg, azim=azimuth_deg, up=((0, 1, 0),))

but I got a different result. In the image below, R_jisuan is computed with PyTorch3D and camera_R_ori is the result of your code. How can I compute the same R with PyTorch3D as in your code? Or could you advise how to use your code to compute the camera's translation matrix T?

(screenshot comparing R_jisuan and camera_R_ori)

Looking forward to your reply!

ruoshiliu commented 1 year ago

The matrix calculations in gradio_new.py are only for angle-visualization purposes and are independent of the zero123 inference code, so I wouldn't use one to understand the other.

Assuming you are asking about zero123's inference conditioning, here's the code that converts a pair of camera matrices (input view and target view) into the 4-dimensional conditioning vector described in Appendix Section A: https://github.com/cvlab-columbia/zero123/blob/f70ea8c26af7943494f75314df054eb843999add/zero123/ldm/data/simple.py#L257-L272

Hope it makes sense.

P.S. Camera matrices over-parameterize the transformation, so there will be more than one way to represent an RT operation. I think a better way to compare RTs is to visualize the cameras or to convert them to Euler angles or axis-angle first.
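For reference, the logic in the linked simple.py lines can be sketched roughly as follows. This is a simplified NumPy paraphrase, not the verbatim repo code, and the helper names mirror but are not guaranteed to match the originals:

```python
import math
import numpy as np

def cartesian_to_spherical(xyz):
    """Return (polar theta, azimuth, radius) for a 3D point; theta is measured from +z."""
    xy = xyz[0] ** 2 + xyz[1] ** 2
    radius = math.sqrt(xy + xyz[2] ** 2)
    theta = math.atan2(math.sqrt(xy), xyz[2])
    azimuth = math.atan2(xyz[1], xyz[0])
    return theta, azimuth, radius

def get_T(target_RT, cond_RT):
    """Relative-viewpoint conditioning: (cond_RT, target_RT) -> d_T.

    Both inputs are 3x4 world-to-camera [R|t] matrices; the camera center
    in world coordinates is C = -R^T t. The deltas are taken in spherical
    coordinates, and the azimuth delta is encoded as sin/cos so the model
    never sees the 0/2*pi wrap-around discontinuity.
    """
    centers = []
    for RT in (target_RT, cond_RT):
        R, t = RT[:3, :3], RT[:3, 3]
        centers.append(-R.T @ t)
    theta_t, az_t, r_t = cartesian_to_spherical(centers[0])
    theta_c, az_c, r_c = cartesian_to_spherical(centers[1])
    d_theta = theta_t - theta_c
    d_azimuth = (az_t - az_c) % (2 * math.pi)
    d_radius = r_t - r_c
    return np.array([d_theta, math.sin(d_azimuth), math.cos(d_azimuth), d_radius])
```

For example, two identity-rotation cameras whose centers sit at azimuth 0 and azimuth 90 degrees on the unit sphere produce d_T = [0, 1, 0, 0].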

cwwjyh commented 1 year ago

> The matrix calculations in gradio_new.py are only for angle-visualization purposes and are independent of the zero123 inference code, so I wouldn't use one to understand the other.
>
> Assuming you are asking about zero123's inference conditioning, here's the code that converts a pair of camera matrices (input view and target view) into the 4-dimensional conditioning vector described in Appendix Section A.
>
> Hope it makes sense.
>
> P.S. Camera matrices over-parameterize the transformation, so there will be more than one way to represent an RT operation. I think a better way to compare RTs is to visualize the cameras or to convert them to Euler angles or axis-angle first.

If I use camera_R as the rotation matrix in gradio_new.py, is camera_R the target image's rotation matrix or not?

ruoshiliu commented 1 year ago

Sorry, but I don't understand. Why do you need camera_R, which is used only for visualization?

cwwjyh commented 1 year ago

> Sorry, but I don't understand. Why do you need camera_R, which is used only for visualization?

I want to use camera_R for more than visualization. I want to input a single RGB image and synthesize images from specified camera viewpoints; if I give many specified viewpoints, I can get many target views. Then I would use each target image and its corresponding extrinsic matrix (e.g., camera_R) as a new dataset. So I want to know whether camera_R in gradio_new.py is the target image's rotation matrix. If yes, I can calculate the translation vector T in gradio_new.py.
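One way to get a translation to pair with camera_R is a sketch like the following. It assumes (and this is an assumption about conventions, not something stated in gradio_new.py) an OpenCV-style w2c extrinsic where the camera sits on a sphere of known radius and looks along its +z axis at the origin; the helper name is mine:

```python
import numpy as np

def translation_from_R(camera_R, radius):
    """Recover the translation t of a w2c extrinsic [R|t], assuming an
    OpenCV-style convention: the camera is on a sphere of the given
    radius and looks along its +z axis at the world origin.

    The camera's +z axis in world coordinates is R^T @ [0, 0, 1]; the
    camera center is C = -radius * that direction, and t = -R @ C.
    """
    z_world = camera_R.T @ np.array([0.0, 0.0, 1.0])
    C = -radius * z_world          # camera center in world coordinates
    return -camera_R @ C           # equals [0, 0, radius] in this convention
```

A quick sanity check of any such [R|t] is that the camera center maps to the camera-frame origin: R @ C + t = 0.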

ruoshiliu commented 1 year ago

If you want to generate a novel view at a sampled target viewpoint described by an RT matrix: assuming your input viewpoint is cond_RT, you can sample a target viewpoint target_RT and use the get_T function to get d_T, which is the conditioning information for zero123: https://github.com/cvlab-columbia/zero123/blob/f70ea8c26af7943494f75314df054eb843999add/zero123/ldm/data/simple.py#L257-L272

cwwjyh commented 1 year ago

> If you want to generate a novel view at a sampled target viewpoint described by an RT matrix: assuming your input viewpoint is cond_RT, you can sample a target viewpoint target_RT and use the get_T function to get d_T, which is the conditioning information for zero123:
>
> https://github.com/cvlab-columbia/zero123/blob/f70ea8c26af7943494f75314df054eb843999add/zero123/ldm/data/simple.py#L257-L272

Thank you! I will try it. Can I write the above code in gradio_new.py to get the target RT matrix?

ruoshiliu commented 1 year ago

This function does (cond_RT, target_RT) -> d_T. Sounds like you want (cond_RT, d_T) -> target_RT. You will need to implement that yourself.
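A minimal sketch of that inverse direction, assuming the cameras orbit the world origin (so a look-at rotation can be rebuilt from spherical coordinates alone, OpenCV-style with camera +z pointing at the target) and that d_T packs [d_theta, sin(d_azimuth), cos(d_azimuth), d_radius] as in simple.py. All helper names here are mine, not the repo's:

```python
import math
import numpy as np

def camera_center(RT):
    """Camera center in world coordinates from a 3x4 w2c [R|t] matrix."""
    R, t = RT[:3, :3], RT[:3, 3]
    return -R.T @ t

def look_at_origin_RT(C, up=(0.0, 0.0, 1.0)):
    """Build a w2c [R|t] for a camera at C looking at the origin.
    Degenerate when C is parallel to `up`."""
    z = -C / np.linalg.norm(C)                 # viewing direction
    x = np.cross(np.asarray(up), z)
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                    # rows are the camera axes
    t = -R @ C
    return np.hstack([R, t[:, None]])

def apply_d_T(cond_RT, d_T):
    """(cond_RT, d_T) -> target_RT for origin-centered orbit cameras."""
    C = camera_center(cond_RT)
    xy = C[0] ** 2 + C[1] ** 2
    r = math.sqrt(xy + C[2] ** 2)
    theta = math.atan2(math.sqrt(xy), C[2])    # polar angle from +z
    azimuth = math.atan2(C[1], C[0])
    d_theta, sin_az, cos_az, d_r = d_T
    theta += d_theta
    azimuth += math.atan2(sin_az, cos_az)      # recover azimuth delta from sin/cos
    r += d_r
    C_new = r * np.array([math.sin(theta) * math.cos(azimuth),
                          math.sin(theta) * math.sin(azimuth),
                          math.cos(theta)])
    return look_at_origin_RT(C_new)
```

For example, starting from a camera at (1, 0, 0) and applying d_T = [0, 1, 0, 0] (a 90-degree azimuth change) moves the camera center to (0, 1, 0).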

cwwjyh commented 1 year ago

> This function does (cond_RT, target_RT) -> d_T. Sounds like you want (cond_RT, d_T) -> target_RT. You will need to implement that yourself.

cond_RT means the input image's RT and target_RT means the synthesized view's RT, but what does d_T mean?

ruoshiliu commented 1 year ago

It means the relative transformation from cond_RT to target_RT, which is used as input to the diffusion model: https://github.com/cvlab-columbia/zero123/blob/f70ea8c26af7943494f75314df054eb843999add/zero123/gradio_new.py#L81-L82
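In other words, given the relative polar change x (degrees), azimuth change y (degrees), and radius change z, the demo packs them into the 4-d vector roughly like this (a paraphrase of the linked lines, not the verbatim code):

```python
import math

def conditioning_vector(d_polar_deg, d_azimuth_deg, d_radius):
    """4-d conditioning: polar delta in radians, azimuth delta encoded
    as sin/cos (avoids the 0/360-degree wrap-around), radius delta."""
    return [math.radians(d_polar_deg),
            math.sin(math.radians(d_azimuth_deg)),
            math.cos(math.radians(d_azimuth_deg)),
            d_radius]
```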

I suggest you try to understand the paper/code before raising an issue here. Closing the ticket for now.