Training with our own data

zawlin commented 4 years ago

Hi,
I have a few questions on how the data should be formatted and the data format of the provided dryice1.

The model expects world space coordinate in meters? i.e if my extrinsics are already in meters do I still need the world_scale=1/256. in config.py file?
The extrinsics are in world2cam and the rotation convention is like opencv? i.e, y-down,z-forward and x-right, assuming identity for pose.txt file?
how long do I need to train for about 200 frames? And in the config.py file it seems you are skipping some frames? This is ok to do for my own sequence as well?
in the KRT file, I see that there's 5 parameters above the RT matrix. This is the distortion correction in opencv format? But it is not used yes?
I did not visualize your cameras, so I am not sure how they are distributed. Is it gonna be a problem if I use 50 cameras equally distributed in a half-hemisphere and the subject is already at world origin and 3.5 meters from every cameras? My question is do I need to filter the training cameras so that the back side of subject that is not seen by input 3 cameras is excluded?
How do I choose the input cameras? I have a visualization of the cameras . Which camera config should I use? Is this more a question of which testing camera poses I intend to have, i.e narrower the testing cameras' range of view, the closer input training cameras can be? Config_0 is more orthogonal and Config_1 sees less of the backside.

stephenlombardi commented 4 years ago

Hi, thanks for reaching out!

There's no expectation of a particular measurement unit (ie meters). The purpose of world scale is to convert the coordinates of the volume to lie in [-1, 1]^3 because that's the coordinate system that torch.functional.grid_sample uses. So worldscale actually dictates the size of the volume that is modeled because everything beyond -1 or 1 is cut off.
Yes, opencv convention for rotation.
In general I find around 200k-500k iterations gives a good result depending on the complexity of the scene. Returns tend to diminish after this point. For 200 you can probably err towards the shorter end of the range. I believe the included config.py file uses all available frames.
Yes, those are distortion parameters but the images are all undistorted.
No, don't filter out cameras; you want to use as many cameras as you can. The system can still model the details that the "input" cameras don't see (this is because it still gets training examples from other viewpoints).
The selection of the input cameras actually does not matter much, either one of those camera configurations should give similar results. As stated before, even though the encoder network only sees the 3 input images, it's only using them to produce a latent code and it's easy for the encoder network to include information about the entire scene into the codes (this is a feature, not a bug).

If you have any more questions or need me to clarify anything else, let me know!

zawlin commented 4 years ago

Can you check the shared folder again? I uploaded a sample result at iteration 1.

I know that my object is at origin. the cameras are all looking at origin as well. pose.txt is set to identity

From the look of it it seems the scale is a little wrong? Can I assume that my camera conventions are correct and that only scale needed to be changed

stephenlombardi commented 4 years ago

Yes it appears that the camera configuration is roughly correct. It's good that you can see the initial volume in all viewpoints. You might try increasing the world scale to get the volume to occupy the entire object.

zawlin commented 4 years ago

If I increase the world scale, it seems to just grow at the corner until it occupies the upper left quadrant. I think something is still not quite right. If everything is working, I expect the initial volume to be exactly in the center since I know exactly where the camera is supposed to be looking (0,0,0) and the pose.txt applys no transformation. Do you know what might be an issue?

stephenlombardi commented 4 years ago

Check line 65 and 66 of data/dryice1.py, it divides the focal length and principal point by 4 because the training data is downsampled from the original resolution. You probably don't want that.

zawlin commented 4 years ago

yes that seems to do the trick. can you check the shared folder again? does it look like I need to adjust world scale or it's just fine?

stephenlombardi commented 4 years ago

Looks pretty good to me, you'll probably have a better idea once it starts training.

zawlin commented 4 years ago

Thanks for helping!

zawlin commented 4 years ago

Hi, can you please check the share folder again? I uploaded ground truth and rendered results.

Will it still keep improving or should I train a few more days?
Will the fog effect go away if I train longer?
The render results look too bright, and the background is black instead of greyish. Do the bg needed to be added in as a processing step? does the fixedcammean parameters matter? I think that was just doing zero meaning based on 255 range right?

stephenlombardi commented 4 years ago

You might see some additional high frequency detail if you keep training as this tends to come in last
The fogginess will probably not go away
The code does color correction and gamma correction to the images for rendering (since we assume our data is in linear color space). Check out the file eval/writers/videowriter.py. On line 5 you'll see a bit which does gamma correction (... ** (1. / 1.8) ...). If your images are already gamma corrected you'll want to get rid of the exponent (or just change the 1.8 to a 1.). By default, when rendering a video not from one of the camera viewpoints, no background is added to the image because the backgrounds are considered to be camera dependent. On line 29 you'll see default arguments for background color which you can change to add an RGB background to that rendering. Also on line 29 is the colcorrect argument which will scale the RGB values. You'll want to set that to [1., 1., 1.] so it does nothing.

zawlin commented 4 years ago

Hi, It's working very well on synthetic data. However, I have some trouble getting it working on real data. I am using microsoft's fvv paper's data and some data we captured ourselves. Basically, after a while, the training just output background image. I have manually adjusted through trial and error pose.txt so that the volume is visible in all cameras and set world scale to 1/2, so that i don't have to spend too much time tweaking. At world scale 1, the volume is cut off in some cameras, but at world scale 1/2, it looks fine? Can you take a look at the progress images under real_data?

Is it important that the object is exactly in the center of camera array?
How many cameras are needed to be able to get geometry? I have 32 in our data, lincoln data has 52.

stephenlombardi commented 4 years ago

It shouldn't be very important that the object is exactly centered or not. We only used 34 cameras in the experiments in the paper.

If I had to guess based on the progress images, I would say that it looks like the camera parameters may not be set up correctly. If you look at the first progress image for the lincoln example prog_000003.jpg, the last row shows 4 views located behind the person but the rendered volume looks drastically different for each of them. I would expect it to be more similar if the camera parameters are correct.

If you're sure the camera parameters are correct and in the right format, one thing you can try is to try to train a model without the warp field, as sometimes it can cause stability problems in some cases.

zawlin commented 4 years ago

Hmm, the same parameters were used for training scene representation net, as well as my implementation of visual hull, so I feel the cameras might be alright. And the conversion code from my format to nv format is applied similarly to previous synthetic data.

But I am not 100% certain about pose transformations. How exactly did you obtain those numbers for your dataset?

For disabling the warp field, is it enough to set self.warp to None in Decoder class?

On Wed, Feb 26, 2020 at 12:23 AM Stephen Lombardi notifications@github.com wrote:

It shouldn't be very important that the object is exactly centered or not. We only used 34 cameras in the experiments in the paper.

If I had to guess based on the progress images, I would say that it looks like the camera parameters may not be set up correctly. If you look at the first progress image for the lincoln example prog_000003.jpg, the last row shows 4 views located behind the person but the rendered volume looks drastically different for each of them. I would expect it to be more similar if the camera parameters are correct.

If you're sure the camera parameters are correct and in the right format, one thing you can try is to try to train a model without the warp field, as sometimes it can cause stability problems in some cases.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/facebookresearch/neuralvolumes/issues/1?email_source=notifications&email_token=AALMFRQ3BTLOTB3HFBJH4Q3REVAY7A5CNFSM4KGBPNXKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEM4TSPY#issuecomment-590952767, or unsubscribe https://github.com/notifications/unsubscribe-auth/AALMFRSYPKISCM2YRGRV3KTREVAY7ANCNFSM4KGBPNXA .

-- Zaw Lin

stephenlombardi commented 4 years ago

Sorry, the pose transformation is a little cryptic so I'll try to explain better here. The way the code works is that it assume the volume always lives in the cube that spans -1 to 1 on each axis. This is what I'll call 'normalized space' since it's centered and has a standard size. When you provide the camera parameters of your rig, the camera extrinsics are in some arbitrary space that I'll refer to as 'camera space'. Because camera space has an arbitrary origin and scale, the object that you want to model won't necessarily fall in the [-1, 1]^3 volume. The pose transformation and world scale are how the code accounts for the difference between these two coordinate systems.

The transformation found in pose.txt transforms points from the normalized space to the camera space. The matrix is stored as a 3x4 matrix where the last column is the translation, which means that the translation column corresponds to the desired center of the volume (which should be the center of your object) in camera space. You can also adjust the rotation portion of the matrix to change the axes but getting the translation right is the most important bit so that the volume is placed correctly in space. Please let me know if that's helpful.

To disable the warp field you can add a parameter warptype=None to the Decoder constructor on line 33 of config.py.

zawlin commented 4 years ago

Ok. Got it. I will double check my camera parameters and and try without warping field for the next few days.

Worse come to worse, would you be able to take a look at the data and check on your side? I can share the original data and scripts to convert into neural volume format, including dataloaders and experiment config files for nv.

stephenlombardi commented 4 years ago

Sure I can take a look

zawlin commented 4 years ago

I managed to get it to start doing something on lincoln sequence. It turns out camera parameters were correct but the pose transformations were wrong. I was only looking at 16 cameras to do the adjustment which turns out the volume was not really overlapping the object in all cameras.

I uploaded new progress images under the same folder and also a zipped folder named lincoln.tar. Can you take a look and see if it looks like it's going well and I only need to wait?

Edit: After waiting one night, seems alright although it trains slower than synthetic data, iteration-wise. Again, thanks for all the clarifications!

stephenlombardi commented 4 years ago

I'm guessing you'll have some artifacts in the result given how it's trying to reconstruct so much of the background. I'm a little surprised since it should be easy for it to figure out that that area should be transparent, although sometimes it can get stuck in bad situations early on and it can have trouble recovering. I would recommend rendering a video of the current result with the render.py script to check it's not doing something too crazy.

zawlin commented 4 years ago

Hmm..something crazy is indeed happening :( I zipped up the entire folder with data and experiments and sent you a link with via email. I have also uploaded reconstruction from another method under the given test trajectories so that you know what "ground truth" is supposed to look like.

I am also unable to get it working on the other dataset. Whenever it looks like it's gonna do something, alphapr suddenly goes to zero and kldiv starts to increase alot and then i will just get background, then it will repeat the process in a loop. I am checking if I can share this data. Do you think if I just share one frame, it should be sufficient to debug?

Since it looks like 3d volume is rotating fine, I guess camera parameters are ok? But based on the test video(and comparison with our result video), maybe the volume is clipping the object since the rendered result look like it's shifted down about half?

stephenlombardi commented 4 years ago

Sharing one frame to debug should work. I will take a look at the lincoln data and see if I can figure out what's happening.

stephenlombardi commented 4 years ago

I got the lincoln example working, attached are the dataset class, config file, and modified pose.txt (although I didn't change pose.txt much). Let me know if this works for you experiment1.zip

zawlin commented 4 years ago

I got it working as well. Thanks a lot! Looks like I forgot to rescale the intrinsics. I believe it should work for the other dataset as well.

Edit: Yup it's working for both datasets.

zawlin commented 4 years ago

I have one more question. In the figure where you showed latent code interpolation, do you used all the frames in the training data? Say you have frame 1-5 in training data, and during testing, you used encoder to get frame 1 and frame 5's latent code and interpolate them to get frame 2,3,4?

stephenlombardi commented 4 years ago

I'm a little confused by your question, in your example if we interpolate the encodings of frame 1 and frame 5 we won't exactly reproduce the frames between them. This is particularly true if we interpolate distant frames in the sequence, which is the case for Fig. 8 in the neural volumes paper.

zawlin commented 4 years ago

Sorry for being unclear, I was trying to do "slowmo" effect and by subsampling training frames and trying to render frames which are in between(but not in training), I am trying to see if neural volume encoding produce anything reasonable in terms of time.

But I did a bit more tests and found that I can't really do slowmo effect on neural volume encodings. I am not sure if what I am doing is correct. Can you double check the result? I have the code to do the encoding interpolation and the result on full frames(training use all frames) and slowed frames(render more frames than in the trainining data). This is just to double confirm that I am doing the right thing. I think this result is sort of expected as there's no constraint on the latent space.

stephenlombardi commented 4 years ago

I took a look at the result and I think what you're seeing is expected. It's partly a limitation of this model which uses an inverse warp to model motion rather than a forward warp, which makes some motion interpolation difficult. It is also somewhat dependent on the data. I've noticed that if I train a very long sequence it does a much better job interpolating the latent space than a short sequence.

zawlin commented 4 years ago

How long is long? I can try to capture longer sequences and check.

stephenlombardi commented 4 years ago

We've captured ~7500 frames of facial data and found it works pretty well with that, the data is very redundant though, which I think helps. I think this model has a harder time with bodies since they have more complex motion.

gmzang commented 4 years ago

Hi, Can you guys please also share the KRT file for lincoln data?? I am still confused to set it correctly for my own data.. Any hint or reference for the format in KRT is appreciated. Thanks.

stephenlombardi commented 4 years ago

KRT.txt The KRT file is a series of camera specifications, each camera is specified in the following way: [camera name] K00 K01 K02 K10 K11 K12 K20 K21 K22 D0 D1 D2 D3 D4 R00 R01 R02 T0 R10 R11 R12 T1 R20 R21 R22 T2 [blank line]

where K is the intrinsic matrix, D are the distortion coefficients, R is the rotation matrix, T is the translation. However, you don't need to write a KRT file at all, you can simply write a new dataset class by making a copy of dryice1.py and loading the camera data however you like.

visonpon commented 4 years ago

@stephenlombardi Thanks for sharing this wonderful work, after reading the above discussion, I still have some problems with how to train with my own datasets. So the first step is to get the KRT.txt and pose.txt for own datasets, the KRT.txt contains the intrinsic and extrinsic matrix which I can use some tools like colmap to get, but how to get pose.txt?

zhanglonghao1992 commented 1 year ago

@stephenlombardi Thanks for sharing this wonderful work, after reading the above discussion, I still have some problems with how to train with my own datasets. So the first step is to get the KRT.txt and pose.txt for own datasets, the KRT.txt contains the intrinsic and extrinsic matrix which I can use some tools like colmap to get, but how to get pose.txt?

@visonpon Have you figured it out?

stephenlombardi commented 1 year ago

This comment explains pose.txt: https://github.com/facebookresearch/neuralvolumes/issues/1#issuecomment-591602762

facebookresearch / neuralvolumes

Training with our own data #1