facebookresearch / hyperreel

Code release for HyperReel: High-Fidelity 6-DoF Video with Ray-Conditioned Sampling

Command and checkpoints for reproducing results in the paper #5

Closed hangg7 closed 1 year ago

hangg7 commented 1 year ago

Hi Ben,

Thanks for releasing your code!

I am wondering if you are going to release the pretrained models? It would make it much easier to play with the repo!

On the other hand, what is the command I should use to, say, reproduce your numbers and results on the neural-3d-video dataset? You mentioned here that we should specify the "start frame", but what value should it be? I would assume 0, but it would be nice if you could specify it.

Finally, I noticed that for the neural-3d-video dataset, you seem to use a subset of frames instead of the full video. In that case, is the comparison study you showed in the paper still valid? It seems to me that the original neural-3d-video paper uses the full resolution and the full video for training and evaluation.

Thanks!

benattal commented 1 year ago

Hi! We plan to publicly release the models at a later date. If there are specific ones that you'd like to mess around with, let me know and I'll see what I can do :)

In order to reproduce the numbers and results for our Neural 3D Video comparison: as specified in the paper, we split each video into several 50-frame chunks, which together span the entire video. So, to produce the results for a single sequence (flame_steak, for example), we average the metrics over all chunks, i.e. start frames 0, 50, 100, ..., 250.
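To make the averaging concrete, here's a quick sketch in Python (the PSNR values are placeholders for illustration, not actual numbers from the paper):

    # Hypothetical per-chunk PSNRs for one scene, keyed by start frame.
    # These values are placeholders, not real results.
    psnr_per_chunk = {0: 30.0, 50: 30.5, 100: 29.8, 150: 30.2, 200: 30.1, 250: 30.4}

    # The per-scene number is simply the mean over all chunks.
    scene_psnr = sum(psnr_per_chunk.values()) / len(psnr_per_chunk)
    print(f"scene PSNR: {scene_psnr:.2f}")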

With regards to image resolution, we follow the setting in NeRFPlayer, which is also the setting that the original Neural 3D Video paper uses. This is a snippet from NeRFPlayer: [screenshot of the relevant NeRFPlayer passage]

And the corresponding snippet from the Neural 3D Video paper, confirming that evaluation is performed at 1K resolution: [screenshot from the Neural 3D Video paper]

Anyway, thanks for your questions, and I hope this was helpful! Feel free to follow up privately about the models if you'd like.

Best, Ben

benattal commented 1 year ago

Forgot to answer the part about what commands to run to reproduce the results. For Neural 3D Video, to produce results for a single scene, you'll want to run:

bash scripts/run_one_n3d.sh <gpu_to_use> <scene> <start_frame>

for start_frame = 0, 50, ..., 250, and then average the metrics across these models.
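If you want to script this, a small driver along these lines should work (the scene name and GPU index are just examples, and the range assumes a 300-frame sequence):

    import subprocess

    scene = "flame_steak"  # example scene from Neural 3D Video
    gpu = "0"              # GPU index to use

    # One training run per 50-frame chunk; start frames 0, 50, ..., 250
    # cover a 300-frame sequence.
    for start_frame in range(0, 300, 50):
        subprocess.run(
            ["bash", "scripts/run_one_n3d.sh", gpu, scene, str(start_frame)],
            check=True,
        )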

The commands to use for other datasets are specified in the README.

benattal commented 1 year ago

Just wanted to quickly follow up here --- was I able to answer all of your questions w/ regards to reproducing results? If not, let me know. I'm happy to try to assist however I can.

hangg7 commented 1 year ago

Thanks for your prompt clarification! It makes a lot of sense (sorry that I didn't receive any notification that you had replied). My questions are all answered, and I am closing the issue here. Thanks for your great work!

hangg7 commented 1 year ago

A quick follow-up question: you mentioned splitting the training video into chunks; is there any particular reason for that? Also, since it is chunked, does that mean the training time scales up linearly with the number of frames? I remember you said the training time is about 1.5 h; does that include the training time of all chunks?

benattal commented 1 year ago

I'm not sure if this is exactly what you're asking but: we don't have an ablation for chunk size, and use 50 frame chunks for all dynamic datasets. Or was there something else you were getting at with your first question?

And yes, that's correct --- we train every model for ~1.5 hours, which for a dynamic sequence means training a model for each 50-frame subset for 1.5 hours (e.g. 9 hours total for a 10-second video from Neural 3D). Since this isn't immediately obvious from the text, I'll make a note in the README and be sure to clarify this point in any future revisions of the paper (as it pertains to both the Neural 3D and Google Immersive datasets).
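In other words, the arithmetic for one 10-second sequence (assuming 30 fps) is roughly:

    # Back-of-envelope training cost for a 10-second Neural 3D Video sequence.
    fps = 30                 # assumed capture rate
    seconds = 10
    chunk_size = 50          # frames per model
    hours_per_chunk = 1.5    # training time per chunk

    n_chunks = fps * seconds // chunk_size      # 300 / 50 = 6 chunks
    total_hours = n_chunks * hours_per_chunk    # 6 * 1.5 = 9 hours
    print(n_chunks, total_hours)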

hangg7 commented 1 year ago

That makes sense. About the first question, I meant "reason" instead of "result" (oops, it's a typo). I would assume it has something to do with the model size?

benattal commented 1 year ago

Yep --- it's both (1) model size (as explained at the end of page 5, if the number of keyframes is small relative to the spatial resolution of the volume, then the model size is comparable to a static TensoRF), and (2) RAM constraints, given how our dataloaders and training procedure are implemented, when loading lots and lots of frames for longer video sequences.
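As a very rough back-of-envelope for the RAM side (the camera count and resolution below are illustrative assumptions, and this only counts raw RGB pixels, not the per-ray data our loaders actually hold):

    # Approximate memory to keep every training pixel in RAM at once,
    # assuming ~20 cameras at 1352 x 1014 with uint8 RGB.
    n_cams, height, width = 20, 1014, 1352
    bytes_per_pixel = 3  # uint8 RGB

    def gb_for(n_frames):
        return n_cams * height * width * bytes_per_pixel * n_frames / 1e9

    print(f"50-frame chunk:  ~{gb_for(50):.1f} GB")   # roughly 4 GB
    print(f"300-frame video: ~{gb_for(300):.1f} GB")  # roughly 25 GB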

No doubt there's a way, with better engineering, to avoid loading the entire dataset (all frames, all rays for each frame) into memory at the beginning of training, but we just haven't gotten around to it yet.