facebookresearch / hyperreel

Code release for HyperReel: High-Fidelity 6-DoF Video with Ray-Conditioned Sampling
MIT License

Tips for handling the code for monocular dataset. #19

Open JihyongOh opened 1 year ago

JihyongOh commented 1 year ago

Hi, thanks for sharing such awesome work and nicely organized code! I want to study this large-scale codebase as a baseline network (framework) in detail, and then explore ideas for casually captured monocular videos. I have a few questions as follows:

  1. This large codebase seems to be implemented based on PyTorch Lightning. Was it developed from scratch? If so, could you provide some tips/guidelines or links to help me understand the overall flow of the code in detail? A brief outline, any organizing conventions, or tips on how to debug such a large-scale codebase would greatly help me in my studies.

  2. If I want to test HyperReel on the Neural 3D Video dataset with a monocular setting (e.g., only using camera1 among 20 cameras for 50 frames or all 300 frames at once), how can I modify a config or a YAML file associated with "scripts/run_one_n3d.sh"?

  3. If my own monocular video (forward-facing dataset) is provided as extracted frames (.png, not an .mp4 video) with poses_bounds.npy, how can I handle this dataset in this code structure for training HyperReel (any suggestions for which YAML/config file to refer to)? Do I have to convert the monocular video into .mp4 format?

  4. What does "hold_out" mean? (hold_out vs. no_hold_out)

Thank you very much!

benattal commented 1 year ago

Hi! Sorry for taking a while to get to this -- I wanted to let you know that I'm working on a longer-form response to your questions (especially the first one, as I think it's relevant to everyone using the repo), and will try to follow up within the next couple of days. I hope that's okay!

JihyongOh commented 1 year ago

@breuckelen Hello, no problem at all, I completely understand that you might need some time due to my detailed questions. I appreciate your willingness to provide a comprehensive answer, and I'm looking forward to reading it. Take your time and thanks a lot!

benattal commented 1 year ago

As mentioned in the README, the codebase was originally extended from nerf_pl and Neural Light Fields, but at this point its structure deviates fairly significantly from that of nerf_pl. I'll try to break it down (as best I can) below.

Basic Structure

At a high level, the optimization procedure, implemented in nlf/__init__.py, requires

  1. A dataset (everything under datasets/), which produces a set of training rays, and their corresponding ground truth colors
  2. A model (everything under nlf/model), which maps a ray to a predicted color
  3. A set of regularizers (everything under nlf/regularizers), which implement auxiliary losses applied to the model (e.g. total variation, sparsity, etc.).

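To make the relationship between these three pieces concrete, here is a rough sketch of what a single optimization step boils down to. All of the names below (training_step, reg.weight, the batch keys) are hypothetical; the real loop in nlf/__init__.py handles many more details (multiple optimizers, scheduling, logging, etc.):

# Hypothetical sketch of one optimization step, not the actual code.
def training_step(model, regularizers, batch):
    rays, rgb_gt = batch["rays"], batch["rgb"]        # produced by the dataset

    rgb_pred = model(rays)                            # embedding model + color model

    loss = ((rgb_pred - rgb_gt) ** 2).mean()          # main color reconstruction loss
    for reg in regularizers:                          # auxiliary losses (TV, sparsity, ...)
        loss = loss + reg.weight * reg.loss(model, batch)
    return loss
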
The dataset, model, regularizers, and additional training hyper-parameters (like learning rates for different model parameters, optimizer settings, and the weight initialization strategy) are constructed from configurations under the conf/ folder, and can be specified via the command line. For example:

python main.py experiment/dataset=<dataset_config> \
    experiment/training=<training_config> \
    experiment/model=<model_config> \
    experiment.dataset.collection=<scene_name> \
    +experiment/regularizers/tensorf=tv_4000

Note that if a specific configuration property z lives in a config file under conf/experiment/x/y, you can override it on the command line with experiment.x.y.z=<value>. You can also change the default configuration in conf/experiment/local.yaml.

Datasets

As mentioned above, various datasets that produce training rays and colors are implemented in the datasets/ folder.

The base datasets for static and dynamic scenes are Base5DDataset and Base6DDataset, respectively, in datasets/base.py. These base classes are pretty bare-bones, and there is unfortunately a lot of re-implemented boilerplate in each specific subclass (for example, compare datasets/technicolor.py and datasets/neural_3d.py). In general, dataset classes operate as follows during training:

  1. They load some meta information about the dataset in the read_meta function (e.g. image file names, camera poses, the number of frames in a video sequence).
  2. They prepare per-ray inputs / ground truth outputs in the prepare_training_data function. This typically involves creating a ray for every camera, every pixel in that camera, and every time step, as well as the colors corresponding to these rays.
  3. They collate the per-ray inputs into a single array in the update_all_data function, for more efficient loading.
  4. The per-ray inputs are accessed in a random order via the __getitem__ function at each training step (see the skeleton sketch below).

Each dataset class can also be used as a validation, testing, or render dataset by changing the split flag. For validation/testing datasets, the above process is much the same, except that step 3 (collating per-ray inputs) is skipped, and the dataset returns each individual per-ray input in the __getitem__ function. For render datasets, the dataset creates a novel camera trajectory in prepare_render_data, and does not return ground truth per-ray outputs (colors).
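
To make the four steps above concrete, here is a rough skeleton of what a custom dynamic dataset might look like. The constructor arguments and internal attributes of Base6DDataset differ in the real code, so treat everything here (including the file and class names) as a hypothetical sketch rather than the actual API:

# datasets/my_monocular.py -- hypothetical skeleton, not the actual interface.
from .base import Base6DDataset

class MyMonocularDataset(Base6DDataset):
    def read_meta(self):
        # 1. Load image file names, camera poses, and the number of frames.
        ...

    def prepare_training_data(self):
        # 2. Build one ray (origin, direction, time) and one ground truth color
        #    for every camera, every pixel, and every time step.
        ...

    def update_all_data(self):
        # 3. Collate the per-ray inputs/outputs into single arrays so that
        #    training batches can be sliced efficiently.
        ...

    def __getitem__(self, idx):
        # 4. Return the per-ray inputs (and ground truth color) for one index;
        #    indices arrive in random order during training.
        ...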

The configurations for different datasets can be found in the conf/experiment/dataset folder. Some important config options are:

  1. name: this specifies the type of dataset --- in other words, the dataset class used. You can add new dataset classes by registering them in the dataset_dict in datasets/__init__.py (see the sketch after this list)
  2. collection: this specifies the particular scene used within the dataset
  3. root_dir: the path to the scene, which defaults to <data_root_dir>/<data_subdir>/<collection>
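
If you add a new dataset class, registering it so that it can be selected via name should look roughly like this (the import and key are placeholders tied to the hypothetical sketch above):

# datasets/__init__.py -- sketch of registering a new dataset class.
from .my_monocular import MyMonocularDataset  # hypothetical module/class

dataset_dict = {
    # ... existing entries (e.g. technicolor, neural_3d, donerf) stay as-is ...
    "my_monocular": MyMonocularDataset,
}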

Models

A model maps a ray to a predicted color. In this project, every model consists of two components:

  1. An embedding model, which re-maps a ray (defined by origin, direction) into some intermediate coordinate space. This might be some high dimensional latent space (as in "Neural Light Fields"), or a set of sample points along the ray defined by ray-primitive intersections (as in this project).
  2. A color model, which maps rays in the intermediate coordinate space to a color. This might be a single-evaluation MLP, or a volumetric scene representation like TensoRF that takes sample points and performs volume rendering.
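
In other words, the forward pass is conceptually just the composition of these two components. A minimal sketch with hypothetical names (the real classes under nlf/model take additional arguments and handle batching, chunking, etc.):

# Conceptual sketch of the two-stage model, not the actual code.
def render_rays(embedding_model, color_model, rays):
    # Stage 1: map each ray to intermediate coordinates, e.g. sample points
    # produced by ray-primitive intersections.
    samples = embedding_model(rays)

    # Stage 2: map those samples to a final color, e.g. by volume rendering
    # a TensoRF-style representation.
    return color_model(samples)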

All model configs (in conf/experiment/models) will typically contain the following lines:

# @package _group_

type: lightfield

render:
  type: lightfield

param:
  fn: identity

embedding:
  type: ...

color:
  type: ...
Like the name variable in the dataset configurations, the type variables for embedding and color specify which model to use from nlf/embedding/__init__.py and nlf/nets/__init__.py, respectively. I'll discuss how to define / extend your own embedding and color models below.

Embedding Models

In this project, every embedding model is a RayPointEmbedding (defined in nlf/embedding/embedding.py), combining a sequence of ray-dependent operations (like mapping a ray to a set of sample points), and point-dependent operations (like adding a set of point offsets to each sample point). It's very simple to compose a sequence of arbitrary operations in a model config file, with the following syntax:

embedding:
  type: ray_point

  embeddings:
    op0:
      type: ...

    op1:
      type: ...

    ...

Above, op0, op1, etc. can be arbitrary keys, and all of these operations are applied in sequence. The type variable for each operation specifies which class in the embedding_dict variable in nlf/embedding/__init__.py to use.

The ray-dependent operations are defined in nlf/embedding/ray.py. We use RayPredictionEmbedding, which maps a ray (origin and direction) to a set of per-ray outputs, like parameters for geometric primitives. With configuration files it's straightforward to specify which "parameterizations" to apply to the input ray (two-plane, Pluecker, etc.), what positional encoding to apply to the ray, the type of MLP to use, and the shape and name of each per-ray output. For example, the donerf model config looks like this:

    ray_prediction_0:
      type: ray_prediction

      # Parameterization
      params:
        ray:
          start: 0
          end: 6

          param:
            n_dims: 6
            fn: pluecker
            direction_multiplier: 1.0
            moment_multiplier: 1.0

          pe:
            type: windowed
            freq_multiplier: 2.0
            n_freqs: 1
            wait_iters: 0
            max_freq_epoch: 0
            exclude_identity: False

      # Net
      net:
        type: base
        group: embedding_impl

        depth: 6
        hidden_channels: 256
        skips: [3]

      # Outputs
      z_channels: 32

      outputs:
        z_vals:
          channels: 4

        sigma:
          channels: 1

          activation:
            type: ease_value
            start_value: 1.0
            window_epochs: 3
            wait_epochs: 0

            activation:
              type: sigmoid
              shift: 4.0

        point_sigma:
          channels: 1

          activation:
            type: ease_value
            start_value: 1.0
            window_epochs: 3
            wait_epochs: 1

            activation:
              type: sigmoid
              shift: 4.0

        point_offset:
          channels: 3

          activation:
            type: tanh
            outer_fac: 0.125

        color_scale:
          channels: 3

          activation:
            type: ease_value
            start_value: 0.0
            window_epochs: 0
            wait_epochs: 0

            activation:
              type: identity
              shift: 0.0
              inner_fac: 1.0
              outer_fac: 1.0

        color_shift:
          channels: 3

          activation:
            type: ease_value
            start_value: 0.0
            window_epochs: 0
            wait_epochs: 0

            activation:
              type: identity
              shift: 0.0
              inner_fac: 1.0
              outer_fac: 1.0

This config (1) applies a Pluecker parameterization to the ray, (2) applies positional encoding with 1 frequency to the Pluecker-parameterized ray, (3) feeds the result through a 6-layer, 256-hidden-unit MLP, and (4) outputs geometric primitive parameters (z_vals), point offsets (point_offset), and a few other quantities for 32 different sample points (z_channels: 32).

We also use RayIntersectEmbedding, which intersects a ray with a set of geometric primitives, producing sample points for that ray. Various intersect methods are defined in the nlf/intersect/ folder. We use axis-aligned z planes and spheres in our work, but we also define intersect methods for voxel grids, non-axis-aligned planes, and a few others. You can also extend these or define your own.
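
For reference, the axis-aligned z-plane case reduces to a standard ray-plane intersection. A minimal sketch (the actual code in nlf/intersect/ is more general and handles edge cases, ordering, and masking):

import torch

# Hypothetical sketch: intersect rays with predicted axis-aligned z-planes.
def intersect_z_planes(origins, directions, z_vals, eps=1e-8):
    # origins, directions: (N, 3); z_vals: (N, K) predicted plane depths.
    # Distance along each ray to the plane z = z_vals[..., k].
    t = (z_vals - origins[..., 2:3]) / (directions[..., 2:3] + eps)       # (N, K)
    # Sample points o + t * d, one per plane.
    points = origins[..., None, :] + t[..., None] * directions[..., None, :]
    return points                                                         # (N, K, 3)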

The main point-dependent operation that we use is PointOffsetEmbedding in nlf/embedding/point.py, which simply adds point offsets to each generated sample point, modulated by a set of per-sample-point weights. For dynamic scenes, we also use the AdvectPoints embedding, which advects each sample point into the nearest keyframe using per-sample-point flows output by the RayPredictionEmbedding.
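
Conceptually, the point-offset step is just a weighted additive correction applied to each sample point; a tiny hypothetical sketch:

# Sketch: add predicted offsets to sample points, scaled by per-point weights.
# points: (N, K, 3), offsets: (N, K, 3), weights: (N, K, 1) in [0, 1].
def apply_point_offsets(points, offsets, weights):
    return points + weights * offsets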

Color Models

Color models typically map a set of sample points (generated via an embedding model) to a color using volume rendering on some underlying volumetric scene representation. We implement a few TensoRF-based models for static and dynamic scenes in nlf/nets, but in principle it should be easy to add your own. I am currently messing around with Instant-NGP, and some other models from nerfacc. Feel free to DM me if you're interested in using these implementations, which I haven't yet integrated into the public repo.
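
For anyone writing a new color model, the volume rendering step itself is the standard quadrature used by NeRF/TensoRF-style methods. A self-contained sketch (simplified, and not the project's actual implementation):

import torch

# Standard volume rendering compositing along each ray (simplified sketch).
def composite(sigmas, colors, deltas):
    # sigmas: (N, K) densities, colors: (N, K, 3), deltas: (N, K) distances
    # between consecutive samples along each ray.
    alphas = 1.0 - torch.exp(-sigmas * deltas)                            # (N, K)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), 1.0 - alphas + 1e-10], dim=-1),
        dim=-1,
    )[:, :-1]                                                             # (N, K)
    weights = alphas * trans                                              # (N, K)
    return (weights[..., None] * colors).sum(dim=-2)                      # (N, 3)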

Because the implementation of each color model is pretty much standalone (they're not designed to be composable at the moment and are implemented independently of one another), I won't go into too much detail here. If you have any questions about our models (e.g. the keyframe-based TensoRF model that we use), or about how to implement your own, feel free to follow up.

Regularizers

The regularizer classes, implemented in nlf/regularizers, are a way for us to add auxiliary losses to the model, apart from the typical color loss (like total variation, sparsity -- or in the monocular case, perhaps monocular depth losses and flow losses). Defining a regularizer is pretty straightforward. Just extend the base regularizer class in nlf/regularizers/base.py, implement the _loss(...) function, and add your class to the regularizer_dict in nlf/regularizers/__init__.py.
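
Following that recipe, a custom regularizer for the monocular setting might look roughly like this. The _loss signature, the available batch/output keys, and the loss itself are all assumptions on my part, meant only to show where your code would go:

# nlf/regularizers/my_depth.py -- hypothetical sketch, not the actual interface.
from .base import BaseRegularizer

class MonocularDepthRegularizer(BaseRegularizer):
    def _loss(self, batch, outputs):
        # Penalize disagreement between rendered depth and a precomputed
        # monocular depth prior carried along with the batch (made-up keys).
        pred_depth = outputs["depth"]
        prior_depth = batch["mono_depth"]
        return ((pred_depth - prior_depth) ** 2).mean()

You would then add the new class to the regularizer_dict in nlf/regularizers/__init__.py so it can be referenced by name in a config.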

In order to make the regularizer accessible via the command line, you should create a new folder in conf/experiment/regularizers for your specific regularizer type. You can put any number of configurations for this regularizer within this folder, and then use the regularizer by adding the following to the command line:

+experiment/regularizers/<folder_name>=<config_name>

The Training Loop

Ideally, you won't have to modify the core training loop in nlf/__init__.py too much. It's designed to be pretty general-purpose. However, it still might be useful to understand a couple of things about how it works:

  1. All training / optimization settings are defined with configurations in the conf/experiment/training folder, where you can specify how to sample from your dataloader (e.g. with or without replacement), number of epochs, optimizer, learning rate, decay rate, etc.
  2. You can create any number of named optimizers (with different learning rates, decay rates, etc.), and assign arbitrary parts of the model (specific parts of the embedding, color model, etc.) to a specific optimizer by setting the opt_group property of a class (see the sketch after this list).
  3. You can add/remove any number of regularizers on the command line, by simply adding +experiment/regularizers/<x>=<config_name> for each regularizer x.
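
For point 2, a heavily hedged sketch of the idea (whether opt_group is literally a string like this, and which names are valid, depends on your training config and the code in nlf/__init__.py):

import torch

# Hypothetical sketch: route a submodule's parameters to a named optimizer
# by setting opt_group; "color" is a made-up name that would have to match
# an optimizer defined under conf/experiment/training.
class MyColorModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.opt_group = "color"
        self.net = torch.nn.Linear(32, 3)  # placeholder parameters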

Extending HyperReel to Monocular Sequences

If you're interested in using HyperReel for monocular sequences, this would probably require:

  1. Writing your own custom dataset class extended from Base6DDataset
  2. Creating a new model configuration file in conf/experiment/models, though you should be able to use any existing dynamic model configuration as a starting point
  3. Writing your own custom regularizers extended from BaseRegularizer for the monocular setting. Note that if you require something like per-ray optical flow for regularization (from time t to time t+1), you can make these flows accessible from your dataset class by appending them to the other per-ray inputs (origins, directions, times), as in the sketch below. As an example, consider the donerf dataset in datasets/donerf.py, which makes ground truth depth accessible (although we do not use it).
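
For point 3, the idea is just to concatenate the extra supervision with the other per-ray quantities so that it travels with each ray into your regularizer. A hypothetical sketch (the actual tensor layout used by the dataset classes may differ):

import torch

# Hypothetical sketch: append per-ray optical flow to the per-ray inputs
# inside prepare_training_data so that a regularizer can read it back later.
def build_per_ray_inputs(origins, directions, times, flows):
    # origins: (N, 3), directions: (N, 3), times: (N, 1), flows: (N, 2)
    return torch.cat([origins, directions, times, flows], dim=-1)         # (N, 9)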

Summary / Conclusion

I realize that this is a lot of info, and I apologize for the fact that it's a little disorganized at the moment. In terms of extending HyperReel, I recommend following the blueprint in the section above, and referencing other parts of this post as necessary. And of course please follow-up if you have any additional questions. I'll do my best to answer them in a timely manner.

benattal commented 1 year ago

Let me also try to answer your questions (2) and (4) here:

(2) You can change the val_set parameter in conf/experiment/dataset/neural_3d.yaml to specify which cameras to use for validation (all others will be used for training). You can also specify the number of frames to use here.

(4) The no_holdout scripts are used to train models with every view --- we do not use these models for quantitative results, but we do use them for some of the demo videos, where it doesn't necessarily make sense to hold out views (you want to use all of the data available to you for the best qualitative view synthesis results).

JihyongOh commented 1 year ago

@breuckelen Apologies for the delayed response, as I was dealing with a personal matter. Huge thank you for your detailed guidance and explanations! This will greatly assist me in my research and studies. If I encounter any difficulties later on, I'll be sure to ask additional questions. :)

yavon818 commented 1 year ago


@JihyongOh I wonder if you have successfully run the code on your monocular video dataset? How about the performance?