SuLvXiangXin / zipnerf-pytorch

Unofficial implementation of ZipNeRF
Apache License 2.0

ValueError: The histogram is empty, please file a bug report. #45

Open AAAeray opened 1 year ago

AAAeray commented 1 year ago

When I resume training from a checkpoint, I get the following error:

2023-06-25 17:16:24: Error!
Traceback (most recent call last):
  File "/home/zengxr/project/zipnerf_v2/train.py", line 387, in <module>
    app.run(main)
  File "/home/zengxr/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/zengxr/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/zengxr/project/zipnerf_v2/train.py", line 254, in main
    summary_writer.add_histogram('train_' + k, v, step)
  File "/home/zengxr/anaconda3/envs/multinerf/lib/python3.9/site-packages/tensorboardX/writer.py", line 562, in add_histogram
    histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
  File "/home/zengxr/anaconda3/envs/multinerf/lib/python3.9/site-packages/tensorboardX/summary.py", line 209, in histogram
    hist = make_histogram(values.astype(float), bins, max_bins)
  File "/home/zengxr/anaconda3/envs/multinerf/lib/python3.9/site-packages/tensorboardX/summary.py", line 247, in make_histogram
    raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.

SuLvXiangXin commented 1 year ago

@AAAeray It seems that NaN values appeared during training. Maybe the checkpoint you are resuming from does not match that scene? Try training from scratch.
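
If NaN values are indeed what empties the histogram, one way to keep the run alive long enough to debug is to filter non-finite values before logging. The following is a minimal sketch, not the repository's own fix: `finite_values` and `safe_add_histogram` are hypothetical helper names, and `writer` is assumed to be any object with a tensorboardX-style `add_histogram(tag, values, global_step)` method.

```python
import numpy as np


def finite_values(values):
    """Return the finite entries of `values` as a flat float array."""
    arr = np.asarray(values, dtype=float).ravel()
    return arr[np.isfinite(arr)]


def safe_add_histogram(writer, tag, values, step):
    """Log a histogram only if at least one finite value remains.

    Hypothetical helper: returns True if the histogram was written,
    False if it was skipped because every value was NaN or Inf.
    """
    finite = finite_values(values)
    if finite.size == 0:
        # Nothing usable to plot; skip instead of letting the writer raise.
        return False
    writer.add_histogram(tag, finite, step)
    return True
```

Skipping the histogram only hides the symptom. If the helper returns False, the training statistics are already NaN or Inf and the cause still has to be found upstream.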

TeamMasse commented 1 year ago

I get a similar error when training from scratch.

(zipnerf) PS C:\Users\Admin\Documents\zipnerf-pytorch> accelerate launch train.py --gin_configs=configs/360.gin --gin_bindings="Config.data_dir = '${DATA_DIR}'" --gin_bindings="Config.exp_name = '${EXP_NAME}'" --gin_bindings="Config.factor = 4"
2023-07-01 12:15:37: Config(dataset_loader='llff', batching='all_images', batch_size=65536, patch_size=1, factor=4, multiscale=False, multiscale_levels=4, forward_facing=False, render_path=False, llffhold=8, llff_use_all_images_for_training=False, llff_use_all_images_for_testing=False, use_tiffs=False, compute_disp_metrics=False, compute_normal_metrics=False, disable_multiscale_loss=False, randomized=True, near=0.2, far=1000000.0, exp_name='360_v2/bicycle', data_dir='data/360_v2/bicycle', vocab_tree_path=None, render_chunk_size=65536, num_showcase_images=5, deterministic_showcase=True, vis_num_rays=16, vis_decimate=0, max_steps=25000, early_exit_steps=None, checkpoint_every=5000, resume_from_checkpoint=True, checkpoints_total_limit=1, gradient_scaling=False, print_every=100, train_render_every=500, data_loss_type='charb', charb_padding=0.001, data_loss_mult=1.0, data_coarse_loss_mult=0.0, interlevel_loss_mult=0.0, anti_interlevel_loss_mult=0.01, orientation_loss_mult=0.0, orientation_coarse_loss_mult=0.0, orientation_loss_target='normals_pred', predicted_normal_loss_mult=0.0, predicted_normal_coarse_loss_mult=0.0, hash_decay_mults=0.1, lr_init=0.01, lr_final=0.001, lr_delay_steps=5000, lr_delay_mult=1e-08, adam_beta1=0.9, adam_beta2=0.99, adam_eps=1e-15, grad_max_norm=0.0, grad_max_val=0.0, distortion_loss_mult=0.005, opacity_loss_mult=0.0, eval_only_once=True, eval_save_output=True, eval_save_ray_data=False, eval_render_interval=1, eval_dataset_limit=2147483647, eval_quantize_metrics=True, eval_crop_borders=0, render_video_fps=60, render_video_crf=18, render_path_frames=120, z_variation=0.0, z_phase=0.0, render_dist_percentile=0.5, render_dist_curve_fn=<ufunc 'log'>, render_path_file=None, render_resolution=None, render_focal=None, render_camtype=None, render_spherical=False, render_save_async=True, render_spline_keyframes=None, render_spline_n_interp=30, render_spline_degree=5, render_spline_smoothness=0.03, render_spline_interpolate_exposure=False, 
rawnerf_mode=False, exposure_percentile=97.0, num_border_pixels_to_mask=0, apply_bayer_mask=False, autoexpose_renders=False, eval_raw_affine_cc=False, zero_glo=False, valid_weight_thresh=0.05, isosurface_threshold=20, mesh_voxels=134217728, visibility_resolution=512, mesh_radius=1.0, mesh_max_radius=10.0, std_value=0.0, compute_visibility=False, extract_visibility=True, decimate_target=-1, vertex_color=True, vertex_projection=True, tsdf_radius=2.0, tsdf_resolution=512, truncation_margin=5.0, tsdf_max_radius=10.0)
2023-07-01 12:15:37: Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

Warning: image_path not found for reconstruction
C:\Users\Admin\Documents\zipnerf-pytorch\internal\datasets.py:567: RuntimeWarning: invalid value encountered in matmul
  pixtocam = pixtocam @ np.diag([factor, factor, 1.])
Warning: image_path not found for reconstruction
2023-07-01 12:15:40: Checkpoint does not exist. Starting a new training run.
2023-07-01 12:15:49: Number of parameters being optimized: 77622581
2023-07-01 12:15:49: Begin training...
2023-07-01 12:22:27: NaN or Inf found in input tensor.
Training:   0%|                                                                                                                                                 | 0/25000 [06:37<?, ?it/s]
2023-07-01 12:22:27: Error!
Traceback (most recent call last):
  File "C:\Users\Admin\Documents\zipnerf-pytorch\train.py", line 387, in <module>
    app.run(main)
  File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\absl\app.py", line 308, in run
    _run_main(main, args)
  File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\absl\app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "C:\Users\Admin\Documents\zipnerf-pytorch\train.py", line 254, in main
    summary_writer.add_histogram('train_' + k, v, step)
  File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\tensorboardX\writer.py", line 562, in add_histogram
    histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
  File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\tensorboardX\summary.py", line 209, in histogram
    hist = make_histogram(values.astype(float), bins, max_bins)
  File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\tensorboardX\summary.py", line 247, in make_histogram
    raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
Traceback (most recent call last):
  File "C:\Users\Admin\.conda\envs\zipnerf\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Admin\.conda\envs\zipnerf\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Admin\.conda\envs\zipnerf\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\accelerate\commands\launch.py", line 941, in launch_command
    simple_launcher(args)
  File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\accelerate\commands\launch.py", line 603, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\Admin\\.conda\\envs\\zipnerf\\python.exe', 'train.py', '--gin_configs=configs/360.gin', "--gin_bindings=Config.data_dir = 'data/360_v2/bicycle'", "--gin_bindings=Config.exp_name = '360_v2/bicycle'", '--gin_bindings=Config.factor = 4']' returned non-zero exit status 1.

troybuckley commented 1 year ago

I'm getting the same error as @TeamMasse.

Running on Windows 11 w/ 3090Ti GPU and 3990X TR w/ 256GB RAM.

I can train all other NeRF types from scratch using my own data, but I'm struggling to get this implementation of ZipNeRF running on my machine.

Training also fails with the 360_v2 datasets (Garden, Room, Bicycle); all of them crash with the same error.

2023-09-05 02:18:25: Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: no

Warning: image_path not found for reconstruction
P:\04_Git\zipnerf-pytorch\internal\datasets.py:567: RuntimeWarning: invalid value encountered in matmul
  pixtocam = pixtocam @ np.diag([factor, factor, 1.])
Warning: image_path not found for reconstruction
2023-09-05 02:18:30: Checkpoint does not exist. Starting a new training run.
2023-09-05 02:18:30: Number of parameters being optimized: 77622581
2023-09-05 02:18:30: Begin training...
2023-09-05 02:18:57: NaN or Inf found in input tensor.
Training:   0%|                                                                              | 0/25000 [00:26<?, ?it/s]
2023-09-05 02:18:57: Error!
Traceback (most recent call last):
  File "P:\04_Git\zipnerf-pytorch\train.py", line 387, in <module>
    app.run(main)
  File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\absl\app.py", line 308, in run
    _run_main(main, args)
  File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\absl\app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "P:\04_Git\zipnerf-pytorch\train.py", line 254, in main
    summary_writer.add_histogram('train_' + k, v, step)
  File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\tensorboardX\writer.py", line 562, in add_histogram
    histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
  File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\tensorboardX\summary.py", line 210, in histogram
    hist = make_histogram(values.astype(float), bins, max_bins)
  File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\tensorboardX\summary.py", line 248, in make_histogram
    raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
Traceback (most recent call last):
  File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Troy\anaconda3\envs\zipnerf\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\accelerate\commands\launch.py", line 986, in launch_command
    simple_launcher(args)
  File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\accelerate\commands\launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\Troy\\anaconda3\\envs\\zipnerf\\python.exe', 'train.py', '--gin_configs=configs/360.gin', "--gin_bindings=Config.data_dir = 'P:/04_Git/zipnerf-pytorch/data/360_v2/garden'", "--gin_bindings=Config.exp_name = 'P:/04_Git/zipnerf-pytorch/output/360_v2/garden'", '--gin_bindings=Config.factor = 4']' returned non-zero exit status 1.
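
Both from-scratch logs above show `RuntimeWarning: invalid value encountered in matmul` at the `pixtocam @ np.diag([factor, factor, 1.])` line before training starts. That may mean the camera matrix already contains NaN or Inf when the data is loaded, though the thread does not confirm this. A minimal sketch for checking it follows; `scale_pixtocam` is a hypothetical wrapper around the same expression, not a function from the repository.

```python
import numpy as np


def scale_pixtocam(pixtocam, factor):
    """Apply the downsampling factor, failing fast on non-finite values.

    Hypothetical wrapper around `pixtocam @ np.diag([factor, factor, 1.])`.
    """
    pixtocam = np.asarray(pixtocam, dtype=float)
    if not np.isfinite(pixtocam).all():
        raise ValueError("pixtocam contains NaN or Inf before scaling")
    scaled = pixtocam @ np.diag([factor, factor, 1.0])
    if not np.isfinite(scaled).all():
        raise ValueError("pixtocam contains NaN or Inf after scaling")
    return scaled
```

If the first check raises, the problem is in the loaded camera parameters rather than in the training loop, which would fit the crash at step 0 in both logs.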

bring728 commented 1 year ago

I also had this problem. If the folder named by Config.exp_name in the .gin file already exists, training seems to automatically load the checkpoints already saved in that folder. In my case, I changed Config.exp_name and trained from scratch, and there were no problems.
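
The workaround above can be checked before launching a run. This is a minimal sketch under stated assumptions: the checkpoint directory is passed in explicitly because its layout inside the repository is not described in this thread, and `has_existing_checkpoints` and `fresh_exp_name` are hypothetical helper names.

```python
from pathlib import Path


def has_existing_checkpoints(checkpoint_dir):
    """Return True if `checkpoint_dir` exists and contains at least one entry."""
    path = Path(checkpoint_dir)
    return path.is_dir() and any(path.iterdir())


def fresh_exp_name(base_name, exists):
    """Return `base_name` if unused, else `base_name_1`, `base_name_2`, ...

    `exists` is a callable that reports whether a candidate name is taken,
    for example by checking its checkpoint directory.
    """
    if not exists(base_name):
        return base_name
    suffix = 1
    while exists(f"{base_name}_{suffix}"):
        suffix += 1
    return f"{base_name}_{suffix}"
```

The Config dump earlier in the thread lists `resume_from_checkpoint=True`, so overriding that field may be another way to avoid loading a stale checkpoint. Nobody in the thread has tested that.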