Open AAAeray opened 1 year ago
@AAAeray It seems that NaN values appear during training. Maybe you are using a different checkpoint for that scene? Try training from scratch.
I get a similar error when training from scratch.
(zipnerf) PS C:\Users\Admin\Documents\zipnerf-pytorch> accelerate launch train.py --gin_configs=configs/360.gin --gin_bindings="Config.data_dir = '${DATA_DIR}'" --gin_bindings="Config.exp_name = '${EXP_NAME}'" --gin_bindings="Config.factor = 4"
2023-07-01 12:15:37: Config(dataset_loader='llff', batching='all_images', batch_size=65536, patch_size=1, factor=4, multiscale=False, multiscale_levels=4, forward_facing=False, render_path=False, llffhold=8, llff_use_all_images_for_training=False, llff_use_all_images_for_testing=False, use_tiffs=False, compute_disp_metrics=False, compute_normal_metrics=False, disable_multiscale_loss=False, randomized=True, near=0.2, far=1000000.0, exp_name='360_v2/bicycle', data_dir='data/360_v2/bicycle', vocab_tree_path=None, render_chunk_size=65536, num_showcase_images=5, deterministic_showcase=True, vis_num_rays=16, vis_decimate=0, max_steps=25000, early_exit_steps=None, checkpoint_every=5000, resume_from_checkpoint=True, checkpoints_total_limit=1, gradient_scaling=False, print_every=100, train_render_every=500, data_loss_type='charb', charb_padding=0.001, data_loss_mult=1.0, data_coarse_loss_mult=0.0, interlevel_loss_mult=0.0, anti_interlevel_loss_mult=0.01, orientation_loss_mult=0.0, orientation_coarse_loss_mult=0.0, orientation_loss_target='normals_pred', predicted_normal_loss_mult=0.0, predicted_normal_coarse_loss_mult=0.0, hash_decay_mults=0.1, lr_init=0.01, lr_final=0.001, lr_delay_steps=5000, lr_delay_mult=1e-08, adam_beta1=0.9, adam_beta2=0.99, adam_eps=1e-15, grad_max_norm=0.0, grad_max_val=0.0, distortion_loss_mult=0.005, opacity_loss_mult=0.0, eval_only_once=True, eval_save_output=True, eval_save_ray_data=False, eval_render_interval=1, eval_dataset_limit=2147483647, eval_quantize_metrics=True, eval_crop_borders=0, render_video_fps=60, render_video_crf=18, render_path_frames=120, z_variation=0.0, z_phase=0.0, render_dist_percentile=0.5, render_dist_curve_fn=<ufunc 'log'>, render_path_file=None, render_resolution=None, render_focal=None, render_camtype=None, render_spherical=False, render_save_async=True, render_spline_keyframes=None, render_spline_n_interp=30, render_spline_degree=5, render_spline_smoothness=0.03, render_spline_interpolate_exposure=False, 
rawnerf_mode=False, exposure_percentile=97.0, num_border_pixels_to_mask=0, apply_bayer_mask=False, autoexpose_renders=False, eval_raw_affine_cc=False, zero_glo=False, valid_weight_thresh=0.05, isosurface_threshold=20, mesh_voxels=134217728, visibility_resolution=512, mesh_radius=1.0, mesh_max_radius=10.0, std_value=0.0, compute_visibility=False, extract_visibility=True, decimate_target=-1, vertex_color=True, vertex_projection=True, tsdf_radius=2.0, tsdf_resolution=512, truncation_margin=5.0, tsdf_max_radius=10.0)
2023-07-01 12:15:37: Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
Warning: image_path not found for reconstruction
C:\Users\Admin\Documents\zipnerf-pytorch\internal\datasets.py:567: RuntimeWarning: invalid value encountered in matmul
pixtocam = pixtocam @ np.diag([factor, factor, 1.])
Warning: image_path not found for reconstruction
2023-07-01 12:15:40: Checkpoint does not exist. Starting a new training run.
2023-07-01 12:15:49: Number of parameters being optimized: 77622581
2023-07-01 12:15:49: Begin training...
2023-07-01 12:22:27: NaN or Inf found in input tensor.
Training: 0%| | 0/25000 [06:37<?, ?it/s]
2023-07-01 12:22:27: Error!
Traceback (most recent call last):
File "C:\Users\Admin\Documents\zipnerf-pytorch\train.py", line 387, in <module>
app.run(main)
File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\absl\app.py", line 308, in run
_run_main(main, args)
File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\absl\app.py", line 254, in _run_main
sys.exit(main(argv))
File "C:\Users\Admin\Documents\zipnerf-pytorch\train.py", line 254, in main
summary_writer.add_histogram('train_' + k, v, step)
File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\tensorboardX\writer.py", line 562, in add_histogram
histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\tensorboardX\summary.py", line 209, in histogram
hist = make_histogram(values.astype(float), bins, max_bins)
File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\tensorboardX\summary.py", line 247, in make_histogram
raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
Traceback (most recent call last):
File "C:\Users\Admin\.conda\envs\zipnerf\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Admin\.conda\envs\zipnerf\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\Admin\.conda\envs\zipnerf\Scripts\accelerate.exe\__main__.py", line 7, in <module>
File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
args.func(args)
File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\accelerate\commands\launch.py", line 941, in launch_command
simple_launcher(args)
File "C:\Users\Admin\.conda\envs\zipnerf\lib\site-packages\accelerate\commands\launch.py", line 603, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\Admin\\.conda\\envs\\zipnerf\\python.exe', 'train.py', '--gin_configs=configs/360.gin', "--gin_bindings=Config.data_dir = 'data/360_v2/bicycle'", "--gin_bindings=Config.exp_name = '360_v2/bicycle'", '--gin_bindings=Config.factor = 4']' returned non-zero exit status 1.
I'm getting the same error as @TeamMasse.
Running on Windows 11 w/ 3090Ti GPU and 3990X TR w/ 256GB RAM.
I can train all other NeRF types from scratch using my own data, but I'm struggling to get this implementation of ZipNeRF running on my machine.
Training also fails on the 360_v2 datasets: Garden, Room, and Bicycle all crash with the same error.
2023-09-05 02:18:25: Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: no
Warning: image_path not found for reconstruction
P:\04_Git\zipnerf-pytorch\internal\datasets.py:567: RuntimeWarning: invalid value encountered in matmul
pixtocam = pixtocam @ np.diag([factor, factor, 1.])
Warning: image_path not found for reconstruction
2023-09-05 02:18:30: Checkpoint does not exist. Starting a new training run.
2023-09-05 02:18:30: Number of parameters being optimized: 77622581
2023-09-05 02:18:30: Begin training...
2023-09-05 02:18:57: NaN or Inf found in input tensor.
Training: 0%| | 0/25000 [00:26<?, ?it/s]
2023-09-05 02:18:57: Error!
Traceback (most recent call last):
File "P:\04_Git\zipnerf-pytorch\train.py", line 387, in <module>
app.run(main)
File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\absl\app.py", line 308, in run
_run_main(main, args)
File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\absl\app.py", line 254, in _run_main
sys.exit(main(argv))
File "P:\04_Git\zipnerf-pytorch\train.py", line 254, in main
summary_writer.add_histogram('train_' + k, v, step)
File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\tensorboardX\writer.py", line 562, in add_histogram
histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\tensorboardX\summary.py", line 210, in histogram
hist = make_histogram(values.astype(float), bins, max_bins)
File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\tensorboardX\summary.py", line 248, in make_histogram
raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
Traceback (most recent call last):
File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\Troy\anaconda3\envs\zipnerf\Scripts\accelerate.exe\__main__.py", line 7, in <module>
File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
args.func(args)
File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\accelerate\commands\launch.py", line 986, in launch_command
simple_launcher(args)
File "C:\Users\Troy\anaconda3\envs\zipnerf\lib\site-packages\accelerate\commands\launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\Troy\\anaconda3\\envs\\zipnerf\\python.exe', 'train.py', '--gin_configs=configs/360.gin', "--gin_bindings=Config.data_dir = 'P:/04_Git/zipnerf-pytorch/data/360_v2/garden'", "--gin_bindings=Config.exp_name = 'P:/04_Git/zipnerf-pytorch/output/360_v2/garden'", '--gin_bindings=Config.factor = 4']' returned non-zero exit status 1.
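Both logs above show a `RuntimeWarning: invalid value encountered in matmul` on the `pixtocam` line before training starts. The thread does not establish whether that warning is related to the later NaN. One way to find out is to check the intrinsics for non-finite entries around that multiplication. The following is a minimal sketch, not the repository's code: the helper name `scale_pixtocam` is hypothetical, and only the matmul expression is taken from the warning.

```python
import numpy as np


def scale_pixtocam(pixtocam, factor):
    """Hypothetical helper: rescale inverse intrinsics, failing loudly on NaN/Inf."""
    pixtocam = np.asarray(pixtocam, dtype=np.float64)
    if not np.all(np.isfinite(pixtocam)):
        raise ValueError("pixtocam contains NaN or Inf before scaling")
    # Same expression as the line named in the warning.
    scaled = pixtocam @ np.diag([factor, factor, 1.0])
    if not np.all(np.isfinite(scaled)):
        raise ValueError("pixtocam contains NaN or Inf after scaling")
    return scaled
```

If neither check raises on the real camera data, the warning is probably not the source of the NaN that shows up during training.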
I also had this problem. If the folder named by Config.exp_name in the .gin file already exists, the script seems to automatically load the checkpoints already saved in that folder. In my case, I changed Config.exp_name and trained from scratch, and there were no problems.
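To check for that situation before launching, a short script can list whatever is already in the checkpoint folder. This is a sketch under assumptions: the function name `existing_checkpoints` is hypothetical, and the checkpoint path is passed in explicitly because the thread does not say where the repository stores checkpoints.

```python
from pathlib import Path


def existing_checkpoints(checkpoint_dir):
    """Return the sorted names of entries found in checkpoint_dir.

    An empty list means the directory is missing or empty, so a run
    pointed at it would have nothing to resume from.
    """
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())
```

If the list is not empty, either pick a new Config.exp_name as described above or move the old checkpoints aside. The config dump earlier in this thread also lists `resume_from_checkpoint=True`, so binding it to False may be another option, though nobody here has confirmed that it avoids the error.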
When resuming training from a checkpoint, I get the following error:
2023-06-25 17:16:24: Error!
Traceback (most recent call last):
File "/home/zengxr/project/zipnerf_v2/train.py", line 387, in
app.run(main)
File "/home/zengxr/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/home/zengxr/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/home/zengxr/project/zipnerf_v2/train.py", line 254, in main
summary_writer.add_histogram('train_' + k, v, step)
File "/home/zengxr/anaconda3/envs/multinerf/lib/python3.9/site-packages/tensorboardX/writer.py", line 562, in add_histogram
histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
File "/home/zengxr/anaconda3/envs/multinerf/lib/python3.9/site-packages/tensorboardX/summary.py", line 209, in histogram
hist = make_histogram(values.astype(float), bins, max_bins)
File "/home/zengxr/anaconda3/envs/multinerf/lib/python3.9/site-packages/tensorboardX/summary.py", line 247, in make_histogram
raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
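In the logs that include the lines before the traceback, `NaN or Inf found in input tensor` is printed immediately before the `add_histogram` call fails, which suggests the logged values are non-finite. A guard like the one below would keep the logging step from crashing the run. It is a minimal sketch, not the repository's code: `finite_only` and `safe_add_histogram` are hypothetical names, `writer` is assumed to be any object with a tensorboardX-style `add_histogram(tag, values, step)` method, and `values` is assumed to be convertible to a NumPy array (for example, already moved to the CPU).

```python
import numpy as np


def finite_only(values):
    """Flatten values and drop every NaN and Inf entry."""
    arr = np.asarray(values, dtype=float).reshape(-1)
    return arr[np.isfinite(arr)]


def safe_add_histogram(writer, tag, values, step):
    """Log a histogram only when at least one finite value remains.

    Returns True if the histogram was written, False if it was skipped.
    """
    finite = finite_only(values)
    if finite.size == 0:
        # Nothing finite to plot; skip instead of raising inside the writer.
        return False
    writer.add_histogram(tag, finite, step)
    return True
```

This only hides the symptom in the logging code. If the statistics are NaN, the loss or gradients likely are too, so the stale-checkpoint cause described above still needs to be ruled out.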