SuLvXiangXin / zipnerf-pytorch

Unofficial implementation of ZipNeRF
Apache License 2.0

Warnings and _pickle.UnpicklingError: pickle data was truncated #52

Open karensylee opened 1 year ago

karensylee commented 1 year ago

I'm trying to train on the bicycle dataset. I ran into issues similar to those described in these posts: https://github.com/SuLvXiangXin/zipnerf-pytorch/issues/49#issuecomment-1615760166 and https://github.com/SuLvXiangXin/zipnerf-pytorch/issues/27#issuecomment-1614872248
I applied the suggested fixes but now I have a new error: "_pickle.UnpicklingError: pickle data was truncated"

(zipnerf) PS C:\zipnerf-pytorch> accelerate launch train.py --gin_configs=configs/360.gin --gin_bindings="Config.data_dir = 'data/bicycle'" --gin_bindings="Config.exp_name = 'exp/360_v2/bicycle'" --gin_bindings="Config.render_chunk_size = 8192" --gin_bindings="Config.batch_size = 8192"
2023-07-02 18:31:31: Config(dataset_loader='llff', batching='all_images', batch_size=8192, patch_size=1, factor=4, multiscale=False, multiscale_levels=4, forward_facing=False, render_path=False, llffhold=8, llff_use_all_images_for_training=False, llff_use_all_images_for_testing=False, use_tiffs=False, compute_disp_metrics=False, compute_normal_metrics=False, disable_multiscale_loss=False, randomized=True, near=0.2, far=1000000.0, exp_name='exp/360_v2/bicycle', data_dir='data/bicycle', vocab_tree_path=None, render_chunk_size=8192, num_showcase_images=5, deterministic_showcase=True, vis_num_rays=16, vis_decimate=0, max_steps=25000, early_exit_steps=None, checkpoint_every=5000, resume_from_checkpoint=True, checkpoints_total_limit=1, gradient_scaling=False, print_every=100, train_render_every=500, data_loss_type='charb', charb_padding=0.001, data_loss_mult=1.0, data_coarse_loss_mult=0.0, interlevel_loss_mult=0.0, anti_interlevel_loss_mult=0.01, orientation_loss_mult=0.0, orientation_coarse_loss_mult=0.0, orientation_loss_target='normals_pred', predicted_normal_loss_mult=0.0, predicted_normal_coarse_loss_mult=0.0, hash_decay_mults=0.1, lr_init=0.01, lr_final=0.001, lr_delay_steps=5000, lr_delay_mult=1e-08, adam_beta1=0.9, adam_beta2=0.99, adam_eps=1e-15, grad_max_norm=0.0, grad_max_val=0.0, distortion_loss_mult=0.005, opacity_loss_mult=0.0, eval_only_once=True, eval_save_output=True, eval_save_ray_data=False, eval_render_interval=1, eval_dataset_limit=2147483647, eval_quantize_metrics=True, eval_crop_borders=0, render_video_fps=60, render_video_crf=18, render_path_frames=120, z_variation=0.0, z_phase=0.0, render_dist_percentile=0.5, render_dist_curve_fn=<ufunc 'log'>, render_path_file=None, render_resolution=None, render_focal=None, render_camtype=None, render_spherical=False, rosure=False, rawnerf_mode=False, exposure_percentile=97.0, num_border_pixels_to_mask=0, apply_bayer_mask=False, autoexpose_renders=False, eval_raw_affine_cc=False, zero_glo=False, valid_weight_thresh=0.05, isosurface_threshold=20, mesh_voxels=134217728, visibility_resolution=512, mesh_radius=1.0, mesh_max_radius=10.0, std_value=0.0, compute_visibility=False, extract_visibility=True, decimate_target=-1, vertex_color=True, vertex_projection=True, tsdf_radius=2.0, tsdf_resolution=512, truncation_margin=5.0, tsdf_max_radius=10.0)
2023-07-02 18:31:31: Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

Warning: image_path not found for reconstruction
C:\zipnerf-pytorch\internal\datasets.py:567: RuntimeWarning: invalid value encountered in matmul
  pixtocam = pixtocam @ np.diag([factor, factor, 1.])
Warning: image_path not found for reconstruction
C:\Users\Karen\anaconda3\envs\zipnerf\lib\site-packages\torch\utils\data\dataloader.py:560: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4 (`cpuset` is not taken into account), which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
C:\Users\Karen\anaconda3\envs\zipnerf\lib\site-packages\torch\utils\data\dataloader.py:560: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4 (`cpuset` is not taken into account), which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
2023-07-02 18:31:46: Checkpoint does not exist. Starting a new training run.
2023-07-02 18:32:02: Number of parameters being optimized: 77622581
2023-07-02 18:32:02: Begin training...
Training:   0%|          | 0/25000 [00:00<?, ?it/s]
Training:   0%|          | 0/25000 [01:07<?, ?it/s]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
2023-07-02 18:33:10: Error!
Traceback (most recent call last):
  File "C:\zipnerf-pytorch\train.py", line 387, in <module>
    app.run(main)
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\site-packages\absl\app.py", line 308, in run
    _run_main(main, args)
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\site-packages\absl\app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "C:\zipnerf-pytorch\train.py", line 144, in main
    batch = next(dataiter)
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\site-packages\accelerate\data_loader.py", line 374, in __iter__
    dataloader_iter = super().__iter__()
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\site-packages\torch\utils\data\dataloader.py", line 436, in __iter__
    self._iterator = self._get_iterator()
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\site-packages\torch\utils\data\dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\site-packages\torch\utils\data\dataloader.py", line 1042, in __init__
    w.start()
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
    ForkingPickler(file, protocol).dump(obj)
OSError: [Errno 22] Invalid argument
Traceback (most recent call last):
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Karen\anaconda3\envs\zipnerf\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\site-packages\accelerate\commands\launch.py", line 941, in launch_command
    simple_launcher(args)
  File "C:\Users\Karen\anaconda3\envs\zipnerf\lib\site-packages\accelerate\commands\launch.py", line 603, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\Karen\\anaconda3\\envs\\zipnerf\\python.exe', 'train.py', '--gin_configs=configs/360.gin', "--gin_bindings=Config.data_dir = 'data/bicycle'", "--gin_bindings=Config.exp_name = 'exp/360_v2/bicycle'", '--gin_bindings=Config.render_chunk_size = 8192', '--gin_bindings=Config.batch_size = 8192']' returned non-zero exit status 1.
SuLvXiangXin commented 1 year ago

    C:\zipnerf-pytorch\internal\datasets.py:567: RuntimeWarning: invalid value encountered in matmul

Maybe something is wrong with your data.
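
For example, a quick sanity check like the sketch below (untested; the images_{factor} folder layout is the usual LLFF convention for factor=4, and pixtocam here is only a placeholder for the matrix the loader builds from the COLMAP model) can tell you whether the downsampled images are present and whether the intrinsics are finite before the rescaling line that raises the warning:

    import os
    import numpy as np

    data_dir = "data/bicycle"
    factor = 4

    # The 360 config with factor=4 expects pre-downsampled images in images_4/.
    for d in (os.path.join(data_dir, "images"), os.path.join(data_dir, f"images_{factor}")):
        n = len(os.listdir(d)) if os.path.isdir(d) else 0
        print(f"{d}: {n} images")

    # "invalid value encountered in matmul" means the intrinsics already hold
    # NaN/inf when datasets.py rescales them, so check them before that line:
    pixtocam = np.eye(3)  # placeholder; in the loader this comes from the COLMAP reconstruction
    assert np.all(np.isfinite(pixtocam)), "intrinsics contain NaN/inf - re-check the COLMAP output"
    pixtocam = pixtocam @ np.diag([factor, factor, 1.0])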

KotaYonezawa commented 1 year ago

Hi all, it looks like a Windows PC was used for training here. I'm using a Windows PC too and ran into a similar error (ForkingPickler(file, protocol).dump(obj) raising OSError: [Errno 22] Invalid argument).

I searched around and it appears to be a PyTorch DataLoader issue that only shows up on Windows: https://hashicco.hatenablog.com/entry/2023/03/07/224638 and https://github.com/pytorch/pytorch/issues/12831. The likely cause is that Windows starts DataLoader workers with the spawn method, so the whole dataset object is pickled and sent to each worker; with a large in-memory dataset that transfer can fail, which matches the truncated-pickle and invalid-argument errors above.

To avoid the error, I modified train.py a little:

    dataloader = torch.utils.data.DataLoader(np.arange(len(dataset)),
                                             num_workers=0,#8,
                                             shuffle=True,
                                             batch_size=1,
                                             collate_fn=dataset.collate_fn,
                                             persistent_workers=False,#True,
                                             )
    test_dataloader = torch.utils.data.DataLoader(np.arange(len(test_dataset)),
                                                  num_workers=0,#4,
                                                  shuffle=False,
                                                  batch_size=1,
                                                  persistent_workers=False,#True,
                                                  collate_fn=test_dataset.collate_fn,
                                                  )

As shown above, set num_workers to 0 and persistent_workers to False. After that, I was able to run training on my Windows PC.
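
If you would rather keep parallel loading on non-Windows machines, one possible variant (an untested sketch of my own, not something from the repo; it assumes the torch, np, and dataset names already in scope in train.py) is to choose the worker count per platform and only enable persistent_workers when there are workers:

    import os
    import platform

    # 0 workers on Windows avoids the spawn/pickling failure; elsewhere keep up
    # to 8 workers. persistent_workers must be False when num_workers == 0.
    num_workers = 0 if platform.system() == "Windows" else min(8, os.cpu_count() or 1)

    dataloader = torch.utils.data.DataLoader(np.arange(len(dataset)),
                                             num_workers=num_workers,
                                             shuffle=True,
                                             batch_size=1,
                                             collate_fn=dataset.collate_fn,
                                             persistent_workers=num_workers > 0,
                                             )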

I hope this information helps you...