Training error when run SparseGS

ZiyangYan commented 1 month ago

Every time I try to train a scene, it fails with an error at iteration 20000, as shown below:

Optimizing output/kitchen_24_SparseGS
Output folder: output/kitchen_24_SparseGS [14/09 04:53:08]
Tensorboard not available: not logging progress [14/09 04:53:08]
Reading camera 24/24
Loading Training Cameras [14/09 04:53:08]
Loading Test Cameras [14/09 04:53:13]
Number of points at initialisation : 3143 [14/09 04:53:13]
Training progress: 0%| | 0/30000 [00:00<?, ?it/s][20000] [14/09 04:53:14]
Cannot initialize model with low cpu memory usage because accelerate was not found in the environment. Defaulting to low_cpu_mem_usage=False. It is strongly recommended to install accelerate for faster and less memory-intense model loading. You can do so with:

pip install accelerate

.
/root/miniconda3/envs/sparsegs-test/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:06<00:00, 1.09s/it]
[INFO] loaded SD! [14/09 04:53:22]3%|████████████████████████████████████████████████████████████████████████████████████▏ | 5/6 [00:06<00:01, 1.34s/it]
Training progress: 17%|██▉ | 4990/30000 [04:35<21:37, 19.28it/s, EMA Loss=0.5989460, Total Loss=0.5922824, Local_Depth=0.2407825, Global Depth=0.3547341]
Warping 0 to -0.025 dp min: 0.0 dp max: 21.14868 [14/09 04:57:50]
Training progress: 17%|██▉ | 4990/30000 [04:50<21:37, 19.28it/s, EMA Loss=0.5989460, Total Loss=0.5922824, Local_Depth=0.2407825, Global Depth=0.3547341]
Warping 1 to 0.975 dp min: 0.0 dp max: 12.45852 [14/09 04:58:06]
Warping 2 to 1.975 dp min: 0.0 dp max: 19.36399 [14/09 04:58:23]
Warping 3 to 2.975 dp min: 0.0 dp max: 11.4649 [14/09 04:58:40]
Warping 4 to 3.975 dp min: 0.0 dp max: 24.43324 [14/09 04:58:56]
Warping 5 to 4.975 dp min: 0.0 dp max: 13.05872 [14/09 04:59:13]
Warping 6 to 5.975 dp min: 0.0 dp max: 24.84434 [14/09 04:59:30]
Warping 7 to 6.975 dp min: 0.0 dp max: 24.36036 [14/09 04:59:47]
Warping 8 to 7.975 dp min: 0.0 dp max: 10.88568 [14/09 05:00:04]
Warping 9 to 8.975 dp min: 0.0 dp max: 24.6271 [14/09 05:00:20]
Warping 10 to 9.975 dp min: 0.0 dp max: 19.45285 [14/09 05:00:37]
Warping 11 to 10.975 dp min: 0.0 dp max: 23.86458 [14/09 05:00:54]
Warping 12 to 11.975 dp min: 0.0 dp max: 20.08919 [14/09 05:01:10]
Warping 13 to 12.975 dp min: 0.0 dp max: 17.28836 [14/09 05:01:27]
Warping 14 to 13.975 dp min: 0.0 dp max: 24.52304 [14/09 05:01:44]
Warping 15 to 14.975 dp min: 0.0 dp max: 19.53295 [14/09 05:02:01]
Warping 16 to 15.975 dp min: 2.90754 dp max: 10.89158 [14/09 05:02:17]
Warping 17 to 16.975 dp min: 0.0 dp max: 25.35812 [14/09 05:02:34]
Warping 18 to 17.975 dp min: 0.0 dp max: 21.20248 [14/09 05:02:51]
Warping 19 to 18.975 dp min: 0.0 dp max: 22.55221 [14/09 05:03:08]
Warping 20 to 19.975 dp min: 0.0 dp max: 11.14985 [14/09 05:03:25]
Warping 21 to 20.975 dp min: 0.0 dp max: 17.46382 [14/09 05:03:41]
Warping 22 to 21.975 dp min: 0.0 dp max: 23.49506 [14/09 05:03:58]
Warping 23 to 22.975 dp min: 0.0 dp max: 13.5775 [14/09 05:04:15]
Warping 0 to 0.025 dp min: 0.0 dp max: 21.14868 [14/09 05:04:31]
Warping 1 to 1.025 dp min: 0.0 dp max: 12.45852 [14/09 05:04:48]
Warping 2 to 2.025 dp min: 0.0 dp max: 19.36399 [14/09 05:05:05]
Warping 3 to 3.025 dp min: 0.0 dp max: 11.4649 [14/09 05:05:21]
Warping 4 to 4.025 dp min: 0.0 dp max: 24.43324 [14/09 05:05:38]
Warping 5 to 5.025 dp min: 0.0 dp max: 13.05872 [14/09 05:05:55]
Warping 6 to 6.025 dp min: 0.0 dp max: 24.84434 [14/09 05:06:12]
Warping 7 to 7.025 dp min: 0.0 dp max: 24.36036 [14/09 05:06:28]
Warping 8 to 8.025 dp min: 0.0 dp max: 10.88568 [14/09 05:06:45]
Warping 9 to 9.025 dp min: 0.0 dp max: 24.6271 [14/09 05:07:02]
Warping 10 to 10.025 dp min: 0.0 dp max: 19.45285 [14/09 05:07:19]
Warping 11 to 11.025 dp min: 0.0 dp max: 23.86458 [14/09 05:07:35]
Warping 12 to 12.025 dp min: 0.0 dp max: 20.08919 [14/09 05:07:52]
Warping 13 to 13.025 dp min: 0.0 dp max: 17.28836 [14/09 05:08:09]
Warping 14 to 14.025 dp min: 0.0 dp max: 24.52304 [14/09 05:08:26]
Warping 15 to 15.025 dp min: 0.0 dp max: 19.53295 [14/09 05:08:43]
Warping 16 to 16.025 dp min: 2.90754 dp max: 10.89158 [14/09 05:08:59]
Warping 17 to 17.025 dp min: 0.0 dp max: 25.35812 [14/09 05:09:16]
Warping 18 to 18.025 dp min: 0.0 dp max: 21.20248 [14/09 05:09:33]
Warping 19 to 19.025 dp min: 0.0 dp max: 22.55221 [14/09 05:09:50]
Warping 20 to 20.025 dp min: 0.0 dp max: 11.14985 [14/09 05:10:07]
Warping 21 to 21.025 dp min: 0.0 dp max: 17.46382 [14/09 05:10:23]
Warping 22 to 22.025 dp min: 0.0 dp max: 23.49506 [14/09 05:10:40]
Warping 23 to 23.025 dp min: 0.0 dp max: 13.5775 [14/09 05:10:57]
Training progress: 23%|███████████▉ | 7000/30000 [19:03<10:44, 35.71it/s, EMA Loss=nan, Total Loss=0.7225331, Warp Reg=1.2969884]
[ITER 7000] Evaluating train: L1 0.4723234355449677 PSNR 5.808302021026612 [14/09 05:12:17]

[ITER 7000] Saving Gaussians [14/09 05:12:17]
Training progress: 67%|███████████████▎ | 20000/30000 [25:24<04:48, 34.62it/s, EMA Loss=nan, Total Loss=0.7643074, Local_Depth=0.9999999, Global Depth=1.0000000]
Traceback (most recent call last):
  File "/mnt/share_disk/SparseGS/train.py", line 459, in <module>
    training(dataset, op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from, args.step, args.max_cameras, args.prune_sched)
  File "/mnt/share_disk/SparseGS/train.py", line 244, in training
    prune_floaters(scene.getTrainCameras().copy(), gaussians, pipe, background, dataset, iteration)
  File "/mnt/share_disk/SparseGS/train.py", line 307, in prune_floaters
    dips.append(diptest.dipstat(diff[diff > 0].cpu().numpy()))
  File "/root/miniconda3/envs/sparsegs-test/lib/python3.10/site-packages/diptest/diptest.py", line 80, in dipstat
    return float(_diptest.diptest(x, allow_zero, debug))
RuntimeError: N must be >= 1.
Training progress: 67%|███████████████▎ | 20000/30000 [25:25<12:42, 13.11it/s, EMA Loss=nan, Total Loss=0.7643074, Local_Depth=0.9999999, Global Depth=1.0000000]

ZiyangYan commented 1 month ago

It seems that diff[diff > 0] is empty when it is passed to diptest.dipstat(diff[diff > 0].cpu().numpy()), which raises the error. I ran the experiment on different datasets and got the same result. I also tried two machines, with CUDA toolkit 12.1 and 11.3 respectively.
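To illustrate the failure mode, here is a minimal sketch of a guard around the dip test. It assumes the error comes from an empty input after the `diff > 0` filter, which is what the traceback suggests but is not confirmed against the SparseGS source. `safe_dipstat` and `fake_dipstat` are hypothetical names; `fake_dipstat` only stands in for `diptest.dipstat` so the snippet runs without the package, and plain Python lists stand in for the tensor used in train.py.

```python
from typing import Callable, Optional, Sequence


def safe_dipstat(
    diff: Sequence[float],
    dipstat_fn: Callable[[Sequence[float]], float],
) -> Optional[float]:
    """Run dipstat_fn on the positive entries of diff, or return None if there are none."""
    # NaN > 0 is False, so NaN entries are dropped by this filter as well.
    positive = [float(v) for v in diff if v > 0]
    if not positive:
        # Nothing survived the filter; an empty input is presumably what
        # triggers "RuntimeError: N must be >= 1." in the real dip test.
        return None
    return dipstat_fn(positive)


def fake_dipstat(values: Sequence[float]) -> float:
    """Hypothetical stand-in for diptest.dipstat that mimics the empty-input error."""
    if len(values) < 1:
        raise RuntimeError("N must be >= 1.")
    return 0.0


all_zero = [0.0, 0.0, 0.0]
all_nan = [float("nan")] * 3
some_positive = [0.0, 0.3, 1.2]

skipped_zero = safe_dipstat(all_zero, fake_dipstat)    # None: no positive entries
skipped_nan = safe_dipstat(all_nan, fake_dipstat)      # None: NaNs fail the comparison
computed = safe_dipstat(some_positive, fake_dipstat)   # value returned by the stand-in
```

A guard like this only avoids the crash. If the depth differences are all zero or NaN, the underlying training run has most likely already gone wrong, so skipping the dip test would hide the symptom rather than fix it.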

ZiyangYan commented 1 month ago

At the same time, I found that the depths output by the training pipeline are very bad. I used the same checkpoints you mentioned to extract the GT depth, though, so I don't think that is the problem.

ForMyCat commented 1 month ago

It seems like your model is diverging. The dip test throws an error if there are not enough Gaussians left when pruning happens. I recommend enabling only the depth loss first to see whether you can get a well-trained model.
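The log above is consistent with divergence: from iteration 7000 onward the progress bar shows EMA Loss=nan while Total Loss is still finite. A running average of this kind stays NaN forever after a single non-finite loss, so the first bad step can be long gone by the time the crash appears. The following is a minimal sketch of that behaviour, assuming the EMA is updated as `alpha * loss + (1 - alpha) * ema`. The weight 0.4, the helper names, and the loss values are illustrative assumptions, not taken from SparseGS's train.py.

```python
import math


def update_ema(ema: float, loss: float, alpha: float = 0.4) -> float:
    """Exponential moving average: alpha * new value + (1 - alpha) * previous EMA."""
    return alpha * loss + (1.0 - alpha) * ema


def first_non_finite(losses):
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            return i
    return None


# Hypothetical loss trace: one NaN step in the middle, finite values after it.
losses = [0.60, 0.59, float("nan"), 0.72, 0.76]

ema = 0.0
for loss in losses:
    ema = update_ema(ema, loss)

ema_is_nan = math.isnan(ema)         # the EMA never recovers from the NaN
bad_step = first_non_finite(losses)  # index of the step that poisoned it
```

Checking `math.isfinite(loss)` every iteration and stopping at the first failure makes it easier to see which loss term went bad, well before the pruning step at iteration 20000.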

ForMyCat / SparseGS

Training error when run SparseGS #22