FileNotFoundError and ZeroDivisionError during training

YanMinMacMaster commented 3 months ago

During audio pre-processing I used DeepSpeech. The data folder contained only one file ending in '.wav' (I was using the Macron video). It is called aud.wav, and I preprocessed it. On the first training attempt, the terminal displayed "aud_ds.npy not found", so I renamed aud.wav and aud.npy to use the name aud_ds. After that, the errors in the title appeared. The output is:

(talking_gaussian) min@min-US-Desktop-Aegis-RS:~/Documents/TalkingGaussian$ bash scripts/train_xx.sh data/macron output/marcron 0
Optimizing output/marcron
Output folder: output/marcron [06/08 23:43:47]
Found transforms_train.json file, assuming Blender data set! [06/08 23:43:47]
Reading Training Transforms [06/08 23:43:47]
7938it [00:01, 4091.06it/s]
4417it [01:09, 1.65s/it]scripts/train_xx.sh: line 8: 70450 Killed python train_mouth.py -s $dataset -m $workspace --audio_extractor $audio_extractor
Optimizing output/marcron
Output folder: output/marcron [06/08 23:45:04]
Found transforms_train.json file, assuming Blender data set! [06/08 23:45:05]
Reading Training Transforms [06/08 23:45:05]
7938it [00:01, 4123.37it/s]
5159it [01:35, 2.41s/it]scripts/train_xx.sh: line 9: 70902 Killed python train_face.py -s $dataset -m $workspace --init_num 2000 --densify_grad_threshold 0.0005 --audio_extractor $audio_extractor
Optimizing output/marcron
Output folder: output/marcron [06/08 23:47:08]
Found transforms_train.json file, assuming Blender data set! [06/08 23:47:09]
Reading Training Transforms [06/08 23:47:09]
7938it [00:01, 4124.96it/s]
5198it [01:32, 1.05it/s]scripts/train_xx.sh: line 10: 71034 Killed python train_fuse.py -s $dataset -m $workspace --opacity_lr 0.001 --audio_extractor $audio_extractor
Looking for config file in output/marcron/cfg_args
Config file found: output/marcron/cfg_args
Rendering output/marcron
Found transforms_train.json file, assuming Blender data set! [06/08 23:48:45]
Reading Test Transforms [06/08 23:48:45]
794it [00:00, 3958.49it/s]
794it [00:09, 83.91it/s]
Generating random point cloud (10000)... [06/08 23:48:55]
Loading Training Cameras [06/08 23:48:55]
Loading Test Cameras [06/08 23:48:56]
Number of points at initialisation : 10000 [06/08 23:48:57]
Traceback (most recent call last):
  File "synthesize_fuse.py", line 125, in <module>
    render_sets(model.extract(args), args.iteration, pipeline.extract(args), args.use_train, args.fast, args.dilate)
  File "synthesize_fuse.py", line 93, in render_sets
    (model_params, motion_params, model_mouth_params, motion_mouth_params) = torch.load(os.path.join(dataset.model_path, "chkpnt_fuse_latest.pth"))
  File "/home/min/anaconda3/envs/talking_gaussian/lib/python3.7/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/min/anaconda3/envs/talking_gaussian/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/min/anaconda3/envs/talking_gaussian/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'output/marcron/chkpnt_fuse_latest.pth'
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/home/min/anaconda3/envs/talking_gaussian/lib/python3.7/site-packages/torchvision/models/_utils.py:209: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  f"The parameter '{pretrained_param}' is deprecated since 0.13 and will be removed in 0.15, "
/home/min/anaconda3/envs/talking_gaussian/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing weights=AlexNet_Weights.IMAGENET1K_V1. You can also use weights=AlexNet_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
Loading model from: /home/min/anaconda3/envs/talking_gaussian/lib/python3.7/site-packages/lpips/weights/v0.1/alex.pth
Traceback (most recent call last):
  File "metrics.py", line 215, in <module>
    print(lmd_meter.report())
  File "metrics.py", line 102, in report
    return f'LMD ({self.backend}) = {self.measure():.6f}'
  File "metrics.py", line 96, in measure
    return self.V / self.N
ZeroDivisionError: division by zero

Fictionarry commented 3 months ago

The problem seems to be at this line:

4417it [01:09, 1.65s/it]scripts/train_xx.sh: line 8: 70450 Killed python train_mouth.py -s $dataset -m $workspace --audio_extractor $audio_extractor

The training process was killed, so no checkpoint was written and the later rendering and metrics steps had nothing to work with.
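On Linux, a bare "Killed" printed by the shell usually means the process received SIGKILL, which is also what the kernel's out-of-memory killer sends. Below is a minimal Python sketch of how a wrapper could detect that and stop before the later stages run. The `run_stage` helper is hypothetical and is not part of scripts/train_xx.sh; the demo command is only a stand-in for a training stage.

```python
import signal
import subprocess
import sys


def run_stage(cmd):
    """Run one pipeline stage; raise if it failed or was killed by a signal."""
    result = subprocess.run(cmd)
    if result.returncode < 0:
        # On POSIX, a return code of -N means the child died from signal N.
        sig = signal.Signals(-result.returncode)
        raise RuntimeError(f"stage {cmd!r} was killed by {sig.name}")
    if result.returncode != 0:
        raise RuntimeError(f"stage {cmd!r} exited with code {result.returncode}")


if __name__ == "__main__":
    # Hypothetical stand-in for a training stage that gets SIGKILL.
    killed_stage = [
        sys.executable,
        "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGKILL)",
    ]
    try:
        run_stage(killed_stage)
    except RuntimeError as err:
        # Stop here instead of moving on to rendering and metrics.
        print(err)
```

Stopping at the first killed stage would surface the real cause instead of the FileNotFoundError and ZeroDivisionError that only appear later.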

My guess is that your computer does not have enough memory to pre-load all the training data. If I remember correctly, training on Macron may need 40 GB or more of memory for pre-loading data. You could change the code so that data is loaded into memory only when it is used.
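As a rough sanity check on that figure, here is a back-of-the-envelope estimate in Python. The frame count (7938) comes from the log above; the 512x512 resolution, 3 channels, float32 storage, and one image plus one background array per frame are all assumptions for illustration, not values confirmed in this thread.

```python
GIB = 1024 ** 3


def preload_bytes(num_frames, height, width, channels=3,
                  bytes_per_value=4, arrays_per_frame=1):
    """Bytes needed to hold every frame in memory as dense arrays."""
    return (num_frames * height * width * channels
            * bytes_per_value * arrays_per_frame)


frames = 7938  # training frames reported in the log

# Assumed, not confirmed: 512x512 RGB frames stored as float32.
images_only = preload_bytes(frames, 512, 512)
images_and_backgrounds = preload_bytes(frames, 512, 512, arrays_per_frame=2)

print(f"images only:            {images_only / GIB:.1f} GiB")
print(f"images and backgrounds: {images_and_backgrounds / GIB:.1f} GiB")
```

Under these assumptions the two arrays together land in the same range as the 40 GB estimate, before counting anything else the process holds.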

YanMinMacMaster commented 3 months ago

Thanks, I will give it a try.

KelvinHuang66 commented 3 months ago

I have the same problem. Is there a solution? Two 3090s, 48 GB in total, still cannot run it.

Fictionarry commented 3 months ago

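> I have the same problem. Is there a solution? Two 3090s, 48 GB in total, still cannot run it.

This is a system memory (RAM) problem and has nothing to do with GPU memory. To reduce the memory requirement, the images and backgrounds that are currently pre-loaded into memory need to be read from disk at the time they are used instead.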
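A minimal sketch of that idea in Python, assuming one image file and one background file per frame. The `LazyFrames` class and its raw-bytes reader are hypothetical placeholders and are not the repository's actual loader; a real version would decode the image inside `_read`.

```python
import os
import tempfile


class LazyFrames:
    """Keeps only file paths in memory; reads each file from disk on access."""

    def __init__(self, image_paths, background_paths):
        if len(image_paths) != len(background_paths):
            raise ValueError("need exactly one background per image")
        self.image_paths = list(image_paths)
        self.background_paths = list(background_paths)

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        # Disk I/O happens here, per access, instead of once up front.
        image = self._read(self.image_paths[index])
        background = self._read(self.background_paths[index])
        return image, background

    @staticmethod
    def _read(path):
        # Placeholder: returns raw bytes; decode the image here in real code.
        with open(path, "rb") as f:
            return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        img = os.path.join(tmp, "0.img")
        bg = os.path.join(tmp, "0.bg")
        for path, payload in ((img, b"image"), (bg, b"background")):
            with open(path, "wb") as f:
                f.write(payload)
        frames = LazyFrames([img], [bg])  # nothing is read yet
        print(len(frames), frames[0])     # files are read only now
```

The trade-off is extra disk reads on every access, so training may run slower per iteration in exchange for a much smaller resident memory footprint.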

Fictionarry / TalkingGaussian

FileNotFoundError and ZeroDivisionError during training #25