ZachB100 / Piper-Training-Guide-with-Screen-Reader

A guide to help newcomers to the Piper TTS system create voices for NVDA and other screen readers down the line.
23 stars 1 forks source link

gpu out of memory wile starting training on colab #5

Open starkrush123 opened 3 months ago

starkrush123 commented 3 months ago

hi, so i want to train a voice to piper tts, using your colab notebook because i don't have a powerful enough gpu. So, when I run the training section cell, the output says something like out of memory thing. DEBUG:piper_train:Namespace(dataset_dir='/content/drive/MyDrive/colab/piper/test', checkpoint_epochs=5, quality='medium', resume_from_single_speaker_checkpoint=None, logger=True, enable_checkpointing=True, default_root_dir=None, gradient_clip_val=None, gradient_clip_algorithm=None, num_nodes=1, num_processes=None, devices='1', gpus=None, auto_select_gpus=False, tpu_cores=None, ipus=None, enable_progress_bar=True, overfit_batches=0.0, track_grad_norm=-1, check_val_every_n_epoch=1, fast_dev_run=False, accumulate_grad_batches=None, max_epochs=10000, min_epochs=None, max_steps=-1, min_steps=None, max_time=None, limit_train_batches=None, limit_val_batches=None, limit_test_batches=None, limit_predict_batches=None, val_check_interval=None, log_every_n_steps=1000, accelerator='gpu', strategy=None, sync_batchnorm=False, precision=32, enable_model_summary=True, weights_save_path=None, num_sanity_val_steps=2, resume_from_checkpoint='/content/pretrained.ckpt', profiler=None, benchmark=None, deterministic=None, reload_dataloaders_every_n_epochs=0, auto_lr_find=False, replace_sampler_ddp=True, detect_anomaly=False, auto_scale_batch_size=False, plugins=None, amp_backend='native', amp_level=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', batch_size=12, validation_split=0.01, num_test_examples=1, max_phoneme_ids=None, hidden_channels=192, inter_channels=192, filter_channels=768, n_layers=6, n_heads=2, seed=1234, num_ckpt=0, save_last=True) /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:52: LightningDeprecationWarning: Setting Trainer(resume_from_checkpoint=) is deprecated in v1.5 and will be removed in v1.7. Please pass Trainer.fit(ckpt_path=) directly instead. rank_zero_deprecation( GPU available: True (cuda), used: True TPU available: False, using: 0 TPU cores IPU available: False, using: 0 IPUs HPU available: False, using: 0 HPUs DEBUG:piper_train:Checkpoints will be saved every 5 epoch(s) DEBUG:piper_train:0 Checkpoints will be saved DEBUG:vits.dataset:Loading dataset: /content/drive/MyDrive/colab/piper/test/dataset.jsonl /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py:731: LightningDeprecationWarning: trainer.resume_from_checkpoint is deprecated in v1.5 and will be removed in v2.0. Specify the fit checkpoint path with trainer.fit(ckpt_path=) instead. ckpt_path = ckpt_path or self.resume_from_checkpoint Restoring states from the checkpoint path at /content/pretrained.ckpt DEBUG:fsspec.local:open file: /content/pretrained.ckpt /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py:1659: UserWarning: Be aware that when using ckpt_path, callbacks used to create the checkpoint need to be provided during Trainer instantiation. Please add the following callbacks: ["ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None}"]. rank_zero_warn( LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0] 2024-07-06 16:55:15.423415: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-07-06 16:55:15.423467: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-07-06 16:55:15.543051: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2024-07-06 16:55:15.788233: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. DEBUG:tensorflow:Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client. 2024-07-06 16:55:17.931486: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT DEBUG:h5py._conv:Creating converter from 7 to 5 DEBUG:h5py._conv:Creating converter from 5 to 7 DEBUG:h5py._conv:Creating converter from 7 to 5 DEBUG:h5py._conv:Creating converter from 5 to 7 DEBUG:jax._src.path:etils.epath found. Using etils.epath for file I/O. INFO:numexpr.utils:NumExpr defaulting to 2 threads. DEBUG:fsspec.local:open file: /content/drive/MyDrive/colab/piper/test/lightning_logs/version_11/hparams.yaml Restored all states from the checkpoint file at /content/pretrained.ckpt /usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/data.py:153: UserWarning: Total length of DataLoader across ranks is zero. Please make sure this was your intention. rank_zero_warn( /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py:1892: PossibleUserWarning: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=1000). Set a lower value for log_every_n_steps if you want to see logs for the training epoch. rank_zero_warn( /usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock. self.pid = os.fork() Traceback (most recent call last): File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/content/piper/src/python/piper_train/main.py", line 173, in main() File "/content/piper/src/python/piper_train/main.py", line 150, in main trainer.fit(model) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit self._call_and_handle_interrupt( File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt return trainer_fn(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl results = self._run(model, ckpt_path=self.ckpt_path) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run results = self._run_stage() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage return self._run_train() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train self.fit_loop.run() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 200, in run self.advance(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance self._outputs = self.epoch_loop.run(self._data_fetcher) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 200, in run self.advance(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance batch_output = self.batch_loop.run(kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 200, in run self.advance(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance outputs = self.optimizer_loop.run(optimizers, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 200, in run self.advance(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position]) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step self.trainer._call_lightning_module_hook( File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook output = fn(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/module.py", line 1705, in optimizer_step optimizer.step(closure=optimizer_closure) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 168, in step step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step return optimizer.step(closure=closure, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper return wrapped(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 140, in wrapper out = func(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 120, in step loss = closure() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 138, in _wrap_closure closure_result = closure() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in call self._result = self.closure(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 132, in closure step_output = self._step_fn() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 407, in _training_step training_step_output = self.trainer._call_strategy_hook("training_step", kwargs.values()) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook output = fn(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 358, in training_step return self.model.training_step(*args, kwargs) File "/content/piper/src/python/piper_train/vits/lightning.py", line 191, in training_step return self.training_step_g(batch) File "/content/piper/src/python/piper_train/vits/lightning.py", line 214, in training_step_g ) = self.model_g(x, x_lengths, spec, spec_lengths, speaker_ids) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, kwargs) File "/content/piper/src/python/piper_train/vits/models.py", line 619, in forward x, m_p, logs_p, x_mask = self.enc_p(x, x_lengths) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, kwargs) File "/content/piper/src/python/piper_train/vits/models.py", line 205, in forward x = self.encoder(x x_mask, x_mask) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(input, kwargs) File "/content/piper/src/python/piper_train/vits/attentions.py", line 66, in forward y = attn_layer(x, x, attn_mask) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, kwargs) File "/content/piper/src/python/piper_train/vits/attentions.py", line 220, in forward x, self.attn = self.attention(q, k, v, mask=attn_mask) File "/content/piper/src/python/piper_train/vits/attentions.py", line 238, in attention rel_logits = self._matmul_with_relative_keys( File "/content/piper/src/python/piper_train/vits/attentions.py", line 289, in _matmul_with_relative_keys ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1)) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.54 GiB (GPU 0; 14.75 GiB total capacity; 7.94 GiB already allocated; 6.41 GiB free; 7.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Is this a problem with colab, my settings are not correct, or a problem with the notebook?

rmcpantoja commented 3 months ago

Hi @starkrush123, What was your batch size and dataset size? If that happens, decrease batch size, run the step 4 again, and re-run train cell.