fishaudio / fish-diffusion

An easy to understand TTS / SVS / SVC framework
https://diff.fish.audio
MIT License

Big error saving checkpoint #76

Closed dillfrescott closed 1 year ago

dillfrescott commented 1 year ago
138.519   Total estimated model params size (MB)
Epoch 105:  25%|███████████▊                                   | 5/20 [00:00<00:02,  6.81it/s, loss=0.0487, v_num=owzc]
C:\Users\micro\miniconda3\envs\fish\lib\site-packages\lightning_fabric\plugins\io\torch_io.py:61: UserWarning: Warning, `hyper_parameters` dropped from checkpoint. An attribute is not picklable: Can't pickle local object 'EvaluationLoop.advance.<locals>.batch_to_device'
  rank_zero_warn(f"Warning, `{key}` dropped from checkpoint. An attribute is not picklable: {err}")
Traceback (most recent call last):
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\lightning_fabric\plugins\io\torch_io.py", line 54, in save_checkpoint
    _atomic_save(checkpoint, path)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\lightning_fabric\utilities\cloud_io.py", line 67, in _atomic_save
    torch.save(checkpoint, bytesbuffer)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\torch\serialization.py", line 441, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\torch\serialization.py", line 653, in _save
    pickler.dump(obj)
AttributeError: Can't pickle local object 'EvaluationLoop.advance.<locals>.batch_to_device'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\torch\serialization.py", line 441, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\torch\serialization.py", line 668, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\micro\Downloads\fish-diffusion\tools\diffusion\train.py", line 98, in <module>
    trainer.fit(model, train_loader, valid_loader, ckpt_path=args.resume)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1112, in _run
    results = self._run_stage()
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1191, in _run_stage
    self._run_train()
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 229, in advance
    self.trainer._call_callback_hooks("on_train_batch_end", batch_end_outputs, batch, batch_idx)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1394, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 296, in on_train_batch_end
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 363, in _save_topk_checkpoint
    self._save_none_monitor_checkpoint(trainer, monitor_candidates)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 669, in _save_none_monitor_checkpoint
    self._save_checkpoint(trainer, filepath)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 366, in _save_checkpoint
    trainer.save_checkpoint(filepath, self.save_weights_only)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1939, in save_checkpoint
    self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\connectors\checkpoint_connector.py", line 511, in save_checkpoint
    self.trainer.strategy.save_checkpoint(_checkpoint, filepath, storage_options=storage_options)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\strategies\strategy.py", line 466, in save_checkpoint
    self.checkpoint_io.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\lightning_fabric\plugins\io\torch_io.py", line 62, in save_checkpoint
    _atomic_save(checkpoint, path)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\lightning_fabric\utilities\cloud_io.py", line 67, in _atomic_save
    torch.save(checkpoint, bytesbuffer)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\torch\serialization.py", line 440, in save
    with _open_zipfile_writer(f) as opened_zipfile:
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\torch\serialization.py", line 305, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at ..\caffe2\serialize\inline_container.cc:337] . unexpected pos 645876672 vs 645876560

Majboor commented 1 year ago

lol, this is a very common error called "Buy a good PC", or use a cloud GPU. Alternatively, you could look into Edge TPU hardware; it's pretty niche, but it doesn't require a good PC.

dillfrescott commented 1 year ago

I have a 4090, what are you talking about?

Majboor commented 1 year ago

'MemoryError' means you are using more memory than you have; maybe you are short on RAM. Try loading the data in small chunks instead of all at once.

leng-yue commented 1 year ago

It appears that there is an issue with your disk (or torch) which caused the saving process to fail. Unfortunately, we are unable to resolve this issue from our end.
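
One way to rule out a full disk before retrying is to check the free space on the drive that holds the checkpoint directory. The following is a minimal sketch using only the Python standard library; the helper name `has_free_space` and the 1 GiB threshold are assumptions chosen for illustration, not part of fish-diffusion or Lightning.

```python
import shutil
from pathlib import Path


def has_free_space(directory: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `directory` has at least
    `required_bytes` of free space (hypothetical helper for illustration)."""
    usage = shutil.disk_usage(Path(directory))
    return usage.free >= required_bytes


# The failed write in the traceback stopped around byte 645876672 (roughly 616 MiB),
# so asking for about 1 GiB of headroom is a conservative, arbitrary choice.
# Example: has_free_space(".", 1024**3)
```

If this returns False for the checkpoint directory, freeing disk space is the first thing to try. If it returns True, the `MemoryError` in the second traceback points instead at the process running out of host RAM while serializing the checkpoint.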

dillfrescott commented 1 year ago

Ah, okay
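
For reference, the first exception in the log ("Can't pickle local object") can be reproduced in isolation. According to the warning, Lightning dropped `hyper_parameters` for this reason and retried the save, so it is separate from the later `MemoryError`. The sketch below uses only the standard `pickle` module; `make_local_function` and `is_picklable` are hypothetical names for illustration.

```python
import pickle


def make_local_function():
    # A function defined inside another function is a "local object";
    # pickle cannot look it up by its qualified name at load time.
    def batch_to_device(batch):
        return batch

    return batch_to_device


def is_picklable(obj) -> bool:
    """Return True if `obj` can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # The exact exception type for local objects varies by Python version.
        return False
```

Calling `is_picklable(make_local_function())` returns False, while a plain dictionary such as `{"lr": 0.001}` is picklable.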