aik2mlj / polyffusion

Polyffusion: A Diffusion Model for Polyphonic Score Generation with Internal and External Controls
https://polyffusion.github.io
MIT License

DataLoader worker is killed by signal #2

Closed · taktak1 closed this 9 months ago

taktak1 commented 11 months ago

The training code polyffusion/main.py does not run as expected. I trained on the POP909 dataset as described in the README, but a "DataLoader worker is killed by signal" error occurs. I verified it several times, and it always seems to happen at iteration 1034.

Epoch 0:  29% 1034/3597 [51:02<2:06:30,  2.96s/it]   
Traceback (most recent call last):
  File "polyffusion/polyffusion/main.py", line 71, in <module>
    config.train()
  File "polyffusion/polyffusion/train/__init__.py", line 49, in train
    learner.train(max_epoch=self.params.max_epoch)
  File "polyffusion/polyffusion/learner.py", line 137, in train
    losses, scheduled_params = self.train_step(batch)
  File "polyffusion/polyffusion/learner.py", line 200, in train_step
    loss_dict = self.model.get_loss_dict(batch, self.step)
  File "polyffusion/polyffusion/models/model_sdf.py", line 189, in get_loss_dict
    cond = self._encode_chord(chord)
  File "polyffusion/polyffusion/models/model_sdf.py", line 95, in _encode_chord
    z = self.chord_enc(chord).mean
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2978) is killed by signal: Killed. 

@aik2mlj

aik2mlj commented 11 months ago

Hi! I guess there was an out-of-memory (OOM) condition in your system, so the data worker was killed by the system. Have you checked the GPU memory usage while training the model? Refer to https://github.com/pytorch/pytorch/issues/8976#issuecomment-803499394
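If it is indeed an OOM, a common mitigation is to reduce the DataLoader's `num_workers` and batch size so the data pipeline holds less memory. Below is a minimal sketch (not the polyffusion data pipeline; the dataset here is a stand-in) showing the knobs that usually matter:

```python
# Minimal sketch, not polyffusion's actual loader: settings that typically
# reduce memory pressure when DataLoader workers are killed by the OOM killer.
import torch
from torch.utils.data import DataLoader, TensorDataset

dummy_dataset = TensorDataset(torch.zeros(64, 128))  # stand-in for the POP909 dataset

loader = DataLoader(
    dummy_dataset,
    batch_size=16,             # smaller batches lower peak memory per step
    num_workers=2,             # fewer worker processes -> less RAM duplicated across workers
    pin_memory=False,          # pinned host buffers also count against system RAM
    persistent_workers=False,  # release worker memory between epochs
)

for (batch,) in loader:
    pass  # the training step would go here
```

Watching `nvidia-smi` (GPU memory) and `free -h` / `dmesg` (host RAM and the OOM killer) during training also helps confirm which memory pool is being exhausted.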

taktak1 commented 11 months ago

I got it. Thank you very much.