Mikubill / naifu

Train generative models with pytorch lightning
MIT License
294 stars 38 forks

checkpoint saving error #35

Open X-MAXXIX opened 1 month ago

X-MAXXIX commented 1 month ago

Thank you for developing this. I'm getting this error when saving checkpoints; I've attached the log below. The training process also seems to break roughly every 20 hours of running, for unknown reasons. Is there anything that can be done to improve this? tyty

Epoch 0: 16%|█▌ | 7395/45743 13:13:43<68:36:00, 6.44s/it, train_loss: 0.084, avg_loss: 0.080:
Traceback (most recent call last):
rank3: File "/workspace/naifu/trainer.py", line 58, in
rank3: File "/workspace/naifu/trainer.py", line 54, in main
rank3: Trainer(fabric, config).train_loop()
rank3: File "/workspace/naifu/common/trainer.py", line 316, in train_loop
rank3: File "/workspace/naifu/common/trainer.py", line 47, in on_post_training_batch
rank3: File "/workspace/naifu/common/trainer.py", line 167, in perform_sampling
rank3: os.makedirs(sampling_cfg.save_dir, exist_ok=True)
rank3: File "/usr/lib/python3.10/os.py", line 215, in makedirs
rank3: makedirs(head, exist_ok=exist_ok)
rank3: File "/usr/lib/python3.10/os.py", line 225, in makedirs
rank3: mkdir(name, mode)
rank3: OSError: [Errno 5] Input/output error: '/app/naifu555'
[rank: 3] Child process with PID 170 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟

rank0: Traceback (most recent call last):
rank0: File "/workspace/naifu/trainer.py", line 58, in
rank0: File "/workspace/naifu/trainer.py", line 54, in main
rank0: Trainer(fabric, config).train_loop()
rank0: File "/workspace/naifu/common/trainer.py", line 316, in train_loop
rank0: File "/workspace/naifu/common/trainer.py", line 47, in on_post_training_batch
rank0: File "/workspace/naifu/common/trainer.py", line 167, in perform_sampling
rank0: os.makedirs(sampling_cfg.save_dir, exist_ok=True)
rank0: File "/usr/lib/python3.10/os.py", line 215, in makedirs
rank0: makedirs(head, exist_ok=exist_ok)
rank0: File "/usr/lib/python3.10/os.py", line 225, in makedirs
rank0: mkdir(name, mode)
rank0: OSError: [Errno 5] Input/output error: '/app/naifu555'

Mikubill commented 1 month ago

You may need to specify an existing path for sample storage: https://github.com/Mikubill/naifu/blob/b753de32e9b434bd51266f1df4640cd71ecae938/config/train_sdxl.yaml#L71

Alternatively, disable sampling completely by setting sample.enabled = False: https://github.com/Mikubill/naifu/blob/b753de32e9b434bd51266f1df4640cd71ecae938/config/train_sdxl.yaml#L64
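
For anyone hitting the same OSError, a quick way to check a candidate sample path before a long run is to try creating it and writing a scratch file into it. The snippet below is only a sketch and not part of naifu: `ensure_writable_dir` is a hypothetical helper name, and it assumes the path you pass is the same one the config hands to `os.makedirs` in `perform_sampling`.

```python
import os
import tempfile


def ensure_writable_dir(path: str) -> bool:
    """Return True if `path` exists (or can be created) and accepts writes.

    Hypothetical helper for illustration; naifu does not ship this function.
    """
    try:
        # Same call that fails in the traceback above.
        os.makedirs(path, exist_ok=True)
        # A scratch file catches directories that exist but reject writes.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError as err:
        print(f"sample directory {path!r} is not usable: {err}")
        return False
```

Running this once against the intended sample directory before launching training shows whether the path is usable. If it returns False, pointing the config at a different directory or disabling sampling, as suggested above, avoids losing hours of training to a failed mkdir. It cannot rule out a storage mount that fails later in the run.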