Open X-MAXXIX opened 1 month ago
You may need to specify an existing path for sample storage (https://github.com/Mikubill/naifu/blob/b753de32e9b434bd51266f1df4640cd71ecae938/config/train_sdxl.yaml#L71), or disable sampling completely by setting sample.enabled = False (https://github.com/Mikubill/naifu/blob/b753de32e9b434bd51266f1df4640cd71ecae938/config/train_sdxl.yaml#L64).
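If it helps, here is a minimal sketch of a pre-flight check that could be run before a long training job to confirm the sample directory is usable. The helper name `ensure_writable_dir` is hypothetical and not part of naifu; it only assumes that the configured sample path is a plain filesystem directory.

```python
import os
import tempfile


def ensure_writable_dir(path: str) -> bool:
    """Return True if `path` exists (or can be created) and accepts writes.

    Hypothetical helper, not part of naifu.
    """
    try:
        # Same call that fails in the traceback: create the directory if missing.
        os.makedirs(path, exist_ok=True)
        # Create and delete a throwaway file to confirm the mount accepts writes.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError as exc:
        # Covers permission, missing-parent, and I/O errors such as Errno 5.
        print(f"sample directory {path!r} is not usable: {exc}")
        return False
```

Calling `ensure_writable_dir("/app/naifu555")` on the training machine before launch would surface a storage problem immediately rather than hours into an epoch. It cannot catch a mount that fails later during the run.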
Thank you for developing this. I'm getting this error when saving checkpoints; I've attached the log below. The training process also seems to break after roughly every 20 hours of running, for unknown reasons. Is there anything that can be done to improve this? Thanks!
Epoch 0: 16%|█▌ | 7395/45743 13:13:43<68:36:00, 6.44s/it, train_loss: 0.084, avg_loss: 0.080
rank3: Traceback (most recent call last):
rank3:   File "/workspace/naifu/trainer.py", line 58, in <module>
rank3:   File "/workspace/naifu/trainer.py", line 54, in main
rank3:     Trainer(fabric, config).train_loop()
rank3:   File "/workspace/naifu/common/trainer.py", line 316, in train_loop
rank3:   File "/workspace/naifu/common/trainer.py", line 47, in on_post_training_batch
rank3:   File "/workspace/naifu/common/trainer.py", line 167, in perform_sampling
rank3:     os.makedirs(sampling_cfg.save_dir, exist_ok=True)
rank3:   File "/usr/lib/python3.10/os.py", line 215, in makedirs
rank3:     makedirs(head, exist_ok=exist_ok)
rank3:   File "/usr/lib/python3.10/os.py", line 225, in makedirs
rank3:     mkdir(name, mode)
rank3: OSError: [Errno 5] Input/output error: '/app/naifu555'
[rank: 3] Child process with PID 170 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
rank0: Traceback (most recent call last):
rank0:   File "/workspace/naifu/trainer.py", line 58, in <module>
rank0:   File "/workspace/naifu/trainer.py", line 54, in main
rank0:     Trainer(fabric, config).train_loop()
rank0:   File "/workspace/naifu/common/trainer.py", line 316, in train_loop
rank0:   File "/workspace/naifu/common/trainer.py", line 47, in on_post_training_batch
rank0:   File "/workspace/naifu/common/trainer.py", line 167, in perform_sampling
rank0:     os.makedirs(sampling_cfg.save_dir, exist_ok=True)
rank0:   File "/usr/lib/python3.10/os.py", line 215, in makedirs
rank0:     makedirs(head, exist_ok=exist_ok)
rank0:   File "/usr/lib/python3.10/os.py", line 225, in makedirs
rank0:     mkdir(name, mode)
rank0: OSError: [Errno 5] Input/output error: '/app/naifu555'