Closed by danielw97 5 hours ago
Hi @danielw97
https://github.com/SWivid/F5-TTS/blob/ab2ad3b005ea839ab698493a819bde909761d96e/src/f5_tts/train/train.py#L24
Use an exp_name that does not conflict with an existing directory; for example, pointing to a new, empty directory is fine.
Or just clean up all the files in the existing directory.
Hi @SWivid, thanks for your fast reply on this. Unfortunately I'm still running into the same error, even when training with an empty ckpt directory. Are there other files I have to clean up beforehand? It looks to me as though these lines in trainer.py are trying to load a .pt file that doesn't exist. Is there a way to get around this when creating a new model? Thanks again for your help; I know this may be an edge case, as most folks will want to finetune.
if "model_last.pt" in os.listdir(self.checkpoint_path):
    latest_checkpoint = "model_last.pt"
else:
    latest_checkpoint = sorted(
        [f for f in os.listdir(self.checkpoint_path) if f.endswith(".pt")],
        key=lambda x: int("".join(filter(str.isdigit, x))),
    )[-1]
# checkpoint = torch.load(f"{self.checkpoint_path}/{latest_checkpoint}", map_location=self.accelerator.device)  # rather use accelerator.load_state ಥ_ಥ
checkpoint = torch.load(f"{self.checkpoint_path}/{latest_checkpoint}", weights_only=True, map_location="cpu")
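To make the selection logic in the quoted lines easier to follow, here is a small standalone sketch of the same sort key. The filenames and the helper name step_from_name are made up for illustration and are not taken from the repository.

```python
def step_from_name(filename: str) -> int:
    """Join every digit in the filename and parse the result as an int.

    This mirrors the key used in the quoted snippet:
    int("".join(filter(str.isdigit, x)))
    """
    return int("".join(filter(str.isdigit, filename)))


# Hypothetical checkpoint filenames, for illustration only.
names = ["model_400.pt", "model_1200.pt", "model_50.pt"]

# Sorting by the numeric key puts the highest step last.
ordered = sorted(names, key=step_from_name)
print(ordered[-1])        # model_1200.pt

# A plain string sort would pick a different "latest" file.
print(sorted(names)[-1])  # model_50.pt
```

Two things follow from this key. A filename with no digits at all would raise ValueError, because int("") is invalid, which is presumably why model_last.pt is handled separately. And if the filtered list is empty, indexing it with [-1] raises IndexError.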
@danielw97 make sure you are using an empty directory, so that this check is satisfied: https://github.com/SWivid/F5-TTS/blob/ab2ad3b005ea839ab698493a819bde909761d96e/src/f5_tts/model/trainer.py#L148-L153
I'm still running into this issue, unfortunately even with no project directory under ckpts. Calling train.py like this: python src/f5_tts/train/train.py creates the directory, but it errors out immediately, before training starts, with the error I mentioned. I've put a more complete output below. I'm not sure if it's something I'm doing wrong, or if there's another directory I have to make sure doesn't exist.
Using logger: None
Loading dataset ...
Download Vocos from huggingface charactr/vocos-mel-24khz
Sorting with sampler... if slow, check whether dataset is provided with duration: 100%|█| 448/448 [00:00<00:00, 1457756.
Creating dynamic batches with 38400 audio frames per gpu: 100%|██████████████████| 448/448 [00:00<00:00, 2314098.76it/s]
Traceback (most recent call last):
  File "/home/daniel/F5-TTS/src/f5_tts/train/train.py", line 103, in <module>
    main()
  File "/home/daniel/F5-TTS/src/f5_tts/train/train.py", line 96, in main
    trainer.train(
  File "/home/daniel/F5-TTS/src/f5_tts/model/trainer.py", line 257, in train
    start_step = self.load_checkpoint()
  File "/home/daniel/F5-TTS/src/f5_tts/model/trainer.py", line 159, in load_checkpoint
    latest_checkpoint = sorted(
IndexError: list index out of range
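The IndexError in this traceback is what Python raises when [-1] is applied to an empty list, which happens when the directory listing contains no .pt files. The following is a minimal sketch of that failure and of one possible guard. The helper find_latest_checkpoint is hypothetical and is not the project's actual fix.

```python
import os
import tempfile
from typing import Optional


def find_latest_checkpoint(checkpoint_path: str) -> Optional[str]:
    """Return the newest checkpoint filename, or None if there is none.

    Hypothetical helper for illustration; not part of F5-TTS.
    """
    if not os.path.isdir(checkpoint_path):
        return None
    candidates = [f for f in os.listdir(checkpoint_path) if f.endswith(".pt")]
    if not candidates:
        # The unguarded code would evaluate sorted([])[-1] here.
        return None
    if "model_last.pt" in candidates:
        return "model_last.pt"
    # Same numeric key as the snippet quoted earlier in the thread.
    return max(candidates, key=lambda x: int("".join(filter(str.isdigit, x))))


with tempfile.TemporaryDirectory() as tmp:
    try:
        sorted([f for f in os.listdir(tmp) if f.endswith(".pt")])[-1]
    except IndexError as err:
        print("unguarded:", err)  # unguarded: list index out of range
    print("guarded:", find_latest_checkpoint(tmp))  # guarded: None
```

A caller could treat a None result as "start from step 0" instead of attempting to load a file.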
"unfortunately even with no project directory under ckpts"
So you have to make sure that self.checkpoint_path is pointing to ckpts/[exp_name_you_use], that this directory is empty, and that the path has not been changed to something else.
Try printing self.checkpoint_path to see whether it is set correctly.
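A quick way to act on this suggestion is to print what the checkpoint directory actually resolves to and what it contains. This is only a sketch: describe_checkpoint_dir is a hypothetical helper, and ckpts/my_exp is a placeholder for whatever exp_name is in use.

```python
import os


def describe_checkpoint_dir(checkpoint_path: str) -> dict:
    """Collect basic facts about a checkpoint directory for debugging.

    Hypothetical helper for illustration; not part of F5-TTS.
    """
    abs_path = os.path.abspath(checkpoint_path)
    exists = os.path.isdir(abs_path)
    entries = sorted(os.listdir(abs_path)) if exists else []
    return {
        "path": abs_path,      # where the relative path really points
        "exists": exists,      # whether the directory is there at all
        "entries": entries,    # everything inside, including subdirectories
        "pt_files": [e for e in entries if e.endswith(".pt")],
    }


# "ckpts/my_exp" is a placeholder path, not a value from the thread.
info = describe_checkpoint_dir("ckpts/my_exp")
for key, value in info.items():
    print(f"{key}: {value}")
```

If entries is non-empty while pt_files is empty, the directory is not really "empty" from the trainer's point of view, even though it holds no checkpoints.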
@danielw97 oh, are you using log_samples=True?
Training from scratch seems to conflict with this newly added feature when it is set to True.
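The thread does not spell out how log_samples=True interferes with training from scratch. One plausible mechanism, offered purely as an assumption, is that the feature creates a subdirectory inside the checkpoint directory before checkpoints are loaded, so that a guard which only tests for an empty directory no longer triggers. The sketch below illustrates that idea with a hypothetical samples subdirectory and two hypothetical guard functions. It does not reproduce the actual trainer code.

```python
import os
import tempfile


def is_empty_dir(path: str) -> bool:
    """Guard variant A (assumed): treat the run as fresh only if nothing is there."""
    return not os.listdir(path)


def has_checkpoints(path: str) -> bool:
    """Guard variant B (assumed): look specifically for .pt files."""
    return any(f.endswith(".pt") for f in os.listdir(path))


with tempfile.TemporaryDirectory() as ckpt_dir:
    fresh_before = is_empty_dir(ckpt_dir)

    # Hypothetical side effect of a sample-logging feature: it creates
    # a subdirectory before any checkpoint has been written.
    os.makedirs(os.path.join(ckpt_dir, "samples"))

    fresh_after = is_empty_dir(ckpt_dir)
    resumable = has_checkpoints(ckpt_dir)

print(fresh_before, fresh_after, resumable)  # True False False
```

Under these assumptions, variant A would stop reporting a fresh run once the subdirectory exists, while variant B would still correctly report that there is nothing to resume from.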
@SWivid many thanks for your help in tracking this down, that was the problem. Hopefully this can help in future if someone else runs across this though.
@danielw97 yes, we will add a quick fix for it, so that log_samples can be used in this case.
Environment Details
Ubuntu 22.04 (WSL), Python 3.10.12
Steps to Reproduce
✔️ Expected Behavior
Training should start, at least as I understand it, even without a base model. Thanks in advance for any guidance on this.
❌ Actual Behavior
No response