unable to train a model from scratch

danielw97 commented 19 hours ago

Checks

[X] This template is only for bug reports, usage problems go with 'Help Wanted'.
[X] I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
[X] I have searched for existing issues, including closed ones, and couldn't find a solution.
[X] I confirm that I am using English to submit this report in order to facilitate communication.

Environment Details

Ubuntu 22.04 (WSL), Python 3.10.12

Steps to Reproduce

Following the training instructions, I set up a dataset and modified train.py to point to it and to set the training parameters.
When I start training without a base model to fine-tune from, I get the error below. For experimentation I'm interested in training from scratch, so please let me know if there's something different I have to do to initialize the model.

Traceback (most recent call last):
  File "/home/daniel/F5-TTS/src/f5_tts/train/train.py", line 103, in <module>
    main()
  File "/home/daniel/F5-TTS/src/f5_tts/train/train.py", line 96, in main
    trainer.train(
  File "/home/daniel/F5-TTS/src/f5_tts/model/trainer.py", line 257, in train
    start_step = self.load_checkpoint()
  File "/home/daniel/F5-TTS/src/f5_tts/model/trainer.py", line 159, in load_checkpoint
    latest_checkpoint = sorted(
IndexError: list index out of range

✔️ Expected Behavior

As I understand it, training should start even without a base model. Thanks in advance for any guidance on this.

❌ Actual Behavior

No response

SWivid commented 13 hours ago

Hi @danielw97, see https://github.com/SWivid/F5-TTS/blob/ab2ad3b005ea839ab698493a819bde909761d96e/src/f5_tts/train/train.py#L24. Use an exp_name that doesn't conflict with an existing directory; pointing to a new, empty directory is fine. Alternatively, clean up all files in the existing directory.

danielw97 commented 6 hours ago

Hi @SWivid, thanks for your fast reply on this. Unfortunately, I'm still running into the same error, even when training with an empty ckpt directory. Are there other files I have to clean up beforehand? It looks to me as though these lines in trainer.py are trying to load a .pt file that doesn't exist. Is there a way to get around this when creating a new model? Thanks again for your help; I know this may be an edge case, as most folks will want to fine-tune.

        if "model_last.pt" in os.listdir(self.checkpoint_path):
            latest_checkpoint = "model_last.pt"
        else:
            latest_checkpoint = sorted(
                [f for f in os.listdir(self.checkpoint_path) if f.endswith(".pt")],
                key=lambda x: int("".join(filter(str.isdigit, x))),
            )[-1]
        # checkpoint = torch.load(f"{self.checkpoint_path}/{latest_checkpoint}", map_location=self.accelerator.device)  # rather use accelerator.load_state ಥ_ಥ
        checkpoint = torch.load(f"{self.checkpoint_path}/{latest_checkpoint}", weights_only=True, map_location="cpu")

SWivid commented 6 hours ago

@danielw97 make sure you're using an empty dir that satisfies this check: https://github.com/SWivid/F5-TTS/blob/ab2ad3b005ea839ab698493a819bde909761d96e/src/f5_tts/model/trainer.py#L148-L153

danielw97 commented 6 hours ago

I'm still running into this issue, unfortunately even with no project directory under ckpts. Running python src/f5_tts/train/train.py creates the directory, but it errors out immediately, before training starts, with the error I mentioned. I've put more complete output below. I'm not sure whether I'm doing something wrong, or whether there's another directory I have to make sure doesn't exist.

Using logger: None
Loading dataset ...
Download Vocos from huggingface charactr/vocos-mel-24khz
Sorting with sampler... if slow, check whether dataset is provided with duration: 100%|█| 448/448 [00:00<00:00, 1457756.
Creating dynamic batches with 38400 audio frames per gpu: 100%|██████████████████| 448/448 [00:00<00:00, 2314098.76it/s]
Traceback (most recent call last):
  File "/home/daniel/F5-TTS/src/f5_tts/train/train.py", line 103, in <module>
    main()
  File "/home/daniel/F5-TTS/src/f5_tts/train/train.py", line 96, in main
    trainer.train(
  File "/home/daniel/F5-TTS/src/f5_tts/model/trainer.py", line 257, in train
    start_step = self.load_checkpoint()
  File "/home/daniel/F5-TTS/src/f5_tts/model/trainer.py", line 159, in load_checkpoint
    latest_checkpoint = sorted(
IndexError: list index out of range

SWivid commented 6 hours ago

"unfortunately even with no project directory under ckpts"

So make sure that self.checkpoint_path points to ckpts/[exp_name_you_use], that this directory is empty, and that the path has not been changed to something else. Try printing self.checkpoint_path to see whether it is set correctly.
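
A minimal sketch of that check, assuming a standalone helper that is given the value of self.checkpoint_path. The name describe_checkpoint_dir and the path ckpts/my_exp are hypothetical and not part of F5-TTS:

```python
import os


def describe_checkpoint_dir(checkpoint_path):
    """Summarize a checkpoint directory (hypothetical helper, not part of F5-TTS)."""
    if not os.path.isdir(checkpoint_path):
        return {"exists": False, "entries": [], "checkpoints": []}
    entries = sorted(os.listdir(checkpoint_path))
    # Only .pt files are candidates for resuming; any other entry
    # (for example a subdirectory) still makes the directory non-empty.
    checkpoints = [name for name in entries if name.endswith(".pt")]
    return {"exists": True, "entries": entries, "checkpoints": checkpoints}


if __name__ == "__main__":
    # "ckpts/my_exp" is a placeholder; substitute the printed self.checkpoint_path.
    print(describe_checkpoint_dir("ckpts/my_exp"))
```

If "entries" is non-empty while "checkpoints" is empty, the sorted(...)[-1] lookup quoted earlier in the thread has nothing to index.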

SWivid commented 5 hours ago

@danielw97 oh, are you using log_samples=True? This newly added feature seems to conflict with training from scratch when it is set to True.

danielw97 commented 5 hours ago

@SWivid many thanks for your help in tracking this down, that was the problem. Hopefully this helps if someone else runs into it in the future.

SWivid commented 5 hours ago

@danielw97 yes, we will add a quick fix for it, so that log_samples can be used when training from scratch.
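
For anyone hitting the same IndexError before a fix lands, one way to guard the lookup is sketched below. This is an illustration only, not the project's actual patch: find_latest_checkpoint is a hypothetical helper that mirrors the lookup quoted earlier in the thread but returns None when the directory holds no .pt files, so a caller could start from step 0 instead of indexing an empty list.

```python
import os


def find_latest_checkpoint(checkpoint_path):
    """Return the checkpoint file name to resume from, or None if there is none.

    Hypothetical helper for illustration; not the project's actual fix.
    """
    if not os.path.isdir(checkpoint_path):
        return None
    names = os.listdir(checkpoint_path)
    if "model_last.pt" in names:
        return "model_last.pt"
    # Ignore anything that is not a checkpoint file, such as subdirectories.
    checkpoints = [name for name in names if name.endswith(".pt")]
    if not checkpoints:
        return None  # nothing to resume from: train from scratch
    # Sort by the digits in the file name, as in the quoted code. Treating a
    # name without digits as 0 is an addition made by this sketch.
    return sorted(
        checkpoints,
        key=lambda name: int("".join(filter(str.isdigit, name)) or "0"),
    )[-1]
```

A caller would treat None as "no checkpoint, start at step 0" and only call torch.load when a file name is returned.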
