fishaudio / fish-speech

Brand new TTS solution
https://speech.fish.audio

[BUG] LoRA training: Missing key in state_dict #193

Open didadida-r opened 1 month ago

didadida-r commented 1 month ago


Describe the bug

Hi,

I followed the finetuning doc and added the LoRA parameter, but training fails with a "Missing key(s) in state_dict" error. Thanks!

If you want to use LoRA, please add the following parameter: +lora@model.lora_config=r_8_alpha_16

To Reproduce

Steps to reproduce the behavior:

python fish_speech/train.py \
    --config-name text2semantic_ntes_finetune_44k_ar2 \
    model@model.model=dual_ar_2_codebook_medium \
    +lora@model.lora_config=r_8_alpha_16
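
For context, this failure can be reproduced in isolation: PyTorch's load_state_dict is strict by default, so restoring a checkpoint that was saved before the LoRA layers existed into a model that now contains lora_A/lora_B parameters (the config name r_8_alpha_16 presumably selects rank 8 and alpha 16) raises exactly this kind of missing-key error. A minimal standalone sketch in plain PyTorch, not fish-speech code:

import torch
from torch import nn

# Standalone illustration: strict loading fails when the target module
# has parameters (here, LoRA adapters) that the checkpoint lacks.
class Plain(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

class WithLora(Plain):
    def __init__(self, r=8):
        super().__init__()  # keeps self.proj from Plain
        self.lora_A = nn.Parameter(torch.zeros(r, 4))
        self.lora_B = nn.Parameter(torch.zeros(4, r))

ckpt = Plain().state_dict()        # checkpoint saved before LoRA was added
WithLora().load_state_dict(ckpt)   # RuntimeError: Missing key(s): "lora_A", "lora_B"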

Expected behavior

LoRA fine-tuning starts (resuming from the checkpoint) without the missing-key error.

Screenshots / log

[2024-05-13 16:57:07,956][__main__][INFO] - [rank: 0] Instantiating datamodule <fish_speech.datasets.text.TextDataModule>
[2024-05-13 16:57:09,639][datasets][INFO] - PyTorch version 2.2.0 available.
[2024-05-13 16:57:10,409][__main__][INFO] - [rank: 0] Instantiating model <fish_speech.models.text2semantic.TextToSemantic>
[2024-05-13 16:57:16,370][__main__][INFO] - [rank: 0] Instantiating callbacks...
[2024-05-13 16:57:16,371][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.ModelCheckpoint>
[2024-05-13 16:57:16,377][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.ModelSummary>
[2024-05-13 16:57:16,377][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.LearningRateMonitor>
[2024-05-13 16:57:16,378][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <fish_speech.callbacks.GradNormMonitor>
[2024-05-13 16:57:16,389][__main__][INFO] - [rank: 0] Instantiating loggers...
[2024-05-13 16:57:16,390][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating logger <lightning.pytorch.loggers.tensorboard.TensorBoardLogger>
[2024-05-13 16:57:16,395][__main__][INFO] - [rank: 0] Instantiating trainer <lightning.pytorch.trainer.Trainer>
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2024-05-13 16:57:18,794][__main__][INFO] - [rank: 0] Logging hyperparameters!
[2024-05-13 16:57:19,240][__main__][INFO] - [rank: 0] Starting training!
[2024-05-13 16:57:19,245][__main__][INFO] - [rank: 0] Resuming from checkpoint: results/text2semantic_finetune_44k_ar2/checkpoints/step_000001000.ckpt
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:653: Checkpoint directory /home/test/code/TTS/llm_tts/egs/gpt/_tuned/results/text2semantic_finetune_44k_ar2/checkpoints exists and is not empty.
Restoring states from the checkpoint path at results/text2semantic_finetune_44k_ar2/checkpoints/step_000001000.ckpt
[2024-05-13 16:57:34,357][fish_speech.utils.utils][ERROR] - [rank: 0] 
Traceback (most recent call last):
  File "/home/test/code/TTS/llm_tts/fish_speech/utils/utils.py", line 66, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "/home/test/code/TTS/llm_tts/egs/gpt/_tuned/fish_speech/train.py", line 108, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=ckpt_path)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 956, in _run
    self._checkpoint_connector._restore_modules_and_callbacks(ckpt_path)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 398, in _restore_modules_and_callbacks
    self.restore_model()
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 275, in restore_model
    self.trainer.strategy.load_model_state_dict(
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 372, in load_model_state_dict
    self.lightning_module.load_state_dict(checkpoint["state_dict"], strict=strict)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TextToSemantic:
    Missing key(s) in state_dict: "model.embeddings.lora_A", "model.embeddings.lora_B", "model.layers.0.attention.wqkv.lora_A", "model.layers.0.attention.wqkv.lora_B", "model.layers.0.attention.wo.lora_A", "model.layers.0.attention.wo.lora_B", "model.layers.0.feed_forward.w1.lora_A", "model.layers.0.feed_forward.w1.lora_B", "model.layers.0.feed_forward.w3.lora_A", "model.layers.0.feed_forward.w3.lora_B", "model.layers.0.feed_forward.w2.lora_A", "model.layers.0.feed_forward.w2.lora_B", "model.layers.1.attention.wqkv.lora_A", "model.layers.1.attention.wqkv.lora_B", "model.layers.1.attention.wo.lora_A", "model.layers.1.attention.wo.lora_B", "model.layers.1.feed_forward.w1.lora_A", "model.layers.1.feed_forward.w1.lora_B", "model.layers.1.feed_forward.w3.lora_A", "model.layers.1.feed_forward.w3.lora_B", "model.layers.1.feed_forward.w2.lora_A", "model.layers.1.feed_forward.w2.lora_B", "model.layers.2.attention.wqkv.lora_A", "model.layers.2.attention.wqkv.lora_B", "model.layers.2.attention.wo.lora_A", "model.layers.2.attention.wo.lora_B", "model.layers.2.feed_forward.w1.lora_A", "model.layers.2.feed_forward.w1.lora_B", "model.layers.2.feed_forward.w3.lora_A", "model.layers.2.feed_forward.w3.lora_B", "model.layers.2.feed_forward.w2.lora_A", "model.layers.2.feed_forward.w2.lora_B", "model.layers.3.attention.wqkv.lora_A", "model.layers.3.attention.wqkv.lora_B", "model.layers.3.attention.wo.lora_A", "model.layers.3.attention.wo.lora_B", "model.layers.3.feed_forward.w1.lora_A", "model.layers.3.feed_forward.w1.lora_B", "model.layers.3.feed_forward.w3.lora_A", "model.layers.3.feed_forward.w3.lora_B", "model.layers.3.feed_forward.w2.lora_A", "model.layers.3.feed_forward.w2.lora_B", "model.layers.4.attention.wqkv.lora_A", "model.layers.4.attention.wqkv.lora_B", "model.layers.4.attention.wo.lora_A", "model.layers.4.attention.wo.lora_B", "model.layers.4.feed_forward.w1.lora_A", "model.layers.4.feed_forward.w1.lora_B", "model.layers.4.feed_forward.w3.lora_A", "model.layers.4.feed_forward.w3.lora_B", "model.layers.4.feed_forward.w2.lora_A", "model.layers.4.feed_forward.w2.lora_B", "model.layers.5.attention.wqkv.lora_A", "model.layers.5.attention.wqkv.lora_B", "model.layers.5.attention.wo.lora_A", "model.layers.5.attention.wo.lora_B", "model.layers.5.feed_forward.w1.lora_A", "model.layers.5.feed_forward.w1.lora_B", "model.layers.5.feed_forward.w3.lora_A", "model.layers.5.feed_forward.w3.lora_B", "model.layers.5.feed_forward.w2.lora_A", "model.layers.5.feed_forward.w2.lora_B", "model.layers.6.attention.wqkv.lora_A", "model.layers.6.attention.wqkv.lora_B", "model.layers.6.attention.wo.lora_A", "model.layers.6.attention.wo.lora_B", "model.layers.6.feed_forward.w1.lora_A", "model.layers.6.feed_forward.w1.lora_B", "model.layers.6.feed_forward.w3.lora_A", "model.layers.6.feed_forward.w3.lora_B", "model.layers.6.feed_forward.w2.lora_A", "model.layers.6.feed_forward.w2.lora_B", "model.layers.7.attention.wqkv.lora_A", "model.layers.7.attention.wqkv.lora_B", "model.layers.7.attention.wo.lora_A", "model.layers.7.attention.wo.lora_B", "model.layers.7.feed_forward.w1.lora_A", "model.layers.7.feed_forward.w1.lora_B", "model.layers.7.feed_forward.w3.lora_A", "model.layers.7.feed_forward.w3.lora_B", "model.layers.7.feed_forward.w2.lora_A", "model.layers.7.feed_forward.w2.lora_B", "model.layers.8.attention.wqkv.lora_A", "model.layers.8.attention.wqkv.lora_B", "model.layers.8.attention.wo.lora_A", "model.layers.8.attention.wo.lora_B", "model.layers.8.feed_forward.w1.lora_A", "model.layers.8.feed_forward.w1.lora_B", 
"model.layers.8.feed_forward.w3.lora_A", "model.layers.8.feed_forward.w3.lora_B", "model.layers.8.feed_forward.w2.lora_A", "model.layers.8.feed_forward.w2.lora_B", "model.layers.9.attention.wqkv.lora_A", "model.layers.9.attention.wqkv.lora_B", "model.layers.9.attention.wo.lora_A", "model.layers.9.attention.wo.lora_B", "model.layers.9.feed_forward.w1.lora_A", "model.layers.9.feed_forward.w1.lora_B", "model.layers.9.feed_forward.w3.lora_A", "model.layers.9.feed_forward.w3.lora_B", "model.layers.9.feed_forward.w2.lora_A", "model.layers.9.feed_forward.w2.lora_B", "model.layers.10.attention.wqkv.lora_A", "model.layers.10.attention.wqkv.lora_B", "model.layers.10.attention.wo.lora_A", "model.layers.10.attention.wo.lora_B", "model.layers.10.feed_forward.w1.lora_A", "model.layers.10.feed_forward.w1.lora_B", "model.layers.10.feed_forward.w3.lora_A", "model.layers.10.feed_forward.w3.lora_B", "model.layers.10.feed_forward.w2.lora_A", "model.layers.10.feed_forward.w2.lora_B", "model.layers.11.attention.wqkv.lora_A", "model.layers.11.attention.wqkv.lora_B", "model.layers.11.attention.wo.lora_A", "model.layers.11.attention.wo.lora_B", "model.layers.11.feed_forward.w1.lora_A", "model.layers.11.feed_forward.w1.lora_B", "model.layers.11.feed_forward.w3.lora_A", "model.layers.11.feed_forward.w3.lora_B", "model.layers.11.feed_forward.w2.lora_A", "model.layers.11.feed_forward.w2.lora_B", "model.layers.12.attention.wqkv.lora_A", "model.layers.12.attention.wqkv.lora_B", "model.layers.12.attention.wo.lora_A", "model.layers.12.attention.wo.lora_B", "model.layers.12.feed_forward.w1.lora_A", "model.layers.12.feed_forward.w1.lora_B", "model.layers.12.feed_forward.w3.lora_A", "model.layers.12.feed_forward.w3.lora_B", "model.layers.12.feed_forward.w2.lora_A", "model.layers.12.feed_forward.w2.lora_B", "model.layers.13.attention.wqkv.lora_A", "model.layers.13.attention.wqkv.lora_B", "model.layers.13.attention.wo.lora_A", "model.layers.13.attention.wo.lora_B", "model.layers.13.feed_forward.w1.lora_A", "model.layers.13.feed_forward.w1.lora_B", "model.layers.13.feed_forward.w3.lora_A", "model.layers.13.feed_forward.w3.lora_B", "model.layers.13.feed_forward.w2.lora_A", "model.layers.13.feed_forward.w2.lora_B", "model.layers.14.attention.wqkv.lora_A", "model.layers.14.attention.wqkv.lora_B", "model.layers.14.attention.wo.lora_A", "model.layers.14.attention.wo.lora_B", "model.layers.14.feed_forward.w1.lora_A", "model.layers.14.feed_forward.w1.lora_B", "model.layers.14.feed_forward.w3.lora_A", "model.layers.14.feed_forward.w3.lora_B", "model.layers.14.feed_forward.w2.lora_A", "model.layers.14.feed_forward.w2.lora_B", "model.layers.15.attention.wqkv.lora_A", "model.layers.15.attention.wqkv.lora_B", "model.layers.15.attention.wo.lora_A", "model.layers.15.attention.wo.lora_B", "model.layers.15.feed_forward.w1.lora_A", "model.layers.15.feed_forward.w1.lora_B", "model.layers.15.feed_forward.w3.lora_A", "model.layers.15.feed_forward.w3.lora_B", "model.layers.15.feed_forward.w2.lora_A", "model.layers.15.feed_forward.w2.lora_B", "model.layers.16.attention.wqkv.lora_A", "model.layers.16.attention.wqkv.lora_B", "model.layers.16.attention.wo.lora_A", "model.layers.16.attention.wo.lora_B", "model.layers.16.feed_forward.w1.lora_A", "model.layers.16.feed_forward.w1.lora_B", "model.layers.16.feed_forward.w3.lora_A", "model.layers.16.feed_forward.w3.lora_B", "model.layers.16.feed_forward.w2.lora_A", "model.layers.16.feed_forward.w2.lora_B", "model.layers.17.attention.wqkv.lora_A", "model.layers.17.attention.wqkv.lora_B", 
"model.layers.17.attention.wo.lora_A", "model.layers.17.attention.wo.lora_B", "model.layers.17.feed_forward.w1.lora_A", "model.layers.17.feed_forward.w1.lora_B", "model.layers.17.feed_forward.w3.lora_A", "model.layers.17.feed_forward.w3.lora_B", "model.layers.17.feed_forward.w2.lora_A", "model.layers.17.feed_forward.w2.lora_B", "model.layers.18.attention.wqkv.lora_A", "model.layers.18.attention.wqkv.lora_B", "model.layers.18.attention.wo.lora_A", "model.layers.18.attention.wo.lora_B", "model.layers.18.feed_forward.w1.lora_A", "model.layers.18.feed_forward.w1.lora_B", "model.layers.18.feed_forward.w3.lora_A", "model.layers.18.feed_forward.w3.lora_B", "model.layers.18.feed_forward.w2.lora_A", "model.layers.18.feed_forward.w2.lora_B", "model.layers.19.attention.wqkv.lora_A", "model.layers.19.attention.wqkv.lora_B", "model.layers.19.attention.wo.lora_A", "model.layers.19.attention.wo.lora_B", "model.layers.19.feed_forward.w1.lora_A", "model.layers.19.feed_forward.w1.lora_B", "model.layers.19.feed_forward.w3.lora_A", "model.layers.19.feed_forward.w3.lora_B", "model.layers.19.feed_forward.w2.lora_A", "model.layers.19.feed_forward.w2.lora_B", "model.layers.20.attention.wqkv.lora_A", "model.layers.20.attention.wqkv.lora_B", "model.layers.20.attention.wo.lora_A", "model.layers.20.attention.wo.lora_B", "model.layers.20.feed_forward.w1.lora_A", "model.layers.20.feed_forward.w1.lora_B", "model.layers.20.feed_forward.w3.lora_A", "model.layers.20.feed_forward.w3.lora_B", "model.layers.20.feed_forward.w2.lora_A", "model.layers.20.feed_forward.w2.lora_B", "model.layers.21.attention.wqkv.lora_A", "model.layers.21.attention.wqkv.lora_B", "model.layers.21.attention.wo.lora_A", "model.layers.21.attention.wo.lora_B", "model.layers.21.feed_forward.w1.lora_A", "model.layers.21.feed_forward.w1.lora_B", "model.layers.21.feed_forward.w3.lora_A", "model.layers.21.feed_forward.w3.lora_B", "model.layers.21.feed_forward.w2.lora_A", "model.layers.21.feed_forward.w2.lora_B", "model.layers.22.attention.wqkv.lora_A", "model.layers.22.attention.wqkv.lora_B", "model.layers.22.attention.wo.lora_A", "model.layers.22.attention.wo.lora_B", "model.layers.22.feed_forward.w1.lora_A", "model.layers.22.feed_forward.w1.lora_B", "model.layers.22.feed_forward.w3.lora_A", "model.layers.22.feed_forward.w3.lora_B", "model.layers.22.feed_forward.w2.lora_A", "model.layers.22.feed_forward.w2.lora_B", "model.layers.23.attention.wqkv.lora_A", "model.layers.23.attention.wqkv.lora_B", "model.layers.23.attention.wo.lora_A", "model.layers.23.attention.wo.lora_B", "model.layers.23.feed_forward.w1.lora_A", "model.layers.23.feed_forward.w1.lora_B", "model.layers.23.feed_forward.w3.lora_A", "model.layers.23.feed_forward.w3.lora_B", "model.layers.23.feed_forward.w2.lora_A", "model.layers.23.feed_forward.w2.lora_B", "model.output.lora_A", "model.output.lora_B", "model.fast_embeddings.lora_A", "model.fast_embeddings.lora_B", "model.fast_layers.0.attention.wqkv.lora_A", "model.fast_layers.0.attention.wqkv.lora_B", "model.fast_layers.0.attention.wo.lora_A", "model.fast_layers.0.attention.wo.lora_B", "model.fast_layers.0.feed_forward.w1.lora_A", "model.fast_layers.0.feed_forward.w1.lora_B", "model.fast_layers.0.feed_forward.w3.lora_A", "model.fast_layers.0.feed_forward.w3.lora_B", "model.fast_layers.0.feed_forward.w2.lora_A", "model.fast_layers.0.feed_forward.w2.lora_B", "model.fast_layers.1.attention.wqkv.lora_A", "model.fast_layers.1.attention.wqkv.lora_B", "model.fast_layers.1.attention.wo.lora_A", "model.fast_layers.1.attention.wo.lora_B", 
"model.fast_layers.1.feed_forward.w1.lora_A", "model.fast_layers.1.feed_forward.w1.lora_B", "model.fast_layers.1.feed_forward.w3.lora_A", "model.fast_layers.1.feed_forward.w3.lora_B", "model.fast_layers.1.feed_forward.w2.lora_A", "model.fast_layers.1.feed_forward.w2.lora_B", "model.fast_layers.2.attention.wqkv.lora_A", "model.fast_layers.2.attention.wqkv.lora_B", "model.fast_layers.2.attention.wo.lora_A", "model.fast_layers.2.attention.wo.lora_B", "model.fast_layers.2.feed_forward.w1.lora_A", "model.fast_layers.2.feed_forward.w1.lora_B", "model.fast_layers.2.feed_forward.w3.lora_A", "model.fast_layers.2.feed_forward.w3.lora_B", "model.fast_layers.2.feed_forward.w2.lora_A", "model.fast_layers.2.feed_forward.w2.lora_B", "model.fast_layers.3.attention.wqkv.lora_A", "model.fast_layers.3.attention.wqkv.lora_B", "model.fast_layers.3.attention.wo.lora_A", "model.fast_layers.3.attention.wo.lora_B", "model.fast_layers.3.feed_forward.w1.lora_A", "model.fast_layers.3.feed_forward.w1.lora_B", "model.fast_layers.3.feed_forward.w3.lora_A", "model.fast_layers.3.feed_forward.w3.lora_B", "model.fast_layers.3.feed_forward.w2.lora_A", "model.fast_layers.3.feed_forward.w2.lora_B", "model.fast_layers.4.attention.wqkv.lora_A", "model.fast_layers.4.attention.wqkv.lora_B", "model.fast_layers.4.attention.wo.lora_A", "model.fast_layers.4.attention.wo.lora_B", "model.fast_layers.4.feed_forward.w1.lora_A", "model.fast_layers.4.feed_forward.w1.lora_B", "model.fast_layers.4.feed_forward.w3.lora_A", "model.fast_layers.4.feed_forward.w3.lora_B", "model.fast_layers.4.feed_forward.w2.lora_A", "model.fast_layers.4.feed_forward.w2.lora_B", "model.fast_layers.5.attention.wqkv.lora_A", "model.fast_layers.5.attention.wqkv.lora_B", "model.fast_layers.5.attention.wo.lora_A", "model.fast_layers.5.attention.wo.lora_B", "model.fast_layers.5.feed_forward.w1.lora_A", "model.fast_layers.5.feed_forward.w1.lora_B", "model.fast_layers.5.feed_forward.w3.lora_A", "model.fast_layers.5.feed_forward.w3.lora_B", "model.fast_layers.5.feed_forward.w2.lora_A", "model.fast_layers.5.feed_forward.w2.lora_B", "model.fast_output.lora_A", "model.fast_output.lora_B". 
[2024-05-13 16:57:34,423][fish_speech.utils.utils][INFO] - [rank: 0] Output dir: results/text2semantic_finetune_44k_ar2
Error executing job with overrides: ['model@model.model=dual_ar_2_codebook_medium', '+lora@model.lora_config=r_8_alpha_16']
Traceback (most recent call last):
  File "/home/test/code/TTS/llm_tts/egs/gpt/_tuned/fish_speech/train.py", line 135, in main
    train(cfg)
  File "/home/test/code/TTS/llm_tts/fish_speech/utils/utils.py", line 77, in wrap
    raise ex
  File "/home/test/code/TTS/llm_tts/fish_speech/utils/utils.py", line 66, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "/home/test/code/TTS/llm_tts/egs/gpt/_tuned/fish_speech/train.py", line 108, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=ckpt_path)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 956, in _run
    self._checkpoint_connector._restore_modules_and_callbacks(ckpt_path)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 398, in _restore_modules_and_callbacks
    self.restore_model()
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 275, in restore_model
    self.trainer.strategy.load_model_state_dict(
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 372, in load_model_state_dict
    self.lightning_module.load_state_dict(checkpoint["state_dict"], strict=strict)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TextToSemantic:
    Missing key(s) in state_dict: (same list of lora_A/lora_B keys as in the first traceback above)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
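
For the full Hydra stack trace, the variable can be set inline when relaunching (standard Hydra behavior, nothing fish-speech-specific):

HYDRA_FULL_ERROR=1 python fish_speech/train.py \
    --config-name text2semantic_ntes_finetune_44k_ar2 \
    model@model.model=dual_ar_2_codebook_medium \
    +lora@model.lora_config=r_8_alpha_16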


leng-yue commented 1 month ago

It seems you are resuming training, which is not currently supported with LoRA.
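
If you just want to carry the base weights over into a LoRA run, one possible workaround is to load them into the LoRA-wrapped model yourself with strict=False instead of resuming through the trainer's ckpt_path. This is an untested sketch, not a documented API: it assumes model, trainer, and datamodule are the objects already built in fish_speech/train.py, and it reuses the checkpoint path from this issue.

import torch

ckpt = torch.load(
    "results/text2semantic_finetune_44k_ar2/checkpoints/step_000001000.ckpt",
    map_location="cpu",
)
state_dict = ckpt.get("state_dict", ckpt)

# strict=False restores the base weights and leaves the freshly initialized
# lora_A/lora_B adapters untouched instead of raising on the missing keys.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
assert not unexpected and all("lora_" in k for k in missing)

trainer.fit(model=model, datamodule=datamodule)  # note: no ckpt_path

Note that optimizer and scheduler state from the old run are discarded on purpose; only the module weights carry over, and the LoRA adapters start from their fresh initialization.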

yixian3500 commented 2 weeks ago

It seems you are resuming training, which is not currently supported with LoRA.

I got the same error here. Could you point out the correct steps? Is the right way simply not to resume training? Thanks!