Open jeffchy opened 2 months ago
Solved by using the 24.07 image and manually installing Nemo-Run + upgrading Nemo (built from source).
Thanks @jeffchy for creating the issue. Glad to know you were able to fix it. Please let us know if you run into this issue again. Is it ok to close the issue for now, since you were able to solve it?
I'm able to pass the phase I mentioned above, but it then raises a CheckPointError.
@jeffchy is that the same error as above or a new one? Could you share it if it's new?
It's a new one; I'll try to reproduce the error.
Update: I can successfully run the newest pretrain recipe https://github.com/NVIDIA/NeMo/blob/main/examples/llm/run/llama3_pretraining.py
but it failed when I tried to use finetune_recipe with my own model. I replaced hf_resume() with:
```python
def hf_resume() -> Config[nl.AutoResume]:
    return Config(nl.AutoResume, import_path="hf://{my local model path}")
```
And I got:

```
llama3-8b/0 [default3]:[rank3]: self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
llama3-8b/0 [default3]:[rank3]: File "/workspace/NeMo/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 636, in load_optimizer_state_dict
llama3-8b/0 [default3]:[rank3]: optimizer_states = checkpoint["optimizer"]
llama3-8b/0 [default3]:[rank3]: KeyError: 'optimizer'
```
I'm not familiar with NeMo; maybe I got something wrong?
`import_path` is a special argument that's intended only for HF -> NeMo model conversions. If your model was already trained using NeMo, you don't need it. In that case you can use `path` instead of `import_path`.
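Following that suggestion, the resume config for a checkpoint already in NeMo format would look roughly like this (a sketch modeled on the earlier snippet in this thread; the checkpoint path is a placeholder, and the exact `AutoResume` parameter names may differ across NeMo versions):

```python
def nemo_resume() -> Config[nl.AutoResume]:
    # `path` points at a checkpoint already in NeMo format;
    # `import_path` is reserved for HF -> NeMo conversion.
    return Config(nl.AutoResume, path="{my local NeMo checkpoint path}")
```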
Thanks for your reply, but if I have a custom fine-tuned HF model (stored locally), how do I start from it? Do I need to convert it in advance?
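For context, NeMo 2.x provides a helper to convert an HF checkpoint into NeMo format ahead of time. A hedged sketch, assuming the `llm.import_ckpt` API and a Llama 3 8B config; the model class and local path are placeholders, so check the NeMo docs for your version:

```python
from nemo.collections import llm

# One-time conversion of a local HF checkpoint into NeMo format.
# Model config and source path here are assumptions for illustration.
llm.import_ckpt(
    model=llm.LlamaModel(config=llm.Llama3Config8B()),
    source="hf:///path/to/my/finetuned/hf/model",
)
```

The resulting NeMo-format checkpoint could then be used for resuming via `path` rather than `import_path`.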
Segmentation fault when using the dev container to train the llm finetune recipe: