Lightning-AI / lit-llama

Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training. Apache 2.0-licensed.

No response after training an epoch #443

Open · Dylandtt opened this issue 1 year ago

Dylandtt commented 1 year ago

```
iter 1993: loss 0.7506, time: 483.78ms
iter 1994: loss 0.9028, time: 339.06ms
iter 1995: loss 0.9767, time: 521.26ms
iter 1996: loss 0.8616, time: 419.42ms
iter 1997: loss 0.7878, time: 480.91ms
iter 1998: loss 0.6554, time: 407.63ms
Saving adapter weights to out/adapter_v2/alpaca
[line above repeated 4 times]
/root/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[warning above repeated 4 times]
```
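(For context: the `UserWarning` here is PyTorch's deprecation notice for calling `nn.Module.state_dict` with positional arguments, presumably triggered somewhere in the checkpoint-saving path; on its own it is harmless and doesn't explain a hang. A minimal reproduction, independent of lit-llama:)

```python
# Minimal reproduction of the UserWarning above (not lit-llama code):
# torch.nn.Module.state_dict warns when its arguments are passed
# positionally instead of as keywords.
from collections import OrderedDict

import torch.nn as nn

m = nn.Linear(4, 4)

# Positional form -> emits the "Positional args are being deprecated" warning.
sd = m.state_dict(OrderedDict(), "model.", False)

# Keyword form -> no warning; the result is the same either way.
sd = m.state_dict(destination=OrderedDict(), prefix="model.", keep_vars=False)
```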

LamOne1 commented 1 year ago

You need to adjust the `eval_interval` and `gradient_accumulation_iters` variables so the validation results get shown, or you can change the code to evaluate the model at the end of training by copying what runs under `if step_count % eval_interval == 0:` to the end of the `train` function, as in the sketch below.
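For illustration, a rough sketch of that change (not the actual lit-llama code; `validate`, `fabric`, `val_data`, `tokenizer`, `max_iters`, and `eval_interval` are assumed to match the names already used in the finetune script):

```python
# Hedged sketch: re-run the periodic-eval block once after the loop so a
# final validation loss is reported even when training ends between evals.
# All names below are assumed to mirror those in lit-llama's finetune script.
def train(fabric, model, optimizer, train_data, val_data, tokenizer, out_dir):
    step_count = 0
    for iter_num in range(max_iters):
        # ... existing forward/backward/optimizer logic, which increments
        # step_count on each optimizer step ...
        if step_count % eval_interval == 0:
            val_loss = validate(fabric, model, val_data, tokenizer)
            fabric.print(f"step {step_count}: val loss {val_loss:.4f}")

    # Copied from the block above: one last evaluation at the end of training.
    val_loss = validate(fabric, model, val_data, tokenizer)
    fabric.print(f"final val loss {val_loss:.4f}")
```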

Dylandtt commented 1 year ago

After training for an epoch, it doesn't continue training, but it doesn't exit either; it just stays stuck on this screen.