LIN-SHANG / InstructERC

The official implementation of InstructERC

Current loss scale already at minimum - cannot decrease scale anymore. OVERFLOW! Rank 0 Skipping step. #6

Closed: smallcatdog closed this issue 10 months ago

smallcatdog commented 11 months ago

Hello, I am running experiments on the "emotion impact prediction task and main task". The setup is a V100 32GB, batch_size=1, history_window=12. How can I solve the following problems?
(1) OVERFLOW! Rank 0 Skipping step.
(2) Current loss scale already at minimum - cannot decrease scale anymore.


dongguanting commented 11 months ago

We cannot reproduce this issue; maybe you can turn off the "fp16" precision setting in my DeepSpeed config. Here is a similar issue in the DeepSpeed framework:

https://github.com/microsoft/DeepSpeedExamples/issues/418
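For concreteness, a minimal sketch of that change, assuming the config lives in a JSON file (the file name ds_config.json and the key layout here are assumptions, not the repo's actual paths):

import json

# Hypothetical example: disable fp16 mixed precision in a DeepSpeed config
# file; "ds_config.json" is a placeholder for the repo's actual config path.
with open("ds_config.json") as f:
    ds_config = json.load(f)

ds_config.setdefault("fp16", {})["enabled"] = False  # turn off fp16

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)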


HenFo commented 10 months ago

Are you using LLaMA2? Changing the DeepSpeed config does not work, since they override the config in code (main_new.py). The weights of LLaMA2 seem to have problems with the FP16 conversion used in this scenario. I changed the code to use bfloat16 and everything worked fine.

You should find something similar to the following code in main_new.py, except with the True and False swapped.

from transformers import LlamaConfig, LlamaForCausalLM, LlamaTokenizer

config = LlamaConfig.from_pretrained(args.model_name_or_path)
tokenizer = LlamaTokenizer.from_pretrained(args.model_name_or_path)
# originally .half() (fp16, which triggers the overflow); cast to
# bfloat16 instead to match the config flags below
model = LlamaForCausalLM.from_pretrained(args.model_name_or_path).bfloat16()

# deepspeed_config is the dict the script passes to deepspeed.initialize()
deepspeed_config["bfloat16"]["enabled"] = True
deepspeed_config["fp16"]["enabled"] = False