XiangLi1999 / Diffusion-LM

Apache License 2.0

Controllable Text Generation: after the model trained for 6 epochs and evaluation started, a KeyError: 'eval_loss' was raised #65

Open Markkk111 opened 1 year ago

Markkk111 commented 1 year ago

Hello! I ran into the following problem while training the classifier for the Controllable Text Generation step and would appreciate any help.

  1. The error message is as follows. The huggingface/tokenizers warning is printed repeatedly during the run (a way to silence it is sketched after the code in item 2):

    huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
    • Avoid using tokenizers before the fork if possible
    • Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

    {'loss': 0.0185, 'learning_rate': 7.973102785782901e-06, 'epoch': 5.04}
    {'loss': 0.0185, 'learning_rate': 6.9724623759205894e-06, 'epoch': 5.16}
    {'loss': 0.0188, 'learning_rate': 5.971821966058277e-06, 'epoch': 5.28}
    {'loss': 0.0178, 'learning_rate': 4.971181556195966e-06, 'epoch': 5.4}
    wandb: Network error (ReadTimeout), entering retry loop.
    {'loss': 0.0183, 'learning_rate': 3.970541146333654e-06, 'epoch': 5.52}
    {'loss': 0.018, 'learning_rate': 2.9699007364713415e-06, 'epoch': 5.64}
    {'loss': 0.0179, 'learning_rate': 1.96926032660903e-06, 'epoch': 5.76}
    {'loss': 0.0174, 'learning_rate': 9.68619916746718e-07, 'epoch': 5.88}
    [INFO|trainer.py:1901] 2023-04-19 17:43:28,689 >>

    Training completed. Do not forget to share your model on huggingface.co/models =)

    {'train_runtime': 1749.401, 'train_samples_per_second': 142.815, 'train_steps_per_second': 14.281, 'train_loss': 0.023130248267032992, 'epoch': 6.0}
    [INFO|trainer.py:2709] 2023-04-19 17:43:28,693 >> Saving model checkpoint to classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None
    [INFO|configuration_utils.py:453] 2023-04-19 17:43:28,694 >> Configuration saved in classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None/config.json
    [INFO|modeling_utils.py:1704] 2023-04-19 17:43:29,841 >> Model weights saved in classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None/pytorch_model.bin
    ***** train metrics *****
      epoch                    =        6.0
      train_loss               =     0.0231
      train_runtime            = 0:29:09.40
      train_samples            =      41640
      train_samples_per_second =    142.815
      train_steps_per_second   =     14.281
    04/19/2023 17:43:29 - INFO - __main__ - *** Evaluate ***
    [INFO|trainer.py:710] 2023-04-19 17:43:29,848 >> The following columns in the evaluation set don't have a corresponding argument in Classifier_Tree.forward and have been ignored: chart_lst. If chart_lst are not expected by Classifier_Tree.forward, you can safely ignore this message.
    [INFO|trainer.py:2964] 2023-04-19 17:43:29,850 >> ***** Running Evaluation *****
    [INFO|trainer.py:2966] 2023-04-19 17:43:29,850 >>   Num examples = 421
    [INFO|trainer.py:2969] 2023-04-19 17:43:29,851 >>   Batch size = 10
    huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:

  2. The relevant code that caused the error is as follows:

    Training

    if training_args.do_train:
        checkpoint = None
        if training_args.resume_from_checkpoint is not None:
            checkpoint = training_args.resume_from_checkpoint
        elif last_checkpoint is not None:
            checkpoint = last_checkpoint
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload

        metrics = train_result.metrics

        max_train_samples = (
            data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
        )
        metrics["train_samples"] = min(max_train_samples, len(train_dataset))

        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()

    Evaluation

    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        metrics = trainer.evaluate()

        max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
        metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
        try:
            # The KeyError: 'eval_loss' is raised on the next line when
            # trainer.evaluate() returns no eval loss (the except clause only
            # catches OverflowError); see the defensive sketch after this listing.
            perplexity = math.exp(metrics["eval_loss"])
        except OverflowError:
            perplexity = float("inf")
        metrics["perplexity"] = perplexity

        trainer.log_metrics("eval", metrics)
        trainer.save_metrics("eval", metrics)

    kwargs = {"finetuned_from": model_args.model_name_or_path, "tasks": "text-generation"}
    if data_args.dataset_name is not None:
        kwargs["dataset_tags"] = data_args.dataset_name
        if data_args.dataset_config_name is not None:
            kwargs["dataset_args"] = data_args.dataset_config_name
            kwargs["dataset"] = f"{data_args.dataset_name} {data_args.dataset_config_name}"
        else:
            kwargs["dataset"] = data_args.dataset_name

    if training_args.push_to_hub:
        trainer.push_to_hub(**kwargs)
    else:
        trainer.create_model_card(**kwargs)
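The KeyError itself comes from the metrics["eval_loss"] lookup in the evaluation block above: the try/except only catches OverflowError, and trainer.evaluate() typically includes an eval_loss entry only when the model's forward returns a loss for the evaluation batches. If the eval set reaches Classifier_Tree.forward without usable labels, the key is simply missing. A minimal defensive sketch (not the repository's code; add_perplexity is a hypothetical helper, and the real fix is still to make sure the evaluation data lets the model return a loss):

    import logging
    import math

    logger = logging.getLogger(__name__)

    def add_perplexity(metrics: dict) -> dict:
        """Hypothetical helper: only compute perplexity when 'eval_loss' exists.

        Assumption: trainer.evaluate() omitted 'eval_loss' because the model's
        forward returned no loss for the evaluation set (e.g. the label column
        was dropped or never produced).
        """
        eval_loss = metrics.get("eval_loss")
        if eval_loss is None:
            logger.warning(
                "trainer.evaluate() returned no 'eval_loss'; check that the "
                "evaluation dataset still carries the labels that "
                "Classifier_Tree.forward needs to return a loss."
            )
            return metrics
        try:
            metrics["perplexity"] = math.exp(eval_loss)
        except OverflowError:
            metrics["perplexity"] = float("inf")
        return metrics

With such a helper, the evaluation block would call metrics = add_perplexity(metrics) instead of indexing metrics["eval_loss"] directly, so the run finishes and the warning points at the missing loss instead of crashing.

Separately, the repeated huggingface/tokenizers warning in the log of item 1 is unrelated to the KeyError; as the message itself says, it can be silenced by setting TOKENIZERS_PARALLELISM before any tokenizer is used. A minimal sketch, assuming it is placed at the very top of the training script:

    import os

    # Set before tokenizers/transformers are imported or a tokenizer is used,
    # otherwise the fork-time warning can still be printed.
    os.environ["TOKENIZERS_PARALLELISM"] = "false"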

25018528927 commented 1 year ago

Hello! I see that you have the same problem as I do. Did you solve it? If so, how? Looking forward to your answer; I'm quite anxious to get this working. @Markkk111

heychhavi commented 8 months ago

Hi @Markkk111 @25018528927, I am facing a similar issue as well. Were either of you able to solve it?