XiangLi1999 / Diffusion-LM

Apache License 2.0

Controllable Text Generation: after the model trained for 6 epochs and evaluation started, a KeyError: 'eval_loss' was raised #65

Open Markkk111 opened 1 year ago

Markkk111 commented 1 year ago

Hello! I ran into the following problem while training the classifier for the Controllable Text Generation step and would appreciate any help.

  1. The error message is as follows. The huggingface/tokenizers warning is printed repeatedly during the run (a way to silence it is sketched after the code in item 2):

    huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
    • Avoid using tokenizers before the fork if possible
    • Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

    {'loss': 0.0185, 'learning_rate': 7.973102785782901e-06, 'epoch': 5.04}
    {'loss': 0.0185, 'learning_rate': 6.9724623759205894e-06, 'epoch': 5.16}
    {'loss': 0.0188, 'learning_rate': 5.971821966058277e-06, 'epoch': 5.28}
    {'loss': 0.0178, 'learning_rate': 4.971181556195966e-06, 'epoch': 5.4}
    wandb: Network error (ReadTimeout), entering retry loop.
    {'loss': 0.0183, 'learning_rate': 3.970541146333654e-06, 'epoch': 5.52}
    {'loss': 0.018, 'learning_rate': 2.9699007364713415e-06, 'epoch': 5.64}
    {'loss': 0.0179, 'learning_rate': 1.96926032660903e-06, 'epoch': 5.76}
    {'loss': 0.0174, 'learning_rate': 9.68619916746718e-07, 'epoch': 5.88}
    [INFO|trainer.py:1901] 2023-04-19 17:43:28,689 >>

    Training completed. Do not forget to share your model on huggingface.co/models =)

    {'train_runtime': 1749.401, 'train_samples_per_second': 142.815, 'train_steps_per_second': 14.281, 'train_loss': 0.023130248267032992, 'epoch': 6.0}
    [INFO|trainer.py:2709] 2023-04-19 17:43:28,693 >> Saving model checkpoint to classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None
    [INFO|configuration_utils.py:453] 2023-04-19 17:43:28,694 >> Configuration saved in classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None/config.json
    [INFO|modeling_utils.py:1704] 2023-04-19 17:43:29,841 >> Model weights saved in classifier_models/e2e-tgt-tree_e=6_b=10_m=bert-base-uncased_wikitext-103-raw-v1_101_wp_None/pytorch_model.bin
    ***** train metrics *****
      epoch                    =        6.0
      train_loss               =     0.0231
      train_runtime            = 0:29:09.40
      train_samples            =      41640
      train_samples_per_second =    142.815
      train_steps_per_second   =     14.281
    04/19/2023 17:43:29 - INFO - __main__ - *** Evaluate ***
    [INFO|trainer.py:710] 2023-04-19 17:43:29,848 >> The following columns in the evaluation set don't have a corresponding argument in Classifier_Tree.forward and have been ignored: chart_lst. If chart_lst are not expected by Classifier_Tree.forward, you can safely ignore this message.
    [INFO|trainer.py:2964] 2023-04-19 17:43:29,850 >> ***** Running Evaluation *****
    [INFO|trainer.py:2966] 2023-04-19 17:43:29,850 >>   Num examples = 421
    [INFO|trainer.py:2969] 2023-04-19 17:43:29,851 >>   Batch size = 10
    huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:

  2. The relevant code that caused the error is as follows:

    Training

    if training_args.do_train:
        checkpoint = None
        if training_args.resume_from_checkpoint is not None:
            checkpoint = training_args.resume_from_checkpoint
        elif last_checkpoint is not None:
            checkpoint = last_checkpoint
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload

        metrics = train_result.metrics

        max_train_samples = (
            data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
        )
        metrics["train_samples"] = min(max_train_samples, len(train_dataset))

        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()

    Evaluation

    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        metrics = trainer.evaluate()

        max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
        metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
        try:
            # The KeyError: 'eval_loss' is raised on the next line when
            # trainer.evaluate() returns no eval loss (the except clause only
            # catches OverflowError); see the defensive sketch after this listing.
            perplexity = math.exp(metrics["eval_loss"])
        except OverflowError:
            perplexity = float("inf")
        metrics["perplexity"] = perplexity

        trainer.log_metrics("eval", metrics)
        trainer.save_metrics("eval", metrics)

    kwargs = {"finetuned_from": model_args.model_name_or_path, "tasks": "text-generation"}
    if data_args.dataset_name is not None:
        kwargs["dataset_tags"] = data_args.dataset_name
        if data_args.dataset_config_name is not None:
            kwargs["dataset_args"] = data_args.dataset_config_name
            kwargs["dataset"] = f"{data_args.dataset_name} {data_args.dataset_config_name}"
        else:
            kwargs["dataset"] = data_args.dataset_name

    if training_args.push_to_hub:
        trainer.push_to_hub(**kwargs)
    else:
        trainer.create_model_card(**kwargs)
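The KeyError itself comes from the metrics["eval_loss"] lookup in the evaluation block above: the try/except only catches OverflowError, and trainer.evaluate() typically includes an eval_loss entry only when the model's forward returns a loss for the evaluation batches. If the eval set reaches Classifier_Tree.forward without usable labels, the key is simply missing. A minimal defensive sketch (not the repository's code; add_perplexity is a hypothetical helper, and the real fix is still to make sure the evaluation data lets the model return a loss):

    import logging
    import math

    logger = logging.getLogger(__name__)

    def add_perplexity(metrics: dict) -> dict:
        """Hypothetical helper: only compute perplexity when 'eval_loss' exists.

        Assumption: trainer.evaluate() omitted 'eval_loss' because the model's
        forward returned no loss for the evaluation set (e.g. the label column
        was dropped or never produced).
        """
        eval_loss = metrics.get("eval_loss")
        if eval_loss is None:
            logger.warning(
                "trainer.evaluate() returned no 'eval_loss'; check that the "
                "evaluation dataset still carries the labels that "
                "Classifier_Tree.forward needs to return a loss."
            )
            return metrics
        try:
            metrics["perplexity"] = math.exp(eval_loss)
        except OverflowError:
            metrics["perplexity"] = float("inf")
        return metrics

With such a helper, the evaluation block would call metrics = add_perplexity(metrics) instead of indexing metrics["eval_loss"] directly, so the run finishes and the warning points at the missing loss instead of crashing.

Separately, the repeated huggingface/tokenizers warning in the log of item 1 is unrelated to the KeyError; as the message itself says, it can be silenced by setting TOKENIZERS_PARALLELISM before any tokenizer is used. A minimal sketch, assuming it is placed at the very top of the training script:

    import os

    # Set before tokenizers/transformers are imported or a tokenizer is used,
    # otherwise the fork-time warning can still be printed.
    os.environ["TOKENIZERS_PARALLELISM"] = "false"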

25018528927 commented 1 year ago

Hello! I see that you have the same problem as I do. Did you solve it? If so, how? Looking forward to your answer; I'm quite anxious to get this working. @Markkk111

heychhavi commented 8 months ago

Hi @Markkk111 @25018528927, I am facing a similar issue as well. Were either of you able to solve it?