ServiceNow / picard

PICARD - Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. PICARD is a ServiceNow Research project that was started at Element AI.
https://arxiv.org/abs/2109.05093
Apache License 2.0

Different results between training and eval #40

Open eyuansu62 opened 2 years ago

eyuansu62 commented 2 years ago

Sorry to bother you! But I found another interesting problem. When I train (with train.json), I get mid-training results such as:

      "epoch": 2304.0,
      "eval_exact_match": 0.6460348162475822,
      "eval_exec": 0.6460348162475822,
      "eval_loss": 0.41825902462005615,
      "eval_runtime": 90.718,
      "eval_samples_per_second": 11.398,
      "step": 2304

It can be seen that eval_exact_match is around 0.64.

But if I run evaluation mode (with eval.json), I will get:

   "eval_exact_match": 0.6247582205029013,
    "eval_exec": 0.6431334622823984,
    "eval_loss": 0.41071268916130066,
    "eval_runtime": 244.047,
    "eval_samples": 1034,
    "eval_samples_per_second": 4.237

The eval_exact_match is around 0.62. And the eval.json is:

    "run_name": "t5+picard-spider-eval",
    "model_name_or_path": "train/checkpoint-2304",
    "dataset": "spider",
    "source_prefix": "",
    "schema_serialization_type": "peteshaw",
    "schema_serialization_randomized": false,
    "schema_serialization_with_db_id": true,
    "schema_serialization_with_db_content": true,
    "normalize_query": true,
    "target_with_db_id": true,
    "output_dir": "/eval",
    "cache_dir": "/transformers_cache",
    "do_train": false,
    "do_eval": true,
    "fp16": false,
    "per_device_eval_batch_size": 5,
    "seed": 1,
    "report_to": ["tensorboard"],
    "predict_with_generate": true,
    "num_beams": 4,
    "num_beam_groups": 1,
    "diversity_penalty": 0.0,
    "max_val_samples": 1034,
    "use_picard": false,
    "launch_picard": false,
    "picard_mode": "parse_with_guards",
    "picard_schedule": "incremental",
    "picard_max_tokens_to_check": 2,
    "eval_accumulation_steps": 1,
    "metric_config": "both",
    "val_max_target_length": 512,
    "val_max_time": 1200

The difference is about 2%. Have you ever seen this problem?

tscholak commented 2 years ago

Yes, I've encountered this problem. For this reason I always report the numbers that are reproducible from the saved checkpoints and never those observed during training. I have been unable to pinpoint the origin of the issue, though I think it has to do with mixed-precision training and lossy conversions between floating-point formats when saving the model weights. If I knew how to reproduce this in a minimal example, I'd open an issue with HF transformers.
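
To illustrate the kind of loss I mean, here's a minimal PyTorch sketch (not PICARD code): a float32 tensor generally does not survive a round trip through float16, and such small weight perturbations can be enough to change which beam wins during generation.

    import torch

    # Stand-in for a weight tensor trained in mixed precision.
    w = torch.randn(1000, dtype=torch.float32)

    # Cast down to fp16 (as when saving in half precision) and back up.
    roundtrip = w.to(torch.float16).to(torch.float32)

    print((w == roundtrip).float().mean().item())  # fraction of bit-identical values
    print((w - roundtrip).abs().max().item())      # worst-case deviation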

tscholak commented 2 years ago

@eyuansu62 something I noticed: are you aware that your exact match and exec accuracies are identical? That doesn't seem right. Have you made modifications to that code?

tscholak commented 2 years ago

Another thought: the content-matching code I borrowed from Victoria Lin et al.'s BRIDGE model does not necessarily produce the same column values between runs. This instability can explain the discrepancy partially, but not fully. If you like to stare at diffs, try comparing the predictions_[step].json files between training and evaluation.
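
Something like this would do for a first pass (a sketch only; I'm assuming both files are JSON lists of records with a "prediction" field, aligned by example index, so adjust the paths and key names to your actual output):

    import json

    # Hypothetical paths; point these at the two files you want to compare.
    with open("train/predictions_2304.json") as f:
        a = json.load(f)
    with open("eval/predictions_2304.json") as f:
        b = json.load(f)

    # Assumes parallel lists of records with a "prediction" field.
    for i, (x, y) in enumerate(zip(a, b)):
        if x["prediction"] != y["prediction"]:
            print(f"example {i}:")
            print("  train:", x["prediction"])
            print("  eval: ", y["prediction"])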

eyuansu62 commented 2 years ago

> something I noticed: are you aware that your exact match and exec accuracies are identical? That doesn't seem right. Have you made modifications to that code?

I did not modify the metric code. The identical values at epoch 2304 seem to be a coincidence, because at epoch 3008 I get:

      "epoch": 3008.0,
      "eval_exact_match": 0.6450676982591876,
      "eval_exec": 0.6421663442940039,
      "eval_loss": 0.45334360003471375,
      "eval_runtime": 96.9869,
      "eval_samples_per_second": 10.661,
      "step": 3008

eyuansu62 commented 2 years ago

> content matching code

Recently, I carefully compared the differences between training and evaluation. There are many kinds of errors, such as keyword errors (asc vs. desc), wrong table names, wrong column names, etc. Because I focus on exact match, the column values seem unimportant to me.
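
To be concrete, here is roughly the comparison I have in mind (a crude sketch, not the official Spider metric, which parses each query into components before matching):

    def normalize(sql: str) -> str:
        # Crude normalization: lowercase and collapse whitespace.
        return " ".join(sql.lower().split())

    gold = "SELECT name FROM singer ORDER BY age DESC"
    pred = "SELECT name FROM singer ORDER BY age ASC"

    # A single keyword difference (asc vs. desc) already breaks exact match,
    # no matter what the database content or column values are.
    print(normalize(gold) == normalize(pred))  # False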