Low training metrics - Githubissues

runnerup96 commented 1 year ago

Hello!

I am trying to reproduce the training procedure of training RESDSQL with T5-base on text2sql task. I took the original train.json and dev.json SPIDER files from leaderboard (https://yale-lily.github.io/spider). I follow training procedure given in the README file.

I took the default script parameters from scripts/train/text2sql folder:

I used Roberta Large(https://huggingface.co/roberta-large) for schema item classifier training and T5-base (https://huggingface.co/t5-base) for text2sql training.
I reduced the batch size to 8 from 16 while increasing gradient descent step to 4 from 2 in order to fit into my resources.
I fixed seed 42 along with other seeds for all train phases.

I trained for 128 epochs and evaluated using this repo evaluation script and reevaluated with https://github.com/taoyds/test-suite-sql-eval

I get EM 0.489 and EX 0.527 - while in the paper its stated that this configuration achieves EM 0.717 and EX 0.779 on dev set. Can you please point me where I could get the training wrong? I am evaluating different text2sql model for OOD robustness and choose your solution as one of the current SoTA in that domain.

lihaoyang-ruc commented 1 year ago

Hello,

Is your setup identical to ours? We conduct model training on the Ubuntu operating system, and the specifics of our Python environment are detailed in the README file.

Additionally, since RESDSQL operates in two stages, you have the option to substitute one of the stages with our published checkpoints. This should help determine which part is yielding incorrect results.

lihaoyang-ruc commented 1 year ago

Have you conducted evaluations on all the intermediate checkpoints for the T5-base model? Your results appear to indicate that only the first checkpoint was assessed.

runnerup96 commented 1 year ago

Hello,

Is your setup identical to ours? We conduct model training on the Ubuntu operating system, and the specifics of our Python environment are detailed in the README file.

Additionally, since RESDSQL operates in two stages, you have the option to substitute one of the stages with our published checkpoints. This should help determine which part is yielding incorrect results.

Yes, I've followed README environment installation steps, so I have the same environment.

At first, I've have used your published checkpoints for inference and did get the same results as in the paper. Its the training metrics that do not match in my case.

runnerup96 commented 1 year ago

Have you conducted evaluations on all the intermediate checkpoints for the T5-base model? Your results appear to indicate that only the first checkpoint was assessed.

I've conducted evaluation on the best checkpoint after 128 epochs.

lihaoyang-ruc commented 1 year ago

Have you conducted evaluations on all the intermediate checkpoints for the T5-base model? Your results appear to indicate that only the first checkpoint was assessed.

I've conducted evaluation on the best checkpoint after 128 epochs.

What you are saying is that you have evaluated all intermediate checkpoints (there are a total of 85 checkpoints for training 128 epochs) and the performance of the best of them is EM 0.489 and EX 0.527?

runnerup96 commented 1 year ago

What you are saying is that you have evaluated all intermediate checkpoints (there are a total of 85 checkpoints for training 128 epochs) and the performance of the best of them is EM 0.489 and EX 0.527?

For evaluation I have used the original evaluate_text2sql_ckpts.py for all saved checkpoints.

runnerup96 commented 1 year ago

Are you using exact these initial checkpoints?

https://huggingface.co/roberta-large https://huggingface.co/t5-base

lihaoyang-ruc commented 1 year ago

Are you using exact these initial checkpoints?

https://huggingface.co/roberta-large https://huggingface.co/t5-base

Yes

lihaoyang-ruc commented 1 year ago

Can you show your TensorBoard logs for training the schema item classifier and T5-base? Other than that, can you share the evaluation results if it's convenient? Like this:

runnerup96 commented 1 year ago

Sure!

I decided to spot the source of low training error so I took your item classifier checkpoint(text2sql_schema_item_classifier) and started training just text2sql stage. I get about same metrics as before:

Best EM ckpt: {'ckpt': './models/spider_no_others/text2sql-t5-base_1/checkpoint-101169', 'EM': 0.48936170212765956, 'EXEC': 0.5270793036750484} Best EXEC ckpt: {'ckpt': './models/spider_no_others/text2sql-t5-base_1/checkpoint-101169', 'EM': 0.48936170212765956, 'EXEC': 0.5270793036750484} Best EM+EXEC ckpt: {'ckpt': './models/spider_no_others/text2sql-t5-base_1/checkpoint-101169', 'EM': 0.48936170212765956, 'EXEC': 0.5270793036750484}

Looks like the problem in training t5 model.

For space optimization i keep only last 10 checkpoints.

I attach my training loss pictures from tensorboard and csv with loss history.

run-.-tag-train loss.csv

lihaoyang-ruc commented 1 year ago

Your training losses appear to be correct.

I suspect the issue may lie with the evaluation scripts or the pre-processed dataset.

Could you delete the evaluation result files (named checkpoints-*.txt) and rerun the evaluate_text2sql_ckpts.py script?
If the performance remains atypical, I recommend using our pre-processed spider-dev dataset for another evaluation attempt: resdsql_test.json

runnerup96 commented 1 year ago

Wow, the first suggestion actually worked!

Best EM ckpt: {'ckpt': './models/spider_no_others/text2sql-t5-base_1/checkpoint-101169', 'EM': 0.7156673114119922, 'EXEC': 0.7611218568665378} Best EXEC ckpt: {'ckpt': './models/spider_no_others/text2sql-t5-base_1/checkpoint-101169', 'EM': 0.7156673114119922, 'EXEC': 0.7611218568665378} Best EM+EXEC ckpt: {'ckpt': './models/spider_no_others/text2sql-t5-base_1/checkpoint-101169', 'EM': 0.7156673114119922, 'EXEC': 0.7611218568665378}

How come the original training procedure in scripts/train/text2sql/train_text2sql_t5_base.sh gave different results?

lihaoyang-ruc commented 1 year ago

I don't know the exact reason for your problem, but after I saw your training loss curve, I thought your training procedure was fine (because I used to look at it every day for a while).

Initial speculation is that there may be a problem with this piece of code in evaluate_text2sql_ckpts.py.

Specifically, I wrote this piece at the time to enable multi-process (multi-GPU) evaluation, and while it worked fine for me at the time, it does have some hidden problems. For example, if you have new checkpoints with the same name as checkpoints that have already been evaluated, our script will skip evaluating the new checkpoints and continue to keep the results of the old checkpoints.

Thanks for the tip, I'm going to submit a PR to fix this.

lihaoyang-ruc commented 11 months ago

This issue is being closed due to a lack of activity. However, if you continue to encounter problems and need further assistance, please don't hesitate to reopen it, and we will be more than happy to help with additional troubleshooting.

RUCKBReasoning / RESDSQL

Low training metrics #57