Closed runnerup96 closed 11 months ago
Hello,
Is your setup identical to ours? We conduct model training on the Ubuntu operating system, and the specifics of our Python environment are detailed in the README file.
Additionally, since RESDSQL operates in two stages, you have the option to substitute one of the stages with our published checkpoints. This should help determine which part is yielding incorrect results.
Have you conducted evaluations on all the intermediate checkpoints for the T5-base model? Your results appear to indicate that only the first checkpoint was assessed.
Yes, I followed the README environment installation steps, so I have the same environment.
At first, I used your published checkpoints for inference and got the same results as in the paper. It's the training metrics that do not match in my case.
I've conducted evaluation on the best checkpoint after 128 epochs.
What you are saying is that you have evaluated all intermediate checkpoints (there are a total of 85 checkpoints for training 128 epochs) and the performance of the best of them is EM 0.489 and EX 0.527?
For evaluation I used the original evaluate_text2sql_ckpts.py script on all saved checkpoints.
Are you using exactly these initial checkpoints?
https://huggingface.co/roberta-large https://huggingface.co/t5-base
Yes
Can you show your TensorBoard logs for training the schema item classifier and T5-base? Also, could you share the evaluation results, if it's convenient? Like this:
Sure!
I decided to pinpoint the source of the low training metrics, so I took your schema item classifier checkpoint (text2sql_schema_item_classifier) and trained only the text2sql stage. I get about the same metrics as before:
Best EM ckpt: {'ckpt': './models/spider_no_others/text2sql-t5-base_1/checkpoint-101169', 'EM': 0.48936170212765956, 'EXEC': 0.5270793036750484}
Best EXEC ckpt: {'ckpt': './models/spider_no_others/text2sql-t5-base_1/checkpoint-101169', 'EM': 0.48936170212765956, 'EXEC': 0.5270793036750484}
Best EM+EXEC ckpt: {'ckpt': './models/spider_no_others/text2sql-t5-base_1/checkpoint-101169', 'EM': 0.48936170212765956, 'EXEC': 0.5270793036750484}
Looks like the problem is in training the T5 model.
To save disk space I keep only the last 10 checkpoints.
I attach my training loss plots from TensorBoard and a CSV with the loss history.
Your training losses appear to be correct.
I suspect the issue may lie with the evaluation scripts or the pre-processed dataset.
Could you delete the evaluation result files (named checkpoints-*.txt) and rerun the evaluate_text2sql_ckpts.py script?
If the performance remains atypical, I recommend using our pre-processed spider-dev dataset for another evaluation attempt: resdsql_test.json
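The cleanup step above could be sketched like this. This is only an illustrative snippet, not code from the repo: the directory is a placeholder, and the `checkpoints-*.txt` pattern is taken from the message above; point both at wherever your evaluation script writes its result files.

```python
import glob
import os
import tempfile

def clear_stale_results(result_dir, pattern="checkpoints-*.txt"):
    """Delete cached evaluation result files so every checkpoint is re-evaluated."""
    removed = []
    for path in glob.glob(os.path.join(result_dir, pattern)):
        os.remove(path)
        removed.append(path)
    return removed

# Demo on a throwaway directory with two fake result files.
demo_dir = tempfile.mkdtemp()
for name in ("checkpoints-101169.txt", "checkpoints-99999.txt"):
    open(os.path.join(demo_dir, name), "w").close()

print(len(clear_stale_results(demo_dir)))  # 2
```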
Wow, the first suggestion actually worked!
Best EM ckpt: {'ckpt': './models/spider_no_others/text2sql-t5-base_1/checkpoint-101169', 'EM': 0.7156673114119922, 'EXEC': 0.7611218568665378}
Best EXEC ckpt: {'ckpt': './models/spider_no_others/text2sql-t5-base_1/checkpoint-101169', 'EM': 0.7156673114119922, 'EXEC': 0.7611218568665378}
Best EM+EXEC ckpt: {'ckpt': './models/spider_no_others/text2sql-t5-base_1/checkpoint-101169', 'EM': 0.7156673114119922, 'EXEC': 0.7611218568665378}
How come the original training procedure in scripts/train/text2sql/train_text2sql_t5_base.sh gave different results?
I don't know the exact reason for your problem, but after seeing your training loss curve, I believe your training procedure was fine (I used to look at those curves every day for a while).
My initial speculation is that there is a problem with this piece of code in evaluate_text2sql_ckpts.py.
Specifically, I wrote that piece to enable multi-process (multi-GPU) evaluation. While it worked fine for me at the time, it has a hidden problem: if a new checkpoint has the same name as a checkpoint that has already been evaluated, the script skips evaluating the new checkpoint and keeps the result of the old checkpoint.
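A minimal sketch of that failure mode, assuming a result-file-per-checkpoint cache (all names here are illustrative, not the actual code from evaluate_text2sql_ckpts.py):

```python
import os
import tempfile

def evaluate_all(ckpt_names, result_dir, evaluate_fn):
    """Evaluate checkpoints, caching one result file per checkpoint name.

    The cache check below is the hidden problem: a result file left over
    from an earlier run shadows a freshly trained checkpoint that happens
    to have the same name, so the new checkpoint is never evaluated.
    """
    results = {}
    for name in ckpt_names:
        result_file = os.path.join(result_dir, name + ".txt")
        if os.path.exists(result_file):
            # Stale file from a previous run: reuse the old result, skip eval.
            with open(result_file) as f:
                results[name] = f.read()
            continue
        results[name] = evaluate_fn(name)
        with open(result_file, "w") as f:
            f.write(results[name])
    return results

# Demo: a stale result file from an old run shadows the new checkpoint.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "checkpoint-101169.txt"), "w") as f:
    f.write("EM 0.489")                      # old, stale result
out = evaluate_all(["checkpoint-101169"], demo_dir, lambda n: "EM 0.716")
print(out["checkpoint-101169"])              # EM 0.489 -- the stale result wins
```

Deleting the cached result files before re-running removes the stale entries, which is why the cleanup suggestion above fixed the reported numbers.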
Thanks for the tip, I'm going to submit a PR to fix this.
This issue is being closed due to a lack of activity. However, if you continue to encounter problems and need further assistance, please don't hesitate to reopen it, and we will be more than happy to help with additional troubleshooting.
Hello!
I am trying to reproduce the training procedure of RESDSQL with T5-base on the text2sql task. I took the original train.json and dev.json SPIDER files from the leaderboard (https://yale-lily.github.io/spider) and followed the training procedure given in the README file, using the default script parameters from the scripts/train/text2sql folder. I trained for 128 epochs, evaluated with this repo's evaluation script, and re-evaluated with https://github.com/taoyds/test-suite-sql-eval.
I get EM 0.489 and EX 0.527, while the paper states that this configuration achieves EM 0.717 and EX 0.779 on the dev set. Can you please point me to where I could have gone wrong with the training? I am evaluating different text2sql models for OOD robustness and chose your solution as one of the current SoTA in that domain.