Bahuia / SSCL-Text2SQL

Some Questions About Evaluation Methods? #2

Closed tom68-ll closed 1 year ago

tom68-ll commented 1 year ago

Hello, author. Thank you for your work. We have some questions about the final evaluation method. Could you please explain how the results in Table 1 of the paper are calculated after obtaining the four lists from the code below?

https://github.com/Bahuia/SSCL-Text2SQL/blob/99a28dd0d8b5d4dc61a1bb5d223f6ec8b8cbbedf/train_sfnet.py#L44

Specifically, do we need to average the ten values in avg_acc_list and whole_acc_list separately to obtain ACC_a and ACC_w in Table 1? And how do we calculate BWT and FWT in Table 1 from bwt_list and fwt_list?

tom68-ll commented 1 year ago

Hello, author. We are eager to know whether the results reported in Table 1 of the paper represent the average performance across ten tasks or the model's performance on the final task. We look forward to receiving your response. Thank you!

Bahuia commented 1 year ago

Sorry for the late reply! We reported the model's performance after training on the final task.
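In other words, a minimal sketch of that reading, assuming each of the four lists in train_sfnet.py holds one entry per completed task so that the paper's numbers are the last entries (the function name below is hypothetical, not the repo's API):

```python
# Hypothetical sketch: take the metrics recorded after training on the
# final task, assuming each list has one entry per completed task.
def table1_metrics(avg_acc_list, whole_acc_list, bwt_list, fwt_list):
    return {
        "ACC_a": avg_acc_list[-1],    # presumably the average accuracy over all seen tasks
        "ACC_w": whole_acc_list[-1],  # presumably the accuracy on the combined test set
        "BWT": bwt_list[-1],          # backward transfer after the last task
        "FWT": fwt_list[-1],          # forward transfer after the last task
    }
```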

tom68-ll commented 1 year ago

Thank you very much for your response. We have another question: are the results reported in the paper obtained using the configurations in the two .sh files in the repository? We attempted to replicate the experiments on WikiSQL but did not reach the reported performance (it was even worse than the fine-tuning baselines). We hope to receive your guidance. Thank you. @Bahuia

Bahuia commented 1 year ago

I apologize for the inconvenience. I recently refactored the original code but only tested it on Spider. I have been busy with submission deadlines, so please allow me some time to make the necessary adjustments.

tom68-ll commented 1 year ago

Thank you! We also look forward to your follow-up adjustments. In addition, we have some uncertainty about the 'FWT' evaluation strategy in the code:

https://github.com/Bahuia/SSCL-Text2SQL/blob/99a28dd0d8b5d4dc61a1bb5d223f6ec8b8cbbedf/sfnet/basic_trainer.py#L248

Regarding 'acc_rand_list': in the paper it refers to the accuracy of a randomly initialized model on each task's test set, but in the code we noticed it is consistently defined as 0. We can imagine that a randomly initialized model is unlikely to produce any correct predictions on the text-to-SQL task, so it remains 0. Is this understanding correct?

Bahuia commented 1 year ago

Yes. We actually evaluated acc_rand_list several times before, and it was 0 every time, so we simply omit that part here to save training time.
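For reference, a minimal sketch of the standard GEM-style BWT/FWT definitions from the continual-learning literature, with the random baseline fixed to 0 as confirmed above (the accuracy matrix R is our own notation, not the repo's):

```python
# Sketch of GEM-style transfer metrics. R[i][j] is test accuracy on task j
# after finishing training on task i (0-indexed, T tasks total).

def bwt(R):
    # Backward transfer: how much earlier tasks degrade (or improve)
    # after training on the final task.
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)

def fwt(R, acc_rand=None):
    # Forward transfer: zero-shot accuracy on each task before training
    # on it, minus the random-model baseline (always 0 here).
    T = len(R)
    if acc_rand is None:
        acc_rand = [0.0] * T  # matches the constant-zero acc_rand_list
    return sum(R[j - 1][j] - acc_rand[j] for j in range(1, T)) / (T - 1)
```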

tom68-ll commented 1 year ago

Thank you for your patient responses. We apologize for any inconvenience caused, and we look forward to your prompt resolution of the bug on WikiSQL.

tom68-ll commented 1 year ago

Dear author, we further tested the method presented in your paper on WikiSQL, but the results were still unsatisfactory. We tried to rectify the issue ourselves but have not found an effective solution. May we ask when you expect this problem to be resolved?

Bahuia commented 1 year ago

I apologize for being busy all this time. I will strive to provide you with a response within the next two to three weeks.

Bahuia commented 1 year ago

Hi, I have fixed the issue with the WikiSQL results. The bug primarily stems from the to_examples function in sfnet/utils.py: the input enhancement with the linked schema that IRNet performs for the Spider dataset is not applicable to WikiSQL.

https://github.com/Bahuia/SSCL-Text2SQL/blob/a6a11b918d169c8bfbf2e6df387d76dde85688fe/sfnet/utils.py#L343-L393

In my initial release I mistakenly applied it to both WikiSQL and Spider, which led to the unexpected results. You can now run the train_wikisql.sh script directly to reproduce the results reported in the paper.
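In other words, the fix amounts to gating the schema-linking enhancement on the dataset, roughly like this (a hypothetical sketch; the names are illustrative, and the real logic lives in to_examples in sfnet/utils.py):

```python
# Hypothetical sketch of the fix: apply the IRNet-style linked-schema
# input enhancement only for Spider, never for WikiSQL.
def build_model_input(question_tokens, linked_schema_tokens, dataset_name):
    if dataset_name == "spider":
        # IRNet-style enhancement: augment the question with linked schema items.
        return question_tokens + ["[SEP]"] + linked_schema_tokens
    # WikiSQL: use the raw question; the enhancement degrades results here.
    return question_tokens
```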

tom68-ll commented 1 year ago

Thank you very much for your response and efforts.