AlibabaResearch / DAMO-ConvAI

DAMO-ConvAI: The official repository which contains the codebase for Alibaba DAMO Conversational AI.
MIT License

Questions for reproducing / comparing with SpokenWOZ baselines. #122

Open ArneNx opened 8 months ago

ArneNx commented 8 months ago

Hello,

I am currently trying to evaluate models that I trained on SpokenWOZ in order to compare with the baselines you reported in the paper. In doing so, I'm running into some issues:

  1. Which evaluation script should be used to report the results? I'm currently using this script from space-word and I'm failing to get numbers close to the ones you report (20% less than what you report for inform and success, while reaching a higher BLEU score). Also, which settings do you use exactly for the final evaluation (e.g. how do you set same_eval_as_cambridge and use_true_domain_for_ctr_eval)?
  2. Do you have the outputs or the trained model parameters of any of the baseline models available somewhere to verify the evaluation procedure? (I need to adapt it to fit my code base and want to check that I get the same results as you.)
  3. When trying to run the training of space-word myself, the training runs for a few iterations and then crashes because the .npy file for SNG1724 is missing. I see that the dialogue exists in the original data, but it is not preprocessed correctly for some reason. Do you have an explanation for this?
S1s-Z commented 8 months ago

Thank you for being so interested in SpokenWOZ.

  1. We used and modified SPACE's evaluation methods (e.g., adding slots, adding domains, etc.), but since the code was completed in 2022 and we have forgotten some of the details, we encourage you to step through our code and debug it to find out which parameters were passed in. Meanwhile, we also encourage you to implement the measurements on your own; specifically, you can refer to MultiWOZ_Evaluation and add the correct slots, domains, etc. (see the sketch after this list).
  2. As explained in 1, we did not save the checkpoints of the baselines. You may consider trying to reproduce our code.
  3. Please check the code in dataset.py and utils_dst.py. Meanwhile, check that trainListFile.json and valListFile.json are set up correctly. If this is still an issue, the easiest workaround is to exclude that dialogue from training.
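
For anyone taking that route, here is a minimal sketch of scoring predictions with the standalone MultiWOZ_Evaluation toolkit (the `mwzeval` package); the input format and the `Evaluator` call follow that project's README, and the SpokenWOZ-specific slots and domains would still need to be added to its normalization before the numbers are meaningful:

```python
# Minimal sketch using the standalone MultiWOZ_Evaluation toolkit (mwzeval).
# Predictions are keyed by lowercased dialogue ID without the .json suffix;
# each turn provides at least a delexicalized "response" (and optionally a
# "state" for JGA). The dialogue ID and values below are hypothetical.
from mwzeval.metrics import Evaluator

my_predictions = {
    "sng1724": [
        {
            "response": "the [value_name] is located in the [value_area] .",
            "state": {"restaurant": {"area": "centre"}},
        },
    ],
}

evaluator = Evaluator(bleu=True, success=True, richness=False)
print(evaluator.evaluate(my_predictions))
```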
ArneNx commented 8 months ago

Thanks for the quick response. I managed to get the training running now (it is still running at the moment). It turns out the trainListFile.json and valListFile.json were wrong; I created them based on the data you provided.
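
For reference, a rough sketch of how such list files could be regenerated, assuming they are plain lists of dialogue IDs (one per line, as in MultiWOZ) and that data.json in the train/dev release maps dialogue IDs to annotations; the paths and split logic are assumptions and should be checked against the released data:

```python
# Hypothetical sketch: rebuild trainListFile.json from the released
# train/dev annotations. Assumes data.json maps dialogue IDs to annotations
# and valListFile.json is a plain ID list; the test dialogues are released
# separately, so everything outside the dev list is treated as training data.
import json

with open("data.json", encoding="utf-8") as f:
    all_ids = set(json.load(f).keys())

with open("valListFile.json", encoding="utf-8") as f:
    val_ids = {line.strip() for line in f if line.strip()}

train_ids = sorted(all_ids - val_ids)

with open("trainListFile.json", "w", encoding="utf-8") as f:
    f.write("\n".join(train_ids) + "\n")
```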

For the evaluation, I'm trying to stay as comparable to your baselines as possible, so switching to another evaluation script might introduce changes that prevent this.

ArneNx commented 8 months ago

I have now finished the training with the scripts you provided. So far I only have results on the validation set; however, in my experience those numbers are very close to the numbers on the test set. The resulting model has the following scores:

Looking at these results, there is still quite a gap to the results reported in the paper (I'll also try to get results on the test set asap). What am I missing? Two questions come to mind:

  1. Which pre-trained model of Space-3 did you use?
  2. Did you maybe train for more than 25 epochs?
S1s-Z commented 8 months ago

Since our paper does not report results on dev, we encourage you to try to reproduce our results on test. However, it is worth noting that the dev and test sets differ in distribution as well as in the number of dialogues (500 vs. 1000), so this part of the difference is acceptable while the current metrics are all relatively low, e.g., the JGA differs by only 1%. Also, since the model can easily overfit, you could test the checkpoints of different epochs on the response generation tasks to get better performance.

  1. We used the SPACE model from this link.
  2. Because we are only trying to report the results of baselines and current models, we did not carefully train or tune the reported baselines, e.g., by tuning the hyperparameters or the number of epochs.
ArneNx commented 8 months ago

Thanks for providing the base model. I think I'm getting close to the original numbers now. Did you use the same checkpoint for the evaluation of all three tasks (DST, E2E, and policy evaluation), or did you select a checkpoint for each task separately based on validation performance?

Unfortunately, when I analysed the outputs and scores from the training run with your scripts, I found two additional issues:

  1. From the MultiWOZ and TOD literature, I gathered that it is common practice to evaluate BLEU on delexicalized responses and consequently also compare against delexicalized versions of the ground-truth transcriptions. However, your evaluation scripts (for space-word) compute BLEU against the original, non-delexicalized references. Fixing this gives an improvement of ~2 BLEU. As far as I can tell this is not intended and also doesn't match the scoring of the zero-shot experiments you report.
  2. The preprocessing of the training data produces a misspelling of the slot [address], causing every slot of this kind to be mismatched! Correcting this in post-processing gives a significant improvement in the success score.

As these issues make it hard to compare against the numbers from the paper, I have now decided to take your earlier advice: do the evaluation with the MultiWOZ_Evaluation script and rescore the baseline I get from your training so that the results remain comparable.
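
For context, this is roughly the kind of post-processing I mean for the second issue; the actual misspelled token has to be looked up in the preprocessed data, so the variants below are only placeholders:

```python
# Hypothetical post-processing sketch: map misspelled address placeholders in
# generated (and reference) responses back to the canonical [address] slot
# before scoring. The listed typos are placeholders; check the preprocessed
# data for the exact misspelling produced by the preprocessing script.
ADDRESS_TYPOS = ("[adress]", "[addres]", "[adresss]")

def fix_address_slot(response: str) -> str:
    for typo in ADDRESS_TYPOS:
        response = response.replace(typo, "[address]")
    return response

print(fix_address_slot("the hotel is at [adress] in the [value_area] ."))
# -> "the hotel is at [address] in the [value_area] ."
```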

S1s-Z commented 8 months ago

For the different tasks, as in previous works, we used different checkpoints to report the results. Thank you for pointing out the bugs in our code; we will follow up and update the fixed results in the leaderboard and the arXiv paper. For a fair comparison (including machine environments), you can report your reproduced results as the final results in your paper.

ArneNx commented 8 months ago

Hey, thanks for your continuing commitment to the project!

I just want to emphasise again that in order to remain comparable to future results, a standardised evaluation framework is required. So it would be great if you evaluated the new "fixed" results with the MultiWOZ_Evaluation tool as well. I'm planning to open PRs for both this repo (for the preprocessing of the ground-truth data) and the MultiWOZ_Evaluation tool soon so that this is easily reproducible.

One more thing: I saw that you're skipping all "reference" slots in the success computation. What is the reasoning for this?

S1s-Z commented 7 months ago

Sorry for the delay. REFERENCE is one of the requestable slots used to compute the SUCCESS score in MultiWOZ; it indicates booking availability. In SpokenWOZ, we assume all of the user's bookings are completed successfully (we asked the SYSTEM side to make successful bookings during data collection), so the reference slot does not need to be considered in the success computation.
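
(For anyone following along, a toy illustration of that exclusion, assuming requestable slots are tracked per domain; this is only a sketch of the idea, not the repository's actual evaluator code.)

```python
# Toy sketch: a domain counts towards SUCCESS if every requested slot was
# provided by the system, with 'reference' excluded because bookings in
# SpokenWOZ are assumed to always succeed.
SKIPPED_REQUESTABLES = {"reference"}

def domain_success(requested: set, provided: set) -> bool:
    needed = requested - SKIPPED_REQUESTABLES
    return needed.issubset(provided)

# The user asked for phone, address and a booking reference; the system only
# gave phone and address, which still counts as a success under this rule.
print(domain_success({"phone", "address", "reference"}, {"phone", "address"}))  # True
```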

harisgulzar1 commented 2 months ago

@ArneNx @S1s-Z This issue thread has been quite informative for me. Which script did you use to train SPACE on the SpokenWOZ dataset? The original SPACE fine-tuning scripts don't seem to include SpokenWOZ. I would appreciate it if you could roughly describe the steps you took, in case you modified the existing scripts yourself. Thanks!

S1s-Z commented 2 months ago

Check the scripts at this link: spokenwoz/Finetuning/space_baseline/space-3/scripts/train. Meanwhile, the scripts in the space-3 folder are used for training the text-only baselines.

harisgulzar1 commented 2 months ago

@S1s-Z Thanks for your response. Before running the training script under spokenwoz/Finetuning/space_baseline/space-3/scripts/train, I need to prepare the data by running spokenwoz/Finetuning/space_baseline/data_process.sh, which ultimately runs /spokenwoz/Finetuning/ubar/data_analysis.py, and this code seems to be written for the MultiWOZ dataset. Could you provide the modified code for SpokenWOZ data preparation for fine-tuning SPACE-3? Or can you point me to the code I should use for data preparation to fine-tune the SPACE baselines? Thanks!

@ArneNx I would really appreciate any input from you as well.

harisgulzar1 commented 2 months ago

Also, the training script /spokenwoz/Finetuning/space_baseline/space-3/run_gen.py seems to be written for MultiWOZ, with no apparent mention of the SpokenWOZ dataset in it.

S1s-Z commented 2 months ago

The text data in SpokenWOZ is organized in the same way as MultiWOZ, so we kept the comments from the original SPACE code (due to our laziness, we didn't reorganize the SPACE model code, which caused your misunderstanding).

In our released code, we have already modified the slots and domains in the SPACE code used for training on SpokenWOZ, even though the comments still refer to MultiWOZ. Therefore, you don't need to do any additional data preprocessing.

harisgulzar1 commented 1 month ago

@S1s-Z Thanks for your comments. I was able to train the SPACE-3 model on SpokenWOZ. I have evaluated the JGA of SPACE-3 and got reasonably close to the reported results on the dev data.

Can you please answer the following questions for my further experiments?

  1. When I evaluate the above model in the policy and E2E settings, it does not seem to produce any output and all metrics are 0%. Do I need to fine-tune the model with different parameters in train_space.sh? If so, what should those parameters be for policy optimization and E2E evaluation?

  2. For evaluating the model on the test set, the files listed in testList.json don't seem to have associated dialogues in the dataset. Are the test files part of the original SpokenWOZ release? Where can I find the test dialogues?

S1s-Z commented 1 month ago

Thanks for your interest.

  1. What do you mean by accuracy in the policy and E2E settings? If you want to reproduce the INFORM, SUCCESS, and BLEU results reported in our paper, you may need to use the different checkpoints mentioned in the paper.
  2. You can download testListFile.json as part of the SpokenWOZ text test set at https://spokenwoz.github.io/.
harisgulzar1 commented 1 month ago

Thanks for the quick response.

  1. I mean, after training SPACE-3, I run /dst/infer_space.sh and it shows a JGA of around 12%, which is fine, but when I run /e2e/infer_space.sh or /policy/infer_space.sh they produce 0% on all metrics.

  2. Thanks

S1s-Z commented 1 month ago

You mean you are getting 0% JGA results from /e2e/infer_space.sh or /policy/infer_space.sh? That is because /e2e/infer_space.sh and /policy/infer_space.sh are designed to compute SUCCESS, BLEU, and INFORM for the response generation task rather than the JGA result for the DST task. You can check parameters such as USE_TRUE_PREV_BSPN and USE_TRUE_PREV_ASPN in the three scripts.
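
(Not from the repository itself, but for orientation: a rough, unverified summary of how such oracle-input flags are typically set per task in UBAR-style code; the actual names and values should be confirmed in the three infer_space.sh scripts.)

```python
# Hypothetical summary of UBAR-style oracle-input flags per evaluation
# setting; confirm the real values in dst/, policy/ and e2e/infer_space.sh.
SETTINGS = {
    # Policy optimization: condition on ground-truth belief states, generate
    # dialogue acts and responses.
    "policy": {"use_true_curr_bspn": True, "use_true_prev_bspn": True, "use_true_prev_aspn": True},
    # End-to-end: belief state, acts, and response are all generated.
    "e2e": {"use_true_curr_bspn": False, "use_true_prev_bspn": False, "use_true_prev_aspn": False},
}

for task, flags in SETTINGS.items():
    print(task, flags)
```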

harisgulzar1 commented 1 month ago

No, I am not using /e2e/infer_space.sh or /policy/infer_space.sh to get JGA. What I meant is that running these scripts results in a value of 0 for all metrics: match: 0.00 success: 0.00 bleu: 0.00 score: 0.00. I am using the same model as trained for DST by running /train/train_space.sh. Should I use different settings for training to get the inference working correctly for SUCCESS, BLEU, and INFORM? If yes, what should I change in the parameters of the training script? If no, what could be the reason for match: 0.00 success: 0.00 bleu: 0.00 score: 0.00?

And by the way, where can I find the explanation of parameters like USE_TRUE_PREV_BSPN, USE_TRUE_PREV_ASPN, etc.? Thanks.

harisgulzar1 commented 3 weeks ago

Hi @S1s-Z, your comments on the above questions would be much appreciated. Thanks!

S1s-Z commented 3 weeks ago

Sorry for the delay. We checked the code before we uploaded it and did not run into this problem at the time. For now, you can refer to our other baselines, for example SPACE-word; that model can reproduce the results as mentioned above.

I will try to reproduce the results based on SPACE in the next few days to check whether there are any mistakes.

In the meantime, have you tried using the original SPACE code (https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/space-3) to reproduce the results? You may need to modify the slots and domains to reproduce the results based on the original SPACE code.

harisgulzar1 commented 3 weeks ago

@S1s-Z Thanks for your response. I will look into the original code of SPACE.

harisgulzar1 commented 3 weeks ago

@S1s-Z I would also be grateful if you could point out the parameter settings in train_space.sh for the DST and dialogue generation tasks so that I can reproduce the results on my side as well.