LIN-SHANG / InstructERC

The official implementation of InstructERC

Low performance when using demonstration retrieval #15

Open matthewclso opened 3 months ago

matthewclso commented 3 months ago

I could not find the demonstration retrieval code in code/data_process.py, so I wrote my own implementation using sentence_transformers with the all-mpnet-base-v2 model. I followed the method outlined in the paper: retrieve the top-1 most similar utterance from the training set, using same-label pairing for training and all-labels pairing for dev/test.
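For reference, here is a minimal sketch of my retrieval step, assuming the training set is available as (utterance, label) pairs; the function and variable names (build_index, retrieve_demo, etc.) are placeholders from my own code, not from this repo:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def build_index(train_pairs):
    """Encode every training utterance once; train_pairs is [(utterance, label), ...]."""
    utts = [u for u, _ in train_pairs]
    labels = [l for _, l in train_pairs]
    embs = encoder.encode(utts, convert_to_tensor=True, normalize_embeddings=True)
    return utts, labels, embs

def retrieve_demo(query, query_label, utts, labels, embs, same_label=True):
    """Return the top-1 training utterance (and its label) as the demonstration.

    same_label=True  -> training-time pairing: only candidates with the same
                        gold label as the query are considered.
    same_label=False -> dev/test-time pairing: candidates from all labels.
    (When retrieving for a training example, the query itself should also be
    excluded from the candidate pool.)
    """
    q = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(q, embs)[0]  # cosine similarity to every candidate
    if same_label:
        for i, lab in enumerate(labels):
            if lab != query_label:
                sims[i] = -1.0       # mask out candidates with a different label
    best = int(sims.argmax())
    return utts[best], labels[best]
```

The retrieved utterance and its emotion then go into the prompt as the demonstration.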

Without demonstration retrieval, I was able to come close to the paper's results. With demonstration retrieval, however, my F1-score dropped from ~69 to ~25 on MELD (with similar drops on the other datasets). During training the model simply copies the demonstration's emotion label to minimize the loss, which leads to extreme overfitting; at validation time this almost never works, since dev/test use all-labels pairing.

Could you share the demonstration retrieval code so we can take a look? As far as I can tell, this approach is not effective.

zehuiwu commented 2 months ago

Hi Matthew! How did you manage to get 69? I used the default settings and tried many seeds, but the average F1 is around 65-66; the best seed gives around 67-68.

I ran each seed for 15 epochs instead of 6 to get better scores.
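(For concreteness, this is roughly how I compute the multi-seed average; the label lists below are made-up placeholders, and in practice each pair comes from one seed's test-set predictions.)

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical (gold, predicted) label lists, one pair per random seed.
runs = [
    (["joy", "anger", "neutral", "joy"], ["joy", "neutral", "neutral", "joy"]),
    (["joy", "anger", "neutral", "joy"], ["joy", "anger", "neutral", "sadness"]),
]

# MELD results are usually reported as weighted F1, so average that across seeds.
scores = [100 * f1_score(gold, pred, average="weighted") for gold, pred in runs]
print(f"per-seed weighted F1: {[round(s, 1) for s in scores]}")
print(f"mean = {np.mean(scores):.1f}, std = {np.std(scores):.1f}")
```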

I also tried the auxiliary tasks, but they were not as effective as claimed in the paper. For both datasets, the scores are lower than reported, especially for MELD.

What settings did you use to get this score? Is it just a lucky seed, or is your average 69?

Thank you so much!

matthewclso commented 2 months ago


Hi Zehui! I must have gotten lucky. My runs also average around 65-66, so it appears that not only can we not reproduce the demonstration retrieval results, we can't reproduce the other reported results either. This is quite concerning.