ga642381 / SpeechPrompt-v2

"SpeechPrompt v2: Prompt Tuning for Speech Classification Tasks" (speech processing with the prompting paradigm)
https://ga642381.github.io/SpeechPrompt/

The accuracy of `SCR_grabo` seems to be unstable and not competitive with the results in paper #5

Closed: liggest closed this issue 2 weeks ago

liggest commented 3 weeks ago

Hello, I've tested the datasets in the SCR downstream task, since it's the first task listed in the paper's result table.

Here are some of my results:

| downstream | correct / total | accuracy |
| --- | --- | --- |
| SCR_google_speech_commands | 2906 / 3081 | 0.9432002596559559 |
| SCR_grabo | 2262 / 3631 | 0.6229688790966675 |
| SCR_grabo2 | 707 / 3631 | 0.1947122004957312 |
| SCR_grabo_learnable | 794 / 3631 | 0.2186725419994492 |
| SCR_ar_speech_commands | 327 / 335 | 0.9761194029850746 |
| SCR_lt_speech_commands | 89 / 98 | 0.9081632653061225 |

Basically I followed the README and only changed `--downstream` in each run, except for SCR_grabo_learnable, for which I also applied `--method learnable`. I also grabbed some files from the SpeechPrompt v1 repository to make the code runnable on my machine.

The accuracy in my results is somewhat off from the paper, especially for `SCR_grabo`. I didn't get a stable result on that dataset, as there's a large gap between the two runs. Also, neither the freq method nor the learnable method reaches performance competitive with the results in the paper (freq $0.924$ and learnable $0.927$).

I'd like to know why I haven't gotten the expected results; am I missing something? In addition, the paper also mentions pGSLM, but I couldn't find related code in the repository. Could you tell me how to reproduce the results for that model? Thank you.

Below is the script I used to calculate the accuracy of each run. I assigned the proper path to the `base` variable and simply ran `python accuracy.py`.

accuracy.py


import json
import sys
from pathlib import Path

base = Path("PATH/TO/GSLM/exp_results")

patterns = sys.argv[1:] or ["*/*/samples"]  # glob patterns under base; default matches every */*/samples directory

def count_correct(path: Path):
    # Count entries whose prediction matches the ground-truth label
    with path.open("r", encoding="utf-8") as f:
        result: dict = json.load(f)
    correct = sum(one["label"] == one["predict"] for one in result.values())
    total = len(result)
    return correct, total

for ptn in patterns:
    for matched in base.glob(ptn):
        # Locate samples.json, whether the pattern matched an experiment
        # directory, a samples directory, or the samples.json file itself
        if (matched / "samples").is_dir():
            j_path = matched / "samples" / "samples.json"
        elif (matched / "samples.json").is_file():
            j_path = matched / "samples.json"
        elif matched.name == "samples.json" and matched.is_file():
            j_path = matched
        else:
            print("no samples in", matched.as_posix())
            continue

        experiment = j_path.parent.parent
        print(f"found samples at [{experiment.parent.name} / {experiment.name}]")
        correct, total = count_correct(j_path)
        print(f"Accuracy: {correct} / {total}", correct / total)
ga642381 commented 2 weeks ago

Hi @liggest

After diving into the problem, I found that to achieve a stable result for `SCR_grabo`, we have to set the `patience` argument to a larger number. (ref: line)

This argument controls the patience for early stopping: if the validation loss does not decrease for that many consecutive epochs, the training process stops.

Setting a larger number usually (almost always) leads to better performance. I'm not quite sure why this particular task is so sensitive to the optimization process, but it turns out that for this task, the model requires more training epochs to identify proper soft prompts.
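
For intuition, patience-based early stopping works roughly like the sketch below (a minimal illustration, not the repository's actual training loop; `train_one_epoch`, `validate`, and `max_epochs` are placeholder names):

```python
def train_with_early_stopping(train_one_epoch, validate, max_epochs=200, patience=8):
    """Stop once the validation loss hasn't improved for `patience` consecutive epochs."""
    best_val_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        val_loss = validate(epoch)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0  # improvement: reset the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop after `patience` stagnant epochs
    return best_val_loss
```

With `patience=1`, a single noisy validation epoch is enough to end training, which matches the unstable `SCR_grabo` results above; a larger value simply lets the run continue through those plateaus.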

Thank you for reporting this. I will post my training curve.

ga642381 commented 2 weeks ago

Regarding pGSLM, the reason we didn't open-source it is simply that we didn't find pGSLM to have a significant advantage in the SpeechPrompt framework. Therefore, we transitioned to a more powerful model, Unit mBART, as discussed in SpeechGen (https://arxiv.org/abs/2306.02207), which we believe is a more robust speech LM suitable for various speech processing tasks.

We have already submitted a journal paper to TASLP focusing on prompting Unit mBART for (1) speech classification tasks, (2) sequence generation tasks, and (3) speech generation tasks. We also developed a more effective (and, I believe, more elegant) mechanism to connect the discrete units to the downstream classes, characters, and speech tokens. Once the paper is accepted, we will release the code.

liggest commented 2 weeks ago

Thanks for your kind reply. I'll try to tune the "patience" parameter to see whether I can get a better result.

It's also nice to know that you have a new paper; I'm looking forward to it.

liggest commented 2 weeks ago

With `--patience=8`, I found that training can produce a result close to the one in the paper. For some tasks, the model indeed requires more training epochs.

For `SCR_grabo`:

| patience | accuracy | last epoch |
| --- | --- | --- |
| 1 | 0.1947122004957312 | 58 |
| 2 | 0.6708895621041036 | 74 |
| 4 | 0.8785458551363261 | 93 |
| 8 | 0.92977141283393 | 107 |
| 10 | 0.9509776920958414 | 133 |

For `SCR_grabo_learnable`:

| patience | accuracy | last epoch |
| --- | --- | --- |
| 1 | 0.2186725419994492 | 58 |
| 8 | 0.92977141283393 | 97 |

For `SCR_lt_speech_commands_learnable`:

| patience | accuracy | last epoch |
| --- | --- | --- |
| 1 | 0.8571428571428571 | 57 |
| 8 | 0.9183673469387755 | 70 |