Hi @liggest
After diving into the problem, I found that to achieve a stable result for `SCR_grabo`, we have to set the `patience` argument to a larger number. (ref: line)
This argument controls early stopping: if the validation loss does not decrease for that many consecutive epochs, the training process stops.
Setting a larger number usually (almost always) leads to better performance. I'm not quite sure why this particular task is so sensitive to the optimization process, but it turns out that for this task the model requires more training epochs to identify proper soft prompts.
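For intuition, here is a minimal sketch of how patience-based early stopping typically works. This is illustrative only, not the repo's actual training loop; the helpers and values below are made up:

```python
import random

def train_one_epoch():
    pass  # stand-in for one epoch of prompt training

def validate() -> float:
    return random.random()  # stand-in for computing validation loss

patience = 8  # e.g. --patience=8
max_epochs = 200
best_loss = float("inf")
epochs_without_improvement = 0

for epoch in range(max_epochs):
    train_one_epoch()
    val_loss = validate()

    if val_loss < best_loss:
        best_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1

    # With a small patience, a noisy validation loss can trigger a premature stop.
    if epochs_without_improvement >= patience:
        print(f"early stop at epoch {epoch}")
        break
```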
Thank you for reporting this. I will post my training curve.
Regarding pGSLM, the reason we didn't open-source it is simply that we didn't find pGSLM to have a significant advantage in the SpeechPrompt framework. Therefore, we transitioned to a more powerful model: Unit mBART, as discussed in SpeechGen (https://arxiv.org/abs/2306.02207), which we believe is a more robust speech LM suitable for various speech processing tasks.
We have already submitted a journal paper to TASLP focusing on prompting Unit mBART for (1) speech classification tasks, (2) sequence generation tasks, and (3) speech generation tasks. We also developed a more effective (and, I believe, more elegant) mechanism to connect the discrete units to the downstream classes, characters, and speech tokens. Once the paper is accepted, we will release the code.
Thanks for your kind reply. I'll try tuning the `patience` parameter to see whether I can get a better result.
It's also nice to know that you have a new paper; I'm looking forward to it.
With `--patience=8`, I found that it can produce a result close to the one reported in the paper. For some tasks, the model indeed requires more training epochs.
For `SCR_grabo`:

| patience | accuracy | last epoch |
|---|---|---|
| 1 | 0.1947122004957312 | 58 |
| 2 | 0.6708895621041036 | 74 |
| 4 | 0.8785458551363261 | 93 |
| 8 | 0.92977141283393 | 107 |
| 10 | 0.9509776920958414 | 133 |

For `SCR_grabo_learnable`:

| patience | accuracy | last epoch |
|---|---|---|
| 1 | 0.2186725419994492 | 58 |
| 8 | 0.92977141283393 | 97 |

For `SCR_lt_speech_commands_learnable`:

| patience | accuracy | last epoch |
|---|---|---|
| 1 | 0.8571428571428571 | 57 |
| 8 | 0.9183673469387755 | 70 |
Hello, I've tested the datasets in the SCR downstream task, as it's the first task mentioned in the result table of the paper. Here are some of my results.
Basically I followed the README and only changed `--downstream` in each run, except for `SCR_grabo_learnable`, for which I also applied `--method learnable`. I also grabbed some files from the SpeechPrompt v1 repository to make the code runnable on my machine.

The accuracy in my results is more or less off from the paper, especially for `SCR_grabo`. I didn't get a stable result on that dataset, as there's a gap between the results of two runs. Also, neither the `freq` method nor the `learnable` method reaches competitive performance with the results in the paper (`freq` $0.924$ and `learnable` $0.927$).

I'd like to know why I haven't gotten the expected results; am I missing something? In addition, the paper also mentions pGSLM, but I couldn't find related code in the repository. Could you tell me what to do to test the results related to that model? Thank you.
Below is the script I used for calculating the accuracy of runs. I assigned a proper path to the `base` variable and just typed `python accuracy.py` to use it.

```python
import json
import sys
from pathlib import Path

base = Path("PATH/TO/GSLM/exp_results")
# Glob patterns can be passed as command-line arguments; by default, look for
# "samples" directories two levels below base.
patterns = sys.argv[1:] or ["*/*/samples"]

def count_correct(path: Path):
    """Count correct predictions in a samples.json file."""
    with path.open("r", encoding="utf-8") as f:
        result: dict = json.load(f)
    correct = sum(one["label"] == one["predict"] for one in result.values())
    total = len(result)
    return correct, total

for ptn in patterns:
    for matched in base.glob(ptn):
        # Locate samples.json under the matched path.
        if (matched / "samples").is_dir():
            j_path = matched / "samples" / "samples.json"
        elif (matched / "samples.json").is_file():
            j_path = matched / "samples.json"
        elif matched.name == "samples.json" and matched.is_file():
            j_path = matched
        else:
            print("no samples in", matched.as_posix())
            continue
        correct, total = count_correct(j_path)
        print(matched.as_posix(), f"accuracy: {correct / total:.4f} ({correct}/{total})")
```
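To check a specific run, a glob pattern (relative to `base`) can also be passed on the command line; the pattern below is just an illustration, since the actual layout under `exp_results` may differ on your machine:

```
python accuracy.py "SCR_grabo*/*/samples"
```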