CuiMingyu opened this issue 2 years ago
Hi, does GigaSpeech provide a GLM file, like SWBD's en20000405_hub5.glm, containing the transcript filtering rules? I notice there are some rules in the gigaspeech_scoring.py file, but is there a GLM file covering all the rules? Thanks a lot!

An example of a SWBD GLM:
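For illustration, entries in a SWBD-style GLM look roughly like this (paraphrased from the NIST hub5 GLM conventions, not copied verbatim from the actual file):

```
;; SWBD-style GLM: LHS => RHS / left-context __ right-context
GONNA => GOING TO / [ ] __ [ ]
WANNA => WANT TO / [ ] __ [ ]
%UH => %HESITATION / [ ] __ [ ]
```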
The short answer is: YES and NO.

Actually, this is a pretty good question, so I'm going to keep this thread open for documentation purposes. Here is the long answer:
On the NO side:
The reason we don't provide a GLM with GigaSpeech is that we don't want to complicate the evaluation process with overly complex sub-systems (such as text normalization and context-dependent language rewriting), so that downstream research toolkits can integrate and adopt GigaSpeech with minimal friction.
And as you mentioned, we do provide a very simple script containing our recommended text post-processing (see the discussion in https://github.com/SpeechColab/GigaSpeech/issues/24), and it should provide a reliable apples-to-apples basis for academic comparisons.
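As a rough sketch of what that post-processing involves: the tag names below follow GigaSpeech's transcript conventions, but the exact tag set and these helper functions are illustrative, not gigaspeech_scoring.py itself.

```python
# Rough sketch of GigaSpeech-style text post-processing before scoring.
# Tag names follow GigaSpeech transcript conventions; the exact set and
# these helper functions are illustrative, not the actual script.

PUNCT_TAGS = {"<COMMA>", "<PERIOD>", "<QUESTIONMARK>", "<EXCLAMATIONPOINT>"}
GARBAGE_TAGS = {"<SIL>", "<MUSIC>", "<NOISE>", "<OTHER>"}

def post_process(text: str) -> str:
    """Drop punctuation tags and normalize whitespace."""
    return " ".join(t for t in text.split() if t not in PUNCT_TAGS)

def is_garbage(text: str) -> bool:
    """True if the utterance consists only of garbage tags (skip it)."""
    tokens = text.split()
    return bool(tokens) and all(t in GARBAGE_TAGS for t in tokens)

print(post_process("I SEE <COMMA> THANKS <PERIOD>"))  # -> I SEE THANKS
print(is_garbage("<NOISE> <SIL>"))                    # -> True
```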
On the YES side:
To take ASR benchmarking more seriously, closer to real-life ASR scenarios, we developed a universal benchmarking platform that contains modules such as text normalization (TN) and GLM rewriting.
They are in our Leaderboard project repo, where you can find a GLM file that already contains hundreds of rewriting rules for English in general, not limited to GigaSpeech. You are welcome to help us improve it; it's an asset for the entire speech community.
Here is a glance at dummy outputs from the scoring tool: the raw hypothesis form WE ARE is transformed to WE'RE, as the result of a GLM rule WE'RE <-> WE ARE, to match the reference on-the-fly. We even tag these alternative expansions with # and pretty-align them, so that error analysis becomes crystal clear.
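To make the mechanism concrete, here is a minimal sketch of on-the-fly GLM rewriting during scoring. This is not the Leaderboard implementation; the rule table, the # tagging convention as shown, and the function name are illustrative assumptions based on the description above.

```python
# Minimal sketch of on-the-fly GLM rewriting (illustrative, not the
# actual Leaderboard code). Each rule maps a hypothesis phrase to the
# form used in the reference, e.g. "WE ARE" <-> "WE'RE". Rules are
# bidirectional in principle; only the hyp -> ref direction is shown.
GLM_RULES = {
    "WE ARE": "WE'RE",
    "GOING TO": "GONNA",
}

def apply_glm(hyp: str, ref: str) -> str:
    """Rewrite phrases in `hyp` only when the rewritten form actually
    appears in `ref`, tagging each rewrite with '#' for error analysis."""
    out = hyp
    for src, dst in GLM_RULES.items():
        if src in out and dst in ref:
            # Tag the alternative expansion so it stands out in alignments.
            out = out.replace(src, f"#{dst}#")
    return out

ref = "WE'RE ON THE SAME PAGE"
hyp = "WE ARE ON THE SAME PAGE"
print(apply_glm(hyp, ref))  # -> "#WE'RE# ON THE SAME PAGE"
```

A real implementation would match on word boundaries and score both the rewritten and raw forms, keeping whichever alignment yields fewer errors.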