Special normalisation rules

ebu / benchmarkstt

Open Source AI Benchmarking toolkit for benchmarking speech to text services

MIT License


Special normalisation rules #37

Open amessina71 opened 5 years ago

amessina71 commented 5 years ago

How to deal with:

1) numbers
2) acronyms / symbols
3) website / email spellings (e.g. use "dot", "at")

I would try, as far as possible, to use letter-based normalised representations for all of the above, such as:

1) 100 -> one hundred (cento, cent)
2) Hz -> hertz, WHO -> double u aitch o (less sure about this one ...)
3) www.rai.it -> vu vu vu punto rai punto it, pippo@pluto.com -> pippo at pluto dot com

Of course this would only be for the sake of comparison; no one would want such transcripts as a final product. We don't even need to output the normalised text, except in a debug session.
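A minimal sketch of what such a letter-based normalisation could look like. The lookup tables and the `spell_out` name are hypothetical, the rules cover only the examples above, and the English reading ("dot", "at") is assumed; an Italian rule set would use "punto" and so on.

```python
import re

# Hypothetical lookup tables, for illustration only; a real rule set would be
# language-specific and far larger.
NUMBERS = {"100": "one hundred"}
SYMBOLS = {"Hz": "hertz"}


def spell_out(text: str) -> str:
    """Rewrite numbers, unit symbols and addresses as plain words."""
    words = []
    for token in text.split():
        if token in NUMBERS:
            words.append(NUMBERS[token])
        elif token in SYMBOLS:
            words.append(SYMBOLS[token])
        elif "@" in token or re.fullmatch(r"[\w-]+(\.[\w-]+)+", token):
            # Email or web address: read the separators aloud.
            words.append(token.replace("@", " at ").replace(".", " dot "))
        else:
            words.append(token)
    return " ".join(words)


print(spell_out("pippo@pluto.com"))  # pippo at pluto dot com
print(spell_out("a 100 Hz tone"))    # a one hundred hertz tone
```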

MikeSmithEU commented 5 years ago

I would do the opposite actually in all cases, i.e. going from complexity to simplicity.

1) one hundred -> 100
2) hertz -> Hz, double u aitch o -> WHO (not sure any STT actually outputs this)
3) vu vu vu punto ... (same as 2, not sure any STT actually outputs this)

This matters especially for "vu vu vu punto rai punto it": if it is transcribed wrongly, e.g. as "vuvuvupunto raai punto it", it would impact WER heavily because 5 "words" are wrong, while IMHO this should count as only one "word" and a one-point "penalty" on the WER score...
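To make the counting concrete, here is a small sketch using a plain word-level edit distance (not the toolkit's own implementation). The compact mis-transcription "www.raai.it" is an invented example for comparison.

```python
def word_errors(reference: list[str], hypothesis: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


# Spelled-out form: one misheard address costs five word errors.
spelled_out = word_errors("vu vu vu punto rai punto it".split(),
                          "vuvuvupunto raai punto it".split())
# Compact form: the same kind of mistake costs a single word error.
compact = word_errors(["www.rai.it"], ["www.raai.it"])
print(spelled_out, compact)  # 5 1
```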

EyalLavi commented 5 years ago

I'm concerned that specific normalisation discussions can send us down a very deep rabbit hole. Like @amessina71 says, the thing to remember is that we are not interested in absolute WER (compared to the reference) but in relative WER (compared to other vendors). So as long as we apply the normalisation consistently, it's not so important how we normalise. My preference would be to have a very small number of core normalisation rules that each user can add to. We also don't have to reinvent the wheel. There may be something we can use from these sources: https://www.kaggle.com/headsortails/watch-your-language-update-feature-engineering, https://github.com/google/sparrowhawk
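One possible shape for "a small core that users can add to", as a sketch only. The rule names, the choice of core rules and the `normalise` signature are all assumptions, not an existing API.

```python
import re
from typing import Callable, Sequence

Rule = Callable[[str], str]


# Hypothetical core rules; which rules belong in the core is an open question.
def lowercase(text: str) -> str:
    return text.lower()


def strip_punctuation(text: str) -> str:
    # Drop everything except word characters and whitespace, then squeeze spaces.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text)).strip()


CORE_RULES: list[Rule] = [lowercase, strip_punctuation]


def normalise(text: str, extra_rules: Sequence[Rule] = ()) -> str:
    """Apply the core rules first, then any user-supplied rules, in order."""
    for rule in [*CORE_RULES, *extra_rules]:
        text = rule(text)
    return text


# A user-added rule; it runs after lowercasing, so it targets lowercase text.
def hertz_to_symbol(text: str) -> str:
    return re.sub(r"\bhertz\b", "hz", text)


print(normalise("A 50 Hertz hum!", [hertz_to_symbol]))  # a 50 hz hum
```

The point is only that the same rule list is applied to every vendor's transcript, so the relative comparison stays fair whatever the rules are.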

amessina71 commented 5 years ago

@MikeSmithEU I actually meant the opposite too. Engines would certainly output "www.rai.it" or "100". The problem is that their outputs may differ slightly from one another: one may output "www.rai.it" and another "www dot rai dot it"; likewise, one may output "100" and another "one hundred". How do we compare them? With a common denominator for these kinds of normalisations, engine 1 saying "There were 100 members attending" and engine 2 saying "There were one hundred members attending" would be considered equivalent.
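As a rough illustration of the common-denominator idea, the sketch below maps both variants to one compact form before comparing. The function name and the two rules are hypothetical and handle only the examples from this thread.

```python
import re


def to_common_form(text: str) -> str:
    """Collapse spelled-out variants to one compact form (illustrative rules only)."""
    text = re.sub(r"\bone hundred\b", "100", text)
    # "www dot rai dot it" -> "www.rai.it"; only triggered by a leading "www".
    return re.sub(
        r"\bwww(?: dot \w+)+",
        lambda match: match.group(0).replace(" dot ", "."),
        text,
    )


engine_1 = "There were 100 members attending, see www.rai.it"
engine_2 = "There were one hundred members attending, see www dot rai dot it"

print(to_common_form(engine_1) == to_common_form(engine_2))  # True
```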