ebu / benchmarkstt

Open source AI benchmarking toolkit for speech-to-text services
MIT License

Define and document diff algorithm #30

Closed EyalLavi closed 5 years ago

EyalLavi commented 5 years ago

The algorithm identifies substitutions, deletions, insertions and correct words. In v1, whole words only.

In the documentation, include pseudo-code and a discussion of the chosen algo.
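As a rough illustration of the kind of pseudo-code that could go into the documentation (the exact alignment step is what the rest of this thread discusses):

```
align reference words and hypothesis words (e.g. via edit distance / LCS)
for each aligned pair (ref_word, hyp_word):
    if both present and equal      -> correct
    if both present and different  -> substitution
    if only ref_word present       -> deletion
    if only hyp_word present       -> insertion
WER = (substitutions + deletions + insertions) / number of reference words
```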

amessina71 commented 5 years ago

I think pointing at a good resource like https://en.wikipedia.org/wiki/Edit_distance could be enough.

MikeSmithEU commented 5 years ago

> I think pointing at a good resource like https://en.wikipedia.org/wiki/Edit_distance could be enough.

It is a two part problem: first generating the "delta" and then using that delta to calculate the WER. I've described how I see documenting WER in #31

The WER calculation itself is clear and unambiguous; however, there is some variance in how the deltas are determined, i.e. in how the LCS problem ( https://en.wikipedia.org/wiki/Longest_common_subsequence_problem ) is solved, and this can affect the WER score.
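For concreteness, here is a minimal sketch of the dynamic-programming string alignment approach (word-level Levenshtein). The function name `word_error_rate` is hypothetical, not part of the project:

```python
def word_error_rate(ref_words, hyp_words):
    """Word-level Levenshtein alignment via dynamic programming.
    Returns (substitutions, deletions, insertions, wer).
    Hypothetical helper for illustration only."""
    R, H = len(ref_words), len(hyp_words)
    # cost[i][j] = minimal edits turning ref_words[:i] into hyp_words[:j]
    cost = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        cost[i][0] = i  # delete everything
    for j in range(1, H + 1):
        cost[0][j] = j  # insert everything
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match / substitute
                             cost[i - 1][j] + 1,        # delete from reference
                             cost[i][j - 1] + 1)        # insert from hypothesis
    # backtrack to count each operation type (diagonal-first tie-breaking)
    s = d = ins = 0
    i, j = R, H
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
                0 if ref_words[i - 1] == hyp_words[j - 1] else 1):
            if ref_words[i - 1] != hyp_words[j - 1]:
                s += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            d += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return s, d, ins, (s + d + ins) / R
```

Note that the total edit count (and hence the WER) is fixed by the minimal edit distance, but the split between substitutions, deletions and insertions depends on how ties are broken during backtracking, which is exactly the variance described above.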

To compare, sclite supports both unix diff and "dynamic programming string alignment", ref. http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm#dynam_prog_0

I propose using the Hunt–McIlroy algorithm ( https://en.wikipedia.org/wiki/Hunt%E2%80%93McIlroy_algorithm ) for now: it is what diff uses and is well established. We might add DP string alignment later, or leave it to the community to implement.

Internal note (not necessarily for v1): other easily implemented metrics for the future:

EyalLavi commented 5 years ago

Python's standard library provides a comparable diff facility in difflib (its SequenceMatcher is based on a Ratcliff–Obershelp-style matching algorithm rather than Hunt–McIlroy). Link to https://docs.python.org/3/library/difflib.html
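For illustration, a sketch of how difflib's opcodes could be mapped onto the word-level classifications from the top of this issue (`classify` is a hypothetical helper, not part of benchmarkstt):

```python
import difflib

def classify(ref_words, hyp_words):
    """Classify words as correct / substitution / deletion / insertion
    using difflib's opcodes. Hypothetical helper for illustration only."""
    ops = []
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            ops.extend(('correct', w, w) for w in ref_words[i1:i2])
        elif tag == 'replace':
            # pair up replaced words; unequal lengths spill into del/ins
            for k in range(max(i2 - i1, j2 - j1)):
                r = ref_words[i1 + k] if i1 + k < i2 else None
                h = hyp_words[j1 + k] if j1 + k < j2 else None
                if r is not None and h is not None:
                    ops.append(('substitution', r, h))
                elif r is not None:
                    ops.append(('deletion', r, None))
                else:
                    ops.append(('insertion', None, h))
        elif tag == 'delete':
            ops.extend(('deletion', w, None) for w in ref_words[i1:i2])
        elif tag == 'insert':
            ops.extend(('insertion', None, w) for w in hyp_words[j1:j2])
    return ops
```

For example, `classify("the quick brown fox".split(), "the quack fox".split())` marks "the" and "fox" correct, "quick" → "quack" a substitution, and "brown" a deletion.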