ebu / benchmarkstt

Open source AI benchmarking toolkit for speech-to-text services
MIT License

Define and document diff algorithm #30

Closed EyalLavi closed 5 years ago

EyalLavi commented 5 years ago

The algorithm identifies substitutions, deletions, insertions and correct words. In v1, whole words only.

In the documentation, include pseudo-code and a discussion of the chosen algo.
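As a rough illustration of the kind of pseudo-code that could go into the documentation (the exact alignment step is what the rest of this thread discusses):

```
align reference words and hypothesis words (e.g. via edit distance / LCS)
for each aligned pair (ref_word, hyp_word):
    if both present and equal      -> correct
    if both present and different  -> substitution
    if only ref_word present       -> deletion
    if only hyp_word present       -> insertion
WER = (substitutions + deletions + insertions) / number of reference words
```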

amessina71 commented 5 years ago

I think pointing at a good resource like https://en.wikipedia.org/wiki/Edit_distance could be enough.

MikeSmithEU commented 5 years ago

> I think pointing at a good resource like https://en.wikipedia.org/wiki/Edit_distance could be enough.

It is a two part problem: first generating the "delta" and then using that delta to calculate the WER. I've described how I see documenting WER in #31

The WER calculation itself is clear and unambiguous; however, there is some variance in how the deltas are determined, i.e. in how the LCS problem ( https://en.wikipedia.org/wiki/Longest_common_subsequence_problem ) is solved, and this can affect the WER score.
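For concreteness, here is a minimal sketch of the dynamic-programming string alignment approach (word-level Levenshtein). The function name `word_error_rate` is hypothetical, not part of the project:

```python
def word_error_rate(ref_words, hyp_words):
    """Word-level Levenshtein alignment via dynamic programming.
    Returns (substitutions, deletions, insertions, wer).
    Hypothetical helper for illustration only."""
    R, H = len(ref_words), len(hyp_words)
    # cost[i][j] = minimal edits turning ref_words[:i] into hyp_words[:j]
    cost = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        cost[i][0] = i  # delete everything
    for j in range(1, H + 1):
        cost[0][j] = j  # insert everything
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match / substitute
                             cost[i - 1][j] + 1,        # delete from reference
                             cost[i][j - 1] + 1)        # insert from hypothesis
    # backtrack to count each operation type (diagonal-first tie-breaking)
    s = d = ins = 0
    i, j = R, H
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
                0 if ref_words[i - 1] == hyp_words[j - 1] else 1):
            if ref_words[i - 1] != hyp_words[j - 1]:
                s += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            d += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return s, d, ins, (s + d + ins) / R
```

Note that the total edit count (and hence the WER) is fixed by the minimal edit distance, but the split between substitutions, deletions and insertions depends on how ties are broken during backtracking, which is exactly the variance described above.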

To compare, sclite supports both unix diff and "dynamic programming string alignment", ref. http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm#dynam_prog_0

I propose using the Hunt–McIlroy algorithm ( https://en.wikipedia.org/wiki/Hunt%E2%80%93McIlroy_algorithm ) for now: it is what diff uses and is well established. We might add DP string alignment later, or leave it to the community to implement.

Internal note (not necessarily for v1): other easily implemented metrics for the future:

EyalLavi commented 5 years ago

Python's standard library provides a comparable diff facility in difflib (its SequenceMatcher is based on a Ratcliff–Obershelp-style matching algorithm rather than Hunt–McIlroy). Link to https://docs.python.org/3/library/difflib.html
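For illustration, a sketch of how difflib's opcodes could be mapped onto the word-level classifications from the top of this issue (`classify` is a hypothetical helper, not part of benchmarkstt):

```python
import difflib

def classify(ref_words, hyp_words):
    """Classify words as correct / substitution / deletion / insertion
    using difflib's opcodes. Hypothetical helper for illustration only."""
    ops = []
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            ops.extend(('correct', w, w) for w in ref_words[i1:i2])
        elif tag == 'replace':
            # pair up replaced words; unequal lengths spill into del/ins
            for k in range(max(i2 - i1, j2 - j1)):
                r = ref_words[i1 + k] if i1 + k < i2 else None
                h = hyp_words[j1 + k] if j1 + k < j2 else None
                if r is not None and h is not None:
                    ops.append(('substitution', r, h))
                elif r is not None:
                    ops.append(('deletion', r, None))
                else:
                    ops.append(('insertion', None, h))
        elif tag == 'delete':
            ops.extend(('deletion', w, None) for w in ref_words[i1:i2])
        elif tag == 'insert':
            ops.extend(('insertion', None, w) for w in hyp_words[j1:j2])
    return ops
```

For example, `classify("the quick brown fox".split(), "the quack fox".split())` marks "the" and "fox" correct, "quick" → "quack" a substitution, and "brown" a deletion.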