arlofaria-zoom closed this issue 4 months ago
So it's a little more complicated to get deterministic features than just turning off dither: another source of variability is the lossy compression applied to feature archives, which I had to disable along with dither for the deterministic feature tests here: https://github.com/mmcauliffe/kalpy/blob/main/tests/test_mfcc.py#L122.
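To make the dither effect concrete, here's a minimal sketch in plain NumPy (a hypothetical stand-in, not kalpy's or Kaldi's actual implementation) of why two feature-extraction runs differ unless dithering is disabled:

```python
import numpy as np

def add_dither(signal, dither=1.0, rng=None):
    # Simplified Kaldi-style dithering: add scaled random noise to the
    # waveform before feature extraction (illustrative stand-in only)
    rng = np.random.default_rng() if rng is None else rng
    return signal + dither * rng.standard_normal(len(signal))

signal = np.zeros(16)
# With dither enabled, two runs produce different outputs
assert not np.array_equal(add_dither(signal), add_dither(signal))
# With dither disabled, runs are bitwise identical
assert np.array_equal(add_dither(signal, dither=0.0),
                      add_dither(signal, dither=0.0))
```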
With that said, it'd be helpful for me to understand why this is necessary in the first place, what's the use case that requires deterministic features? In general the dithering/compression shouldn't affect alignments, or at least I didn't observe a big effect for the kalpy testing.
Thanks for the reply!
> In general the dithering/compression shouldn't affect alignments, or at least I didn't observe a big effect for the kalpy testing.
The dithering absolutely has an effect, which can be verified objectively. Whether it’s considered “big” is subjective.
I don’t see why the lossy compression should matter, though, so long as it’s deterministic. What randomness is involved in that?
I agree: achieving determinism is complicated, and disabling dithering isn’t even the right approach. (A better solution is to tweak Kaldi so that the RNG can be re-seeded.)
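The re-seeding idea can be illustrated in a few lines (again a NumPy stand-in, assuming nothing about Kaldi's actual RNG internals):

```python
import numpy as np

def dithered_features(signal, seed):
    # If the RNG is re-seeded per run (or per utterance), dithering
    # stays enabled but its output becomes reproducible
    rng = np.random.default_rng(seed)
    return signal + rng.standard_normal(len(signal))

signal = np.zeros(16)
# Same seed -> bitwise-identical dithered output across runs
assert np.array_equal(dithered_features(signal, seed=42),
                      dithered_features(signal, seed=42))
```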
There are other sources of non-determinism to consider, such as the math library. For example, you may need to set `MKL_CBWR=COMPATIBLE` to get true bitwise reproducibility on a given CPU architecture.
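For example (assuming an MKL-backed build), the variable has to be exported before the process that loads the math library starts:

```shell
# Ask MKL for run-to-run reproducible code paths; this must be in the
# environment before the math library initializes in the child process.
export MKL_CBWR=COMPATIBLE
echo "MKL_CBWR=$MKL_CBWR"   # verify before launching the experiment
```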
That said, there are many use cases that would benefit from determinism: probably the most basic of these is “science”. It’s a general aim for experiments to have reproducible results. It can be really frustrating when running the same experiment twice gives different results, which sometimes lead to different conclusions. Another use case is continuous integration testing pipelines: you might compare a system output against an expected result and fail the CI check if there are any differences.
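The CI use case boils down to a bitwise comparison against a committed reference output; a minimal sketch (hypothetical alignment contents, not any project's actual CI script):

```python
import hashlib

def outputs_match(expected: bytes, actual: bytes) -> bool:
    # Bitwise comparison: any nondeterminism (dither, lossy compression,
    # math-library variation) makes this fail
    return hashlib.sha256(expected).digest() == hashlib.sha256(actual).digest()

reference = b"utt1 1 1 2 3\n"                 # hypothetical expected alignment
assert outputs_match(reference, b"utt1 1 1 2 3\n")
assert not outputs_match(reference, b"utt1 1 1 2 4\n")  # would fail the CI check
```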
Much more specifically: the feature-level dithering was causing differences in alignments that affected WER scoring with NIST's SCTK software for scoring ASR systems. The reason is that SCTK is quite strict about requiring hypothesis words to be timestamped within the time intervals of the reference's utterance-level segmentation. Disabling the dither has now resulted in deterministically reproducible results. This has in turn given a team of software developers and researchers peace of mind that they are testing the same systems and getting expected results.
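A simplified illustration of that scoring constraint (not SCTK's actual logic): a hypothesis word only counts if its interval falls inside the reference segment, so a dither-induced shift of a few frames can change the score:

```python
def word_in_segment(word_start, word_end, seg_start, seg_end):
    # Simplified stand-in for SCTK's time-constrained scoring: the
    # hypothesis word's interval must lie within the reference segment
    return seg_start <= word_start and word_end <= seg_end

# Word ending at 1.48s inside a segment ending at 1.50s: scored
assert word_in_segment(1.20, 1.48, 0.0, 1.50)
# A ~50ms shift in the alignment pushes it past the boundary: not scored
assert not word_in_segment(1.25, 1.53, 0.0, 1.50)
```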
Hope that makes sense! :)
Fixed in #761
Many thanks for the fix, @mmcauliffe !!! 👍
This isn't the ideal solution, as it accesses a private attribute (`._meta`), but it's a quick workaround.