ReadAlongs / SoundSwallower

An even smaller speech recognizer / force aligner

Front-End is quite bad #48

Open dhdaines opened 1 year ago

dhdaines commented 1 year ago

First of all, because it has a couple of layers of historical cruft around it: you pass a config_t, which gets "parsed" into a separate parameters structure, which in turn gets "parsed" into the actual parameters.

But more importantly, because everything is driven from this top-level configuration, so once the front-end is configured, it will only give you one particular kind of features. In particular, this makes it impossible to do visualization (using the power spectrum, mel spectrum, or smoothed spectrum) and recognition (using MFCCs) at the same time.
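To illustrate the alternative, here is a rough sketch (not SoundSwallower's code) of a front-end built as separate stages, where one pass over a frame returns the power spectrum, the mel spectrum, and the MFCCs together. Every name here (`power_spectrum`, `mel_filterbank`, `analyze`, and so on) is hypothetical, the DFT is a naive O(n²) one for brevity, and the mel formula and unnormalized DCT-II are common textbook choices assumed for the example, not necessarily what the existing front-end uses.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive O(n^2) DFT power spectrum of one real-valued frame."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for lo, mid, hi in zip(edges, edges[1:], edges[2:]):
        # Rising slope up to the center, falling slope after, zero elsewhere.
        bank.append([
            max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
            for f in bin_hz
        ])
    return bank


def mel_spectrum(power, bank):
    return [sum(w * p for w, p in zip(filt, power)) for filt in bank]


def mfcc(mel, n_ceps):
    """Unnormalized DCT-II of the log mel spectrum."""
    n = len(mel)
    logmel = [math.log(max(m, 1e-10)) for m in mel]  # floor avoids log(0)
    return [
        sum(v * math.cos(math.pi * c * (i + 0.5) / n)
            for i, v in enumerate(logmel))
        for c in range(n_ceps)
    ]


def analyze(frame, bank, n_ceps=13):
    """Return every stage so a visualizer and a recognizer share one pass."""
    power = power_spectrum(frame)
    mel = mel_spectrum(power, bank)
    return {"power": power, "mel": mel, "mfcc": mfcc(mel, n_ceps)}
```

Since each stage is a plain function of the previous stage's output, a caller that only wants the spectrum for display can stop early, and nothing depends on a global configuration deciding up front which single feature type comes out.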

Also, there are something like 5 kinds of VTLN (vocal tract length normalization), none of which we use, and probably none of which we will ever use.

And finally because the code which does the buffering and windowing is just utter trash (and I should know because I wrote it).
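For comparison, the buffering and windowing job itself is small. Here is a minimal sketch, assuming a Hamming window and a hypothetical `FrameBuffer` class (neither taken from the existing code), that accepts chunks of any size and emits overlapping windowed frames as soon as they are complete:

```python
import math


class FrameBuffer:
    """Accumulate samples in arbitrary chunks, emit overlapping windowed frames."""

    def __init__(self, frame_size, frame_shift):
        # frame_size must be at least 2 for the window formula below.
        self.frame_size = frame_size
        self.frame_shift = frame_shift
        self.pending = []  # samples not yet fully consumed
        # Hamming window, computed once and reused for every frame.
        self.window = [
            0.54 - 0.46 * math.cos(2.0 * math.pi * i / (frame_size - 1))
            for i in range(frame_size)
        ]

    def push(self, samples):
        """Add samples and return the list of frames that are now complete."""
        self.pending.extend(samples)
        frames = []
        while len(self.pending) >= self.frame_size:
            # zip stops at the window length, so exactly frame_size samples are used.
            frames.append([s * w for s, w in zip(self.pending, self.window)])
            # Drop only the shift, keeping the overlap for the next frame.
            del self.pending[:self.frame_shift]
        return frames
```

The property worth having here is that the frames do not depend on how the input was chunked: pushing 1000 samples at once or 37 at a time should give identical output.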