RoboTutorLLC / RoboTutor_2019

Main code for RoboTutor. Uploaded 11/20/2018 to XPRIZE from RoboTutorLLC/RoboTutor.

1.8.9.1 check Sphinx models and parameter values #300

Open JackMostow opened 6 years ago

JackMostow commented 6 years ago

RoboTutor is using standard semi-continuous US adult models, not continuous models trained on kids' oral reading. And it may be using a bad set of parameter values. I need to find them in the code.

octavpo commented 6 years ago

I was wondering about those parameters too. They are in the setupRecognizer method in /Projects/RoboTutor/comp_listener/src/main/java/edu/cmu/xprize/listener/ListenerBase.java. But we'd need documentation on them; I don't know what they mean.

And it certainly would be great if we could train it on actual Swahili-speaking kids, if we have the data and know how to train it. Maybe we can get help from the authors (or maybe you already know how to do it).

I was also wondering if there's a newer version we could upgrade to. I found these repositories, but I don't know whether any of them contains something better than what we have: https://github.com/cmusphinx/pocketsphinx and https://github.com/cmusphinx/pocketsphinx-android

octavpo commented 6 years ago

One other thing I was wondering about is which problem we're trying to solve: false positives or false negatives. It seemed to me Judith was mainly complaining about false negatives. She didn't say so explicitly, but in previous testing I witnessed, she and Kevin seemed to have trouble getting their Swahili words recognized. I don't know whether we also have observations showing African kids having trouble getting their words recognized. In my own testing I didn't have many problems with that; only very rarely did I have to repeat a word. I had more issues with false positives, like when I coughed and the whole sentence was recognized.

I can think of two possible explanations for the difference between my testing and Judith/Kevin's testing (maybe there are better ones), but they would suggest different solutions:

The first explanation would suggest we do nothing unless we have evidence of the issue from African kids. The second would suggest that the noise-suppression idea would have a better effect, if we can figure out how to use it.
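
For reference, pocketsphinx already exposes its noise-handling knobs through the same SpeechRecognizerSetup chain used in ListenerBase (quoted later in this thread). Below is a minimal sketch of where noise suppression would be configured, assuming the pocketsphinx-android API and the model paths from the listing below; the -vad_threshold value is illustrative and its availability depends on the bundled pocketsphinx version.

    import java.io.File;
    import java.io.IOException;

    import edu.cmu.pocketsphinx.SpeechRecognizer;
    import edu.cmu.pocketsphinx.SpeechRecognizerSetup;

    public class NoiseSuppressionSketch {

        // Builds a recognizer with the noise-related front-end flags made explicit.
        // modelsDir and langDictionary play the same roles as in ListenerBase.setupRecognizer.
        public static SpeechRecognizer buildRecognizer(File modelsDir, String langDictionary)
                throws IOException {
            return SpeechRecognizerSetup.defaultSetup()
                    .setDictionary(new File(modelsDir, "lm/" + langDictionary))
                    .setAcousticModel(new File(modelsDir, "hmm/en-con-ind"))
                    .setBoolean("-remove_noise", true)    // spectral noise removal in the front end
                    .setBoolean("-remove_silence", true)  // drop non-speech frames before decoding
                    .setFloat("-vad_threshold", 3.0f)     // stricter voice-activity detection;
                                                          // illustrative and version-dependent
                    .getRecognizer();
        }
    }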

JackMostow commented 6 years ago

Thanks for locating it. Notes for future:

  1. Let's find out more by changing this flag in public class ListenerBase: protected boolean IS_LOGGING = false;

  2. Not for right now, but we can gain flexibility by decoupling text words from recognized words, i.e. dropping the implicit constraint that the words displayed on the screen are identical to the words passed to the Listener. For example, this capability would be useful when we want to display a numeral, e.g. "5", and listen for a word, e.g. FIVE, or display a letter, e.g. "Ch", and listen for its name, e.g. CHE. A stronger version would drop the requirement of a 1-1 mapping between text words and "listener" words, e.g. display a multi-digit number ("25") and listen for a multi-word phrase (TWENTY FIVE). See "Currently assuming hyphenated expressions split into two Asr words" in comp_reading/src/main/java/cmu/xprize/rt_component/CRt_ViewManagerASB.java. (A sketch of one possible decoupling appears at the end of this comment.)

  3. I finally found the parameter values -- at least the defaults -- in RoboTutor/comp_listener/src/main/java/edu/cmu/xprize/listener/ListenerBase.java, in protected void setupRecognizer(File assetsDir, File configFile, String langDictionary).

  4. I noticed this option to keep the audio input for each utterance -- useful for evaluating ASR accuracy:

            // this automatically logs raw audio to the specified directory:
            .setRawLogDir(assetsDir)

  5. I didn't find any calls to SpeechRecognizerSetup with a non-default configFile. Can you? (An illustrative config-file sketch appears at the end of this comment.)

            // if caller specified a configFile, take parameters from that.
            // In this config file must specify all non-default pocketsphinx parameters
            if (configFile != null) {
                recognizer = SpeechRecognizerSetup.setupFromFile(configFile).getRecognizer();

  6. These default values actually mean something to me. I'll check them against values I've used in off-line experiments.

            switch(acousticModel) {
                case LCONST.KIDS:

                    // create pocketsphinx SpeechRecognizer using the SpeechRecognizerSetup factory method
                    recognizer = SpeechRecognizerSetup.defaultSetup()
                            // our pronunciation dictionary
                            .setDictionary(new File(modelsDir, "lm/" + langDictionary))

                            // our acoustic model
                            .setAcousticModel(new File(modelsDir, "hmm/en-con-ind"))

                            // this automatically logs raw audio to the specified directory:
                            .setRawLogDir(assetsDir)
                            .setBoolean("-verbose", true)           // maximum log output

                            .setFloat("-samprate", 16000f)
                            .setInteger("-nfft", 512)
                            .setInteger("-frate", 100)
                            .setFloat("-lowerf", 50f)
                            .setFloat("-upperf", 6800f)
                            .setBoolean("-dither", true)
                            .setInteger("-nfilt", 40)
                            .setInteger("-ncep", 13)

                            .setString("-agc", "none")
                            .setFloat("-ascale", 1f)                // 20 in default
                            .setBoolean("-backtrace", true)         // no in default
                            .setDouble("-beam", 1e-80)              // 1e-48 in default
                            .setBoolean("-bestpath", false)         // yes in default

                            //.setString("-cmn", "current")
                            .setString("-cmn", "prior")
                            .setBoolean("-compallsen", false)
                            .setBoolean("-dictcase", false)
                            .setFloat("-fillprob", 1e-2f)           // 1e-8 in default
                            .setBoolean("-fwdflat", false)          // yes in default
                            .setInteger("-latsize", 5000)
                            .setFloat("-lpbeam", 1e-5f)             // 1e-40 in default
                            .setDouble("-lponlybeam", 7e-29)
                            .setFloat("-lw", 10f)                   // 6.5 in default
                            .setInteger("-maxhmmpf", 1500)          // 10000 in default
                            //.setInteger("-maxnewoov", 5000)       // 20 in default
                            .setDouble("-pbeam", 1e-80)             // 1e-48 in default
                            .setFloat("-pip", 1f)
                            .setBoolean("-remove_noise", true)      // yes in default
                            .setBoolean("-remove_silence", true)    // yes in default
                            .setFloat("-silprob", 1f)               // 0.005 in default
                            .setInteger("-topn",  4)
                            .setDouble("-wbeam", 1e-60)             // 7e-29 in default
                            .setFloat("-wip",  1f)                  // 0.65 in default
JackMostow commented 6 years ago

See ASR