`fe_process_frames_ext` can discard speech data?

wutiantong commented 7 years ago

__fe_interface.c__ Line 490-498

    /* Try to read from prespeech buffer */
    if (fe->vad_data->in_speech && fe_prespch_ncep(fe->vad_data->prespch_buf) > 0) {
        outidx = fe_copy_from_prespch(fe, inout_nframes, buf_cep, outidx);
        if ((*inout_nframes) < 1) {
            /* mfcc buffer is filled from prespeech buffer */
            *inout_nframes = outidx;
            return 0;
        }
    }

If `*inout_nframes` is smaller than the number of frames held in `prespch_buf`, the function returns here and the input speech data is ignored entirely. I have verified this case, and it looks like a bug.
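To make the early return easier to see in isolation, here is a minimal toy model. It is not sphinxbase code: `toy_fe`, `toy_process`, and the frame counts are hypothetical, and the model only assumes that buffered frames are drained into the same output budget before any input is read.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not sphinxbase code: every name here is hypothetical. */
typedef struct {
    int buffered;   /* frames waiting in the "prespeech" buffer */
} toy_fe;

/* Fill up to *inout_nframes output frames. Buffered frames are drained
 * first; if they use up the whole budget, return before reading input. */
static int toy_process(toy_fe *fe, size_t *inout_nsamps, int *inout_nframes,
                       size_t frame_shift)
{
    int out = 0;

    assert(frame_shift > 0);

    /* Drain buffered frames into the caller's output budget. */
    while (fe->buffered > 0 && *inout_nframes > 0) {
        fe->buffered--;
        (*inout_nframes)--;
        out++;
    }
    if (*inout_nframes < 1) {
        /* Early return: *inout_nsamps was never touched. */
        *inout_nframes = out;
        return 0;
    }

    /* Only reached when the budget survives the drain. */
    while (*inout_nframes > 0 && *inout_nsamps >= frame_shift) {
        *inout_nsamps -= frame_shift;
        (*inout_nframes)--;
        out++;
    }
    *inout_nframes = out;
    return 0;
}
```

With 5 buffered frames and a budget of 3, the call hands back 3 frames and leaves the sample count exactly where it started. What happens to those unread samples depends on the caller, which is the behaviour in question here.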

wutiantong commented 7 years ago

The same problem occurs at lines 525-535:

    /* Process all remaining frames. */
    while (*inout_nframes > 0 && *inout_nsamps >= (size_t)fe->frame_shift) {
        fe_shift_frame(fe, *inout_spch, fe->frame_shift);
        fe_write_frame(fe, buf_cep[outidx], voiced_spch != NULL);

        outidx = fe_check_prespeech(fe, inout_nframes, buf_cep, outidx, out_frameidx, inout_nsamps, orig_nsamps);

        /* Update input-output pointers and counters. */
        *inout_spch += fe->frame_shift;
        *inout_nsamps -= fe->frame_shift;
    }

If `fe_write_frame` flips `vad_data->in_speech` from false to true, `fe_check_prespeech` can use up all of `inout_nframes` on frames from `vad_data->prespch_buf`, which ends the while loop partway through the input. The remaining speech data is then skipped, even though the code that follows tries to handle __overflow_samps__. I'm sure some speech data is skipped here.
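The shape of that loop can be reproduced with a small stand-alone sketch. Everything in it is an assumption made for illustration: `toy_vad`, `toy_loop`, and the rule that the VAD switches on at a fixed frame index are hypothetical, and the buffer is unbounded, which the real prespeech buffer may not be.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model with hypothetical names; it only mimics the shape of the loop. */
typedef struct {
    int in_speech;      /* current VAD decision */
    int prespch_count;  /* frames held back while in silence */
    int speech_at;      /* frame index at which the toy VAD switches on */
    int frame_idx;      /* frames seen so far */
} toy_vad;

/* Returns the number of input samples still unprocessed when the loop exits. */
static size_t toy_loop(toy_vad *v, size_t nsamps, int nframes_budget,
                       size_t frame_shift, int *frames_out)
{
    assert(frame_shift > 0);
    *frames_out = 0;

    while (nframes_budget > 0 && nsamps >= frame_shift) {
        int was_in_speech = v->in_speech;

        /* Stand-in for writing a frame: the VAD state may flip here. */
        if (v->frame_idx == v->speech_at)
            v->in_speech = 1;
        v->frame_idx++;

        if (!v->in_speech) {
            v->prespch_count++;          /* held back, nothing emitted */
        } else if (!was_in_speech) {
            /* Transition: flush held frames plus the current one,
             * all charged against the same output budget. */
            int pending = v->prespch_count + 1;
            while (pending > 0 && nframes_budget > 0) {
                pending--;
                nframes_budget--;
                (*frames_out)++;
            }
            v->prespch_count = pending;
        } else {
            nframes_budget--;            /* ordinary speech frame */
            (*frames_out)++;
        }

        nsamps -= frame_shift;
    }
    return nsamps;
}
```

Feeding ten frames of input (1600 samples at a shift of 160) with a budget of 3 and speech starting at frame 4, the flush eats the whole budget on the transition and the loop exits with 800 samples unread and 2 frames still buffered.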

nshmyrev commented 7 years ago

Honestly, there are many issues here. Yes, data is sometimes skipped. What we really need is a frontend rework rather than individual bug fixes: a completely new architecture with proper estimation of parameters. If you are interested in working on this, I can outline the design in a document.

wutiantong commented 7 years ago

Good to hear that. Yes, I'm interested, though I probably lack the experience for this work. I can't promise anything, but I'll try my best.

dhdaines commented 2 years ago

In my opinion, despite what is claimed on https://cmusphinx.github.io/wiki/faq/, noise suppression should be done externally. The VAD and noise removal code has added even more complexity to a frontend that was already too complex. This matters particularly for a live application, where we do not want to manage the audio input at all: that will be done by an external audio graph or pipeline such as GStreamer, which is how it has been done on all platforms for quite some time now. Putting VAD in the gst-plugin was the right idea.

Given that PocketSphinx development is essentially abandoned, we should revert to the 0.8 frontend code, particularly since alignment in batch mode is a common use case and we never want to discard any input in that case.

We should also discard the audio library entirely, as its API is backwards for any modern platform, where audio is always pushed to a processing node. The feature extractor should extract features and do nothing else. This is what I have done in SoundSwallower, for instance: https://github.com/ReadAlongs/SoundSwallower
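As a rough illustration of the push model described above, here is a minimal sketch of a framer that accepts whatever the caller pushes and never drops a sample. It is not SoundSwallower's or sphinxbase's API: `toy_framer`, `toy_push`, `toy_count`, and the fixed shift of 160 samples are all hypothetical, and window overlap is left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_FRAME_SHIFT 160   /* hypothetical: samples per frame */

typedef struct {
    short carry[TOY_FRAME_SHIFT]; /* samples left over from earlier pushes */
    size_t ncarry;                /* how many of them are valid */
} toy_framer;

typedef void (*toy_frame_cb)(const short *frame, size_t n, void *user);

/* Push any number of samples. The callback fires once per complete frame;
 * any remainder stays in the carry buffer for the next push. */
static void toy_push(toy_framer *f, const short *samps, size_t n,
                     toy_frame_cb cb, void *user)
{
    assert(f->ncarry < TOY_FRAME_SHIFT);

    while (n > 0) {
        size_t want = TOY_FRAME_SHIFT - f->ncarry;
        size_t take = n < want ? n : want;

        memcpy(f->carry + f->ncarry, samps, take * sizeof(short));
        f->ncarry += take;
        samps += take;
        n -= take;

        if (f->ncarry == TOY_FRAME_SHIFT) {
            cb(f->carry, TOY_FRAME_SHIFT, user);
            f->ncarry = 0;
        }
    }
}

/* Example callback: counts frames through the user pointer. */
static void toy_count(const short *frame, size_t n, void *user)
{
    (void)frame;
    (void)n;
    ++*(int *)user;
}
```

Pushing two chunks of 250 samples yields three frames and leaves 20 samples carried over, so every sample is either emitted or retained. There is no output budget for buffered frames to exhaust, which is the property that matters for batch alignment.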

cmusphinx / sphinxbase

`fe_process_frames_ext` can discard speech data? #41