In general, I don't expect a general purpose single-board-computer to be able to handle this. It would not be something that Mycroft Core will directly implement.
I believe it is best to segment the work into distinct parts as much as possible. So things like beamforming, noise cancellation, etc should be confined to dedicated microphone technology and should just present themselves to Mycroft as a single audio stream.
Sorry to necro this, but now that we have the Pi4 I am wondering whether this is actually possible and also really useful.
I agree beamforming, noise cancellation, etc. should be confined to dedicated microphone technology, but I'm thinking multiple streams could go to Mycroft core.
Placement of your mic relative to the noise in a room can have drastic effects, as anyone will have noticed when the TV or radio is between the two of you. I have been really impressed by the relatively low load of Mycroft on a Pi4, and have been thinking that with the low cost of Bluetooth/WiFi mics/speakers this could actually be really beneficial.
I haven't checked how much load VAD processing creates, but I think it's probably very feasible. With PulseAudio I think you can buffer to a frame.wav, so you could have all inputs running. Then you could run a VAD instance on each source and dump a VAD rating to a text history for that source. So you could probably auto-switch continuously between the various microphones so the best signal is presented to Mycroft.
I am going to have a play with https://github.com/wiseman/py-webrtcvad and see what sort of load it gives.
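For reference, here is a minimal sketch of the per-source VAD rating idea, assuming py-webrtcvad and raw 16 kHz / 16-bit mono PCM captures; the function name and framing are just illustrative.

```python
import webrtcvad

def vad_rating(pcm_bytes, sample_rate=16000, frame_ms=30, aggressiveness=2):
    """Return the fraction of frames webrtcvad flags as speech (0.0 to 1.0)."""
    vad = webrtcvad.Vad(aggressiveness)
    frame_len = int(sample_rate * frame_ms / 1000) * 2  # bytes per 16-bit mono frame
    frames = [pcm_bytes[i:i + frame_len]
              for i in range(0, len(pcm_bytes) - frame_len + 1, frame_len)]
    if not frames:
        return 0.0
    voiced = sum(vad.is_speech(f, sample_rate) for f in frames)
    return voiced / len(frames)
```

Running one of these per source and keeping the recent ratings would give exactly the sort of text history per mic described above.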
Otherwise you could just sum the inputs, but if one mic is being swamped by noise that could make recognition much worse.
@JarbasAl has made some progress toward this and related ends. I don't know if he's addressed multiple mics on the same device, but it does concern multiple lightweight devices each speaking to the same or a parent instance of Mycroft.
Here's a repo with links to repos that might be of interest.
I definitely agree with Mr. Penrod's original assessment, that multiple mics for one device are probably out of scope for core. Summing inputs seems like a no-go not because of noise, but because of mismatched latency which (by definition) can't be calculated by the device itself.
Switching inputs seems like a no-go because Mycroft isn't recording when you aren't speaking, so it won't compensate for noise until after it's processed a noisy utterance - and then it can't verify the new input until you speak to it again.
Hence, every time you spoke to it, it would be comparing the audio quality right now against the previously-known audio quality from the other inputs, each captured at some arbitrary and unpredictable moment in the past (never the same moment for any two microphones), and the circumstances affecting which input is the "best" will usually have changed in the interim (the radio moved, the person moved, a nearby stereo amplifier was powered up...).
You should be able to run as many voice satellites as you want; each can be its own mic. However, this assumes each mic can run a wake word engine, so it's not quite there yet; for example, it might be slow on a Pi Zero or not run at all (untested).
I am exploring options with streaming audio, which should work for any mic/streaming sound source. My hope is to have a reference implementation with a cheap ESP32 device.
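Purely as an illustration (not the actual satellite or ESP32 implementation), a network audio source could be as simple as a TCP listener that accepts one raw 16 kHz / 16-bit mono PCM stream and hands fixed-size frames to a callback; the port, frame size and bare-bones protocol here are all assumptions.

```python
import socket

def serve_pcm(on_frame, host="0.0.0.0", port=12345, frame_bytes=960):
    # 960 bytes = 30 ms of 16-bit mono audio at 16 kHz.
    with socket.create_server((host, port)) as srv:
        conn, _addr = srv.accept()
        with conn:
            buf = b""
            while True:
                chunk = conn.recv(4096)
                if not chunk:
                    break
                buf += chunk
                while len(buf) >= frame_bytes:
                    on_frame(buf[:frame_bytes])
                    buf = buf[frame_bytes:]
```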
The question was "Is it possible to connect multiple microphones and mount them in different parts of a room e.g. one next to the bed and one on the desk?"
Jarbas has a solution with Hive. Dunno about those ESP8266 specifically, as I've read the onboard ADC can be a bit 'clicky'. But it's a good example of the low-price devices that are out there.
In terms of the question, if you're asking whether you could do this natively in the core right now, the answer is no, unless you use the great Hive stuff Jarbas has done.
In terms of dumb mic inputs you probably could, but the core isn't needed for that. My head is still spinning with the versatility of ALSA & PulseAudio: you can make such a huge array of sources, sinks, dmixes, arrays and virtual devices that I would say you definitely could.
Also, summing is probably too much of a simplification, but I presume multiple microphones can be set up as an array through webrtc, e.g. beamforming=1 mic_geometry=-0.03,0,0,-0.01,0,0,0.01,0,0,0.03,0,0
It would be interesting to see, as the geometry is given in metres and the result should be, in very simplified terms, a latency-corrected sum.
I presume even beamforming array mics can be treated as individual points of a virtual array, making one 'large' room mic.
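To show what "latency-corrected sum" means in the very simplest terms, here is a hedged 1-D delay-and-sum sketch over already-captured mono channels, with mic and source positions in metres along a single axis; the names and geometry are assumptions, and webrtc's actual beamformer does far more than this.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second

def delay_and_sum(signals, mic_positions_m, source_position_m, sample_rate=16000):
    dists = [abs(source_position_m - p) for p in mic_positions_m]
    nearest = min(dists)
    aligned = []
    for sig, d in zip(signals, dists):
        # A mic that is (d - nearest) metres farther away hears the source
        # that many samples later, so drop its leading lag samples.
        lag = int(round((d - nearest) / SPEED_OF_SOUND * sample_rate))
        aligned.append(np.asarray(sig, dtype=float)[lag:])
    n = min(len(a) for a in aligned)
    return sum(a[:n] for a in aligned) / len(aligned)
```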
It's not part of the core, as it's all part of ALSA/PulseAudio, but I think a mic setup routine that is a bit more complex than the current one might be a great addition to the core.
I have done my usual: I saw a great example of a pulse stream using a wav file buffer that could very well act as a source for a VAD level detector. That could be routed to a dmix that mutes inputs below a threshold, and that stream is presented to Precise.
I think mic routing and a preprocessor in front of Precise could have a lot of benefits, as much of it is just config. ALSA & PulseAudio are so comprehensive that there is a huge amount you can do under the hood with config alone, but after a week I am not much closer, apart from thinking that with a few more brain cells I could be.
I think you would have to define an array in ALSA that PulseAudio could use, but essentially there is no difference between the 4 mics on a ReSpeaker and 4 mics in a room, apart from the distance between them and their x, y, z orientation.
If you flash a desktop version of Raspbian and have a look at pavucontrol, as in https://community.mycroft.ai/t/ps3-eye-best-settings/8152/6?u=stuartiannaylor, it makes things easier to visualise. In /etc/pulse/default.pa I have:
load-module module-echo-cancel use_master_format=1 aec_method=webrtc aec_args="analog_gain_control=0\ digital_gain_control=1\ noise_suppression=1\ voice_detection=1\ beamforming=1\ mic_geometry=-0.03,0,0,-0.01,0,0,0.01,0,0,0.03,0,0" source_name=echoCancel_source sink_name=echoCancel_sink
set-default-source echoCancel_source
set-default-sink echoCancel_sink
CNX also had an example for my PS3 Eye, which I am not using as it seems to be detected anyway. https://www.cnx-software.com/2019/08/30/using-sony-ps3-eye-camera-as-an-inexpensive-microphone-array/
I have been trying to set up a softvol though, as currently the volume is rather low:
pcm.array_gain {
    type softvol
    slave {
        pcm "array"
    }
    control {
        name "Mic Gain"
        count 2
        card 0
    }
    min_dB -40.0
    max_dB 10.0
    resolution 80
}
My volume maxes out at 150%, but with ALSA they set up a 10 dB softvol. So yeah, it's all very possible, but getting it working can be a different matter. Like the above, take voice_detection=1: I am not sure why it's there or what it does, as I thought it was a detection filter rather than an active filter. Also, the PulseAudio documentation isn't great at times; load-module module-echo-cancel might, with luck, actually do what I am suggesting with VAD anyway, but without playing I am not really sure, and dunno if I will be any wiser after :)
I've got a feeling though that a preprocessing module for Mycroft would be beneficial and have a few different uses, from using VAD to trigger Precise to using beamforming to select the best mic reception; I'm not sure whether I am already doing this on each channel of the array I have set up. Reading through https://github.com/freedesktop/pulseaudio-webrtc-audio-processing is going to take some time for the likes of me :)
I'm relatively lazy (citation needed), so I'd probably set up something like this:
1) Multiple Pis + mics running the recognizer loop + intent determination, but detached from the main skills service.
2) A (probably time-window based) de-duplication interceptor between IntentDeterminationEngine and the skills service captures the intents as observed by each mic/recognizer and picks one based on intent confidence or some other heuristic (see the sketch below).
3) Forward the winning intent to the skills service.
4) Profit!
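A rough sketch of step 2, assuming each mic/recognizer calls on_intent() with a parsed intent and a confidence score; the class, callbacks and window length are made up for illustration and not tied to any existing Mycroft API (locking omitted for brevity).

```python
from threading import Timer

class IntentDeduplicator:
    def __init__(self, forward, window_s=0.5):
        self.forward = forward      # callable that hands an intent to the skills service
        self.window_s = window_s
        self.pending = []           # (confidence, intent) pairs seen in the current window
        self.timer = None

    def on_intent(self, intent, confidence):
        self.pending.append((confidence, intent))
        if self.timer is None:
            self.timer = Timer(self.window_s, self._flush)
            self.timer.start()

    def _flush(self):
        best = max(self.pending, key=lambda pair: pair[0])[1]
        self.pending = []
        self.timer = None
        self.forward(best)
```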
@clusterfudge my voice satellite is not that different, apart from the intent determination; instead, every mic gets a skill response from core.
Note that the response is local to each mic; they don't receive each other's responses. A central Mycroft handles individual utterances and also does not speak the answers out loud (answers are handled by the satellite). I can change this trivially now that #2461 is in!
What you describe makes some sense if you want a central unit to answer and the mics are input only, but I would still handle the intent on the Mycroft device.
In this case the issue is capturing the same utterance on different mics; this could be handled with a time window and then selecting the best transcription, since the intent_service already expects an utterance list.
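As a rough illustration of that time-window idea (names are illustrative, not the actual satellite/HiveMind code), something like this could gather the transcriptions of one spoken utterance arriving from different mics and return them as a single candidate list for intent handling.

```python
import time

class UtteranceWindow:
    def __init__(self, window_s=1.0):
        self.window_s = window_s
        self._buffer = []  # (monotonic timestamp, transcription)

    def add(self, transcription):
        now = time.monotonic()
        # Drop candidates older than the window, then add the new one.
        self._buffer = [(t, u) for t, u in self._buffer if now - t < self.window_s]
        self._buffer.append((now, transcription))
        return [u for _, u in self._buffer]  # candidate utterance list
```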
If for some reason we want to do intent determination in the mic, we can use the new IntentApi from #2468
I'm relatively lazy but also a skinflint; Precise seems to need a Pi3 or above, and once you add your SD card, PSU & mic it just adds up to more than it should, even for one next to the bed and one on the desk.
If you're in the same room then you can get a Bluetooth speaker/mic for £10 or less that sometimes even looks quite stylish in a ready-made enclosure.
You would probably want to turn AGC off, but you could even do something quite simple; it doesn't have to be VAD (though that would probably also help), just an automixer that switches between the two mic streams and outputs the loudest on a third, which Precise uses.
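In its dumbest form that automixer could look something like this toy sketch, which just forwards whichever mic's frame has the higher RMS energy; the frame handling and names are assumptions (equal-length 16-bit little-endian mono PCM byte strings, one per mic).

```python
import array
import math

def rms(frame_bytes):
    """Root-mean-square energy of a 16-bit mono PCM frame."""
    samples = array.array("h", frame_bytes)
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0

def pick_loudest(frames):
    """Given one frame per mic, return the loudest one to feed to Precise."""
    return max(frames, key=rms)
```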
You could just have a wired mic doing the same, but that's likely to be cable hell, which is the only reason I keep mentioning wireless.
I can have the Pi4 Mycroft on the shelf, as it's a kitchen/lounge, and have a Bluetooth speaker/mic (probably with the speaker volume off) by the settee. I rarely sit down and watch TV, but trying to cast to the Mycroft over whatever is on the TV never goes down well; with another mic now at my side, there's no need to.
I am just confusing things by thinking you could run Mycroft with various intent threads all based on a mic automixer, purely because I have been impressed by how admirably the Pi4 copes and that for the most part its capacity is redundant, so how could we use it?
I will probably use the webrtc lib's AGC routines purely as a volume estimator, plus the VAD; the only head-scratcher is how to expose that as a PulseAudio source rather than as a stream. I'll try to keep it totally separate, as a preprocessing service. I want to keep it cheap, I don't want loads of Pis or cables, and I think even I could hack a bit of Python for that, even if it has been a couple of years and I don't remember much Python at the moment. But I'm thinking that, with the likes of you guys, you could probably knock up a load of uses for pre/post-processing and routing audio streams, even pairing them with sessions, but that was just my mind wandering.
Closing Issue since we're archiving the repo
Is it possible to connect multiple microphones and mount them in different parts of a room e.g. one next to the bed and one on the desk?