Open · MediGenie opened this issue 10 months ago
Thank you for your interest in flutter_silero_vad!
I have updated the README.md to include a description of how it works. If there are any points that are unclear, please feel free to ask further questions.
Thank you!
Hey @char5742!
So once again, thank you for your helpful "How it works" tutorial. I am referencing it because I have been using VAD on the web, and I am now trying to transition my parameters/settings (https://wiki.vad.ricky0123.com/docs/user/algorithm#configuration) to your code.
Parameters using Ricky0123:

- positiveSpeechThreshold (number): determines the threshold over which a probability is considered to indicate the presence of speech.
- negativeSpeechThreshold (number): determines the threshold under which a probability is considered to indicate the absence of speech.
- redemptionFrames (number): number of speech-negative frames to wait before ending a speech segment.
- frameSamples (number): the size of a frame in samples; 1536 by default and probably should not be changed.
- preSpeechPadFrames (number): number of audio frames to prepend to a speech segment.
- minSpeechFrames (number): minimum number of speech-positive frames for a speech segment.
Your code:
> For the initialize method, the arguments are as follows:
>
> - modelPath: The path to the Silero VAD onnx model.
> - sampleRate: The sample rate of the audio file you want to detect.
> - frameSize: The size of the segment to detect (Silero VAD is trained with 30 ms).
> - threshold
> - minSilenceDurationMs: After it becomes silent, this duration will be included in the detection segment.
> - speechPadMs: Currently not in use.
>
> About resetState: Since Silero VAD is an RNN, the model has a state. Calling resetState will reset the model's state.
Question:
Let me know if you have a Patreon! I would love to donate some coffee!
Greetings from Seoul! KJ
Hello, Med!
I'm glad you found the "How it works" tutorial helpful! Regarding your questions about transitioning your parameters/settings to the Silero VAD code:
frameSize: In this library, the frameSize parameter represents the duration of the audio window that the model analyzes at once, in milliseconds (ms). For example, the C++ sample uses a frameSize of 64. This value is a trade-off: increasing it can improve accuracy, but it also increases processing time. In the documentation you linked, the default frame size is 1536, but note that it is given in samples (frameSamples) rather than in milliseconds.
Creating similar parameters:

- positiveSpeechThreshold: corresponds to threshold.
- negativeSpeechThreshold: this is hardcoded as 0.15 less than the threshold value.
- redemptionFrames: corresponds to minSilenceDurationMs (a frame count versus a duration in milliseconds).
- frameSamples: appears to correspond to frameSize (samples versus milliseconds).
- preSpeechPadFrames and minSpeechFrames are not handled directly by this library; to get that behavior you would need to implement additional logic on top of the current library functions (see the sketch below).
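To make the mapping concrete, here is a minimal sketch of how ricky0123-style settings could be translated into initialize arguments. The numeric values (16 kHz sample rate, 1536-sample frames, 8 redemption frames, 0.5 threshold), the helper name initVad, and the import path are assumptions for illustration only; the initialize parameter names are the ones discussed in this thread.

```dart
import 'package:flutter_silero_vad/flutter_silero_vad.dart';

// Hypothetical ricky0123-style settings (illustrative values, not recommendations).
const sampleRate = 16000;            // Hz
const positiveSpeechThreshold = 0.5; // maps to `threshold`
const redemptionFrames = 8;          // speech-negative frames before a segment ends
const frameSamples = 1536;           // samples per frame

Future<void> initVad(String modelPath) async {
  final vad = FlutterSileroVad();

  // frameSize is in milliseconds: 1536 samples at 16 kHz is 96 ms.
  final frameSizeMs = frameSamples * 1000 ~/ sampleRate;

  // redemptionFrames counts frames, while minSilenceDurationMs is a duration,
  // so convert the frame count to milliseconds.
  final minSilenceDurationMs = redemptionFrames * frameSizeMs;

  await vad.initialize(
    modelPath: modelPath,
    sampleRate: sampleRate,
    frameSize: frameSizeMs,
    threshold: positiveSpeechThreshold,
    minSilenceDurationMs: minSilenceDurationMs,
    speechPadMs: 0, // currently not used by the library
  );
}
```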
I hope this helps you configure your Silero VAD setup accordingly.
I'm glad to hear that you would consider supporting my work in such a generous way. However, I don't have a Patreon account at the moment, so please just take my assistance as a goodwill gesture. Your appreciation is more than enough for me!
Let me know if you have any further questions or need additional assistance!
Wow, super. Thank you so much for your detailed feedback. I am extremely grateful.
I was wondering if your library has a way to detect or cancel the device's own audio output. There is a feedback loop where the sound of someone speaking from the phone's speaker feeds back into the VAD.
Thank you so much! KJ
The feedback loop issue you mentioned can indeed be addressed with echo cancellation technology. This technology is used to prevent echoes and feedback caused by sound from the speakers entering the microphone. I have a repository named audio_streamer that includes samples of implementing echo cancellation. This might be helpful in resolving your issue.
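For reference, a minimal sketch of an audio session configuration that requests the platform echo canceller (using the audio_session package; iOS's voiceChat mode and Android's voiceCommunication usage) might look like the following. Whether it fully removes the feedback depends on the device and OS.

```dart
import 'package:audio_session/audio_session.dart';

/// Configure the audio session so the platform's echo canceller is requested:
/// voiceChat mode on iOS, voiceCommunication usage on Android.
Future<void> enableEchoCancellation() async {
  final session = await AudioSession.instance;
  await session.configure(AudioSessionConfiguration(
    avAudioSessionCategory: AVAudioSessionCategory.playAndRecord,
    avAudioSessionMode: AVAudioSessionMode.voiceChat,
    androidAudioAttributes: const AndroidAudioAttributes(
      contentType: AndroidAudioContentType.speech,
      usage: AndroidAudioUsage.voiceCommunication,
    ),
  ));
}
```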
I wanted to take a moment to express my heartfelt gratitude for your support through GitHub Sponsors.
Hey, so thank you so much for this. I am testing the basic app, but the demo app is unable to detect the sound... (maybe the minimum threshold decibel is set too high, or something with the audio input?) It seems the Flutter VAD example was already using an audio streamer of some sort, and what I did was replace it with the one you shared with me 4 days ago. Can you share some tips or help me debug what might be the issue? Thank you.
I apologize, but I have not been able to reproduce the issue on an iPhone 12 with iOS 17.2.1 and a Pixel 6a with Android 14.
Could you please share more details about your environment?
Hello! So I am running Android 9 on a Galaxy Note 8, and iOS 16.7.5 on an iPhone 8 Plus. I am sending you recorder.dart below.
```dart
final recorder = AudioStreamer.instance;
final vad = FlutterSileroVad();

Future<String> get modelPath async => ...; // truncated in the paste

/// Bits per sample
final int bitsPerSample = 16;

/// Number of channels
final int numChannels = 1;

bool isInited = false;

/// Variable for storing the immediately preceding audio data
final lastAudioData = <int>[];

/// Used to keep saving audio data for a few seconds after speech stops
DateTime? lastActiveTime;

final processedAudioStreamController = StreamController<List<int>>();

AudioPlayer audioPlayer = AudioPlayer();
bool isLoading = false;
bool isPlaying = false;
bool isThinking = false;
AppState _currentStatus = AppState.standby;
String _responseText = ''; // Variable to store the server response text

AppState get currentStatus => _currentStatus;
String get responseText => _responseText;

final frameBuffer = <int>[];
Future<void> init() async {
  isInited = true;
}
Future<void> record(StreamController<List<int>> controller,
    {bool echoCancellation = true}) async {
final session = await AudioSession.instance;
await session.configure(AudioSessionConfiguration(
avAudioSessionCategory: AVAudioSessionCategory.playAndRecord,
avAudioSessionCategoryOptions: AVAudioSessionCategoryOptions.allowBluetooth | AVAudioSessionCategoryOptions.defaultToSpeaker,
// Setting the iOS mode to voiceChat enables echo cancellation.
avAudioSessionMode: echoCancellation ? AVAudioSessionMode.voiceChat : AVAudioSessionMode.defaultMode,
avAudioSessionRouteSharingPolicy: AVAudioSessionRouteSharingPolicy.defaultPolicy,
avAudioSessionSetActiveOptions: AVAudioSessionSetActiveOptions.none,
androidAudioAttributes: const AndroidAudioAttributes(
contentType: AndroidAudioContentType.speech,
flags: AndroidAudioFlags.none,
usage: AndroidAudioUsage.voiceCommunication,
),
androidAudioFocusGainType: AndroidAudioFocusGainType.gain,
androidWillPauseWhenDucked: true,
));
await recorder.startRecording(echoCancellation ? 7 : 0);
await onnxModelToLocal();
await vad.initialize(
modelPath: await modelPath,
sampleRate: sampleRate,
frameSize: frameSize,
threshold: 0.2,
minSilenceDurationMs: 500,
speechPadMs: 0,
);
// Cancel any existing subscriptions.
await recordingDataSubscription?.cancel();
await processedAudioSubscription?.cancel();
recordingDataSubscription = recorder.audioStream.listen((buffer) async {
//debugPrint('buffer length: ${buffer.length}');
final data = _transformBuffer(buffer);
if (data.isEmpty || isThinking) return;
frameBuffer.addAll(buffer);
while (frameBuffer.length >= frameSize * 2 * sampleRate ~/ 1000) {
final b = frameBuffer.take(frameSize * 2 * sampleRate ~/ 1000).toList();
frameBuffer.removeRange(0, frameSize * 2 * sampleRate ~/ 1000);
await _handleProcessedAudio(b);
}
controller.add(data);
});
processedAudioSubscription = processedAudioStreamController.stream.listen((buffer) async {
if (isPlaying || isThinking) return;
String outputPath = '${(await getApplicationDocumentsDirectory()).path}/output.wav';
double duration = saveAsWav(buffer, outputPath);
debugPrint('duration == $duration');
if (duration < 0.4) return;
debugPrint('saved == $outputPath');
_currentStatus = AppState.thinking;
isThinking = true;
String responseAudio = await _uploadFile(outputPath, "");
if (isLoading) {
isLoading = false;
_currentStatus = AppState.speaking;
await playAudio(responseAudio);
}
});
}
// (other helper methods omitted)

Int16List _transformBuffer(List<int> buffer) {
  final bytes = Uint8List.fromList(buffer);
  return Int16List.view(bytes.buffer);
}

void printVolume(List<int> data) {
  final samples = _transformBuffer(data);
  var sum = 0.0;
  for (final sample in samples) {
    sum += sample * sample;
  }
  final double rms = sqrt(sum / (data.length / 2)); // RMS calculation
  final double volume = 20 * log(rms) / ln10; // convert to decibels
  debugPrint('Volume: $volume dB');
}
static const threshold = 900; // This threshold needs tuning depending on the voice level
static const bufferTimeInMilliseconds = 700;
final audioDataBuffer = <int>[];
Future<void> _handleProcessedAudio(List<int> buffer) async {
  final transformedBuffer = _transformBuffer(buffer);
  final transformedBufferFloat =
      transformedBuffer.map((e) => e / 32768).toList();
final isActivated = await vad.predict(Float32List.fromList(transformedBufferFloat));
//debugPrint(isActivated.toString());
if (isActivated == true) {
if (!isPlaying) {
_currentStatus = AppState.listening;
}
lastActiveTime = DateTime.now();
audioDataBuffer.addAll(lastAudioData);
lastAudioData.clear();
audioDataBuffer.addAll(buffer);
if (isPlaying) {
isPlaying = false;
audioPlayer.stop();
}
} else if (lastActiveTime != null) {
audioDataBuffer.addAll(buffer);
debugPrint(DateTime.now().difference(lastActiveTime!).toString());
// If a certain amount of time has passed, save the audio data
if (DateTime.now().difference(lastActiveTime!) > const Duration(milliseconds: bufferTimeInMilliseconds)) {
processedAudioStreamController.add([...audioDataBuffer]);
audioDataBuffer.clear();
lastActiveTime = null;
}
} else {
// No speech detected
lastAudioData.addAll(buffer);
// Keep the last 0.5 seconds of data buffered
final threshold = sampleRate * 500 ~/ 1000;
if (lastAudioData.length > threshold) {
lastAudioData.removeRange(0, lastAudioData.length - threshold);
}
}
}
```
Here is where I am using this recorder.dart:
```dart
Widget build(BuildContext context, WidgetRef ref) {
final recorderServiceState = ref.watch(recorderServiceProvider);
final appState = useState(AppState.standby);
final responseText = useState('');
String lottieFile = 'assets/lottie/lottie1.json';
String statusText = "I am waiting for you to speak...";
_currentStatus = recorderServiceState.currentStatus;
switch (_currentStatus) {
case AppState.standby:
lottieFile = 'assets/lottie/lottie1.json';
statusText = "I am waiting for you to speak...";
break;
case AppState.listening:
lottieFile = 'assets/lottie/lottie2.json';
statusText = "I'm listening";
break;
case AppState.thinking:
lottieFile = 'assets/lottie/lottie3.json';
statusText = "I'm thinking";
break;
case AppState.speaking:
lottieFile = 'assets/lottie/lottie4.json';
statusText = 'I am speaking';
break;
}
//debugPrint(recorderServiceState.currentStatus.toString());
final controller = useStreamController<List<int>>();
final spots = useState<List<int>>([]);
useOnAppLifecycleStateChange((beforeState, currState) {
if (currState == AppLifecycleState.resumed) {
ref.read(recorderServiceProvider).record(controller);
} else if (currState == AppLifecycleState.paused) {
ref.read(recorderServiceProvider).stopRecorder();
}
});
useEffect(() {
// Fetch the token, then initialize the recorder and start recording
Future<void> initializeAndStartRecording() async {
if (SharedPreferencesManager.getString(TOKEN) == null) {
// Generate a UUID
var uuid = const Uuid();
final String identifier = uuid.v4();
var token = await NetworkManager().fetchToken(identifier);
if (token != null) {
debugPrint("Network Token: $token");
await SharedPreferencesManager.setString(TOKEN, token);
try {
// Initialize recorderService and start recording
await ref.read(recorderServiceProvider).init();
debugPrint("Recorder initialized");
await ref.read(recorderServiceProvider).record(controller);
debugPrint("Recorder started");
} catch (e) {
debugPrint("Error initializing or starting the recorder: $e");
}
} else {
debugPrint("Token fetching failed");
return;
}
} else {
debugPrint("Token: ${SharedPreferencesManager.getString(TOKEN)!}");
NetworkManager().setBearerToken(SharedPreferencesManager.getString(TOKEN)!);
try {
// Initialize recorderService and start recording
await ref.read(recorderServiceProvider).init();
debugPrint("Recorder initialized");
await ref.read(recorderServiceProvider).record(controller);
debugPrint("Recorder started");
} catch (e) {
debugPrint("Error initializing or starting the recorder: $e");
}
}
}
// Run the function
initializeAndStartRecording();
// Set up a listener on the controller's stream
final subscription = controller.stream.listen((event) {
final buffer = event.toList();
spots.value = buffer;
});
// Set up the listener
final listener = ref.listen(recorderServiceProvider, (_, state) {
appState.value = state.currentStatus;
responseText.value = state.responseText; // Assign an empty string if null
});
// Return a cleanup function that runs when the component unmounts
return () {
subscription.cancel(); // Cancel the subscription
listener; // Cancel the listener subscription
};
}, []); // The dependency array is empty, so this runs only once when the component mounts.
```
Thank you so much!
@char5742 Can you please give an example of how to correctly read a .wav file? I cannot get it working. With threshold <= 0.9, vad.predict always returns true; with threshold >= 0.95, it always returns false.
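For reference, a minimal sketch of one way to read a 16-bit PCM mono WAV file and feed it to vad.predict, using the same Int16-to-float normalization (divide by 32768) as the recorder code above, might look like this. It assumes a canonical 44-byte WAV header, little-endian samples, and a frame size matching the one passed to initialize; it is not an official example from the library.

```dart
import 'dart:io';
import 'dart:typed_data';

import 'package:flutter_silero_vad/flutter_silero_vad.dart';

/// Reads a 16-bit PCM mono WAV file and runs VAD frame by frame.
/// Assumes a canonical 44-byte header; real WAV files can contain extra chunks.
Future<void> predictFromWav(FlutterSileroVad vad, String path,
    {int sampleRate = 16000, int frameSizeMs = 64}) async {
  final bytes = await File(path).readAsBytes();
  final pcm = Int16List.view(bytes.buffer, 44); // skip the 44-byte header

  // frameSizeMs should match the frameSize given to vad.initialize.
  final samplesPerFrame = frameSizeMs * sampleRate ~/ 1000;

  for (var i = 0; i + samplesPerFrame <= pcm.length; i += samplesPerFrame) {
    // Normalize Int16 samples to floats in [-1, 1] before prediction.
    final frame = Float32List(samplesPerFrame);
    for (var j = 0; j < samplesPerFrame; j++) {
      frame[j] = pcm[i + j] / 32768;
    }
    final isSpeech = await vad.predict(frame);
    print('frame ${i ~/ samplesPerFrame}: $isSpeech');
  }
}
```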
Hello! Thanks for the plugin!
I was wondering how hard it would be to create an API like this one using Flutter: https://github.com/ricky0123/vad. Would it work out of the box? Sorry if this question is basic.
Thank you! KJ