char5742 / flutter_silero_vad

This is an unofficial plugin for calling the Silero VAD ONNX model in Flutter.

Using Flutter_Silero_VAD #5

Open MediGenie opened 9 months ago

MediGenie commented 9 months ago

Hello! thanks for the plugin!

I was wondering how hard it would be to create an API like this using Flutter: https://github.com/ricky0123/vad. Would it work out of the box? Sorry if this question is basic.

thank you! KJ

char5742 commented 9 months ago

Thank you for your interest in flutter_silero_vad!

I have updated the README.md to include a description of how it works. If there are any points that are unclear, please feel free to ask further questions.

MediGenie commented 9 months ago

Thank you! 🙏🏻

MediGenie commented 9 months ago

Hey @char5742!

Once again, thank you for the helpful "How it works" tutorial. I'm referencing it because I have been using the web VAD and am now trying to carry my parameters/settings over to your code (https://wiki.vad.ricky0123.com/docs/user/algorithm#configuration).

Parameters using ricky0123/vad:

  * positiveSpeechThreshold: number - determines the threshold over which a probability is considered to indicate the presence of speech.
  * negativeSpeechThreshold: number - determines the threshold under which a probability is considered to indicate the absence of speech.
  * redemptionFrames: number - number of speech-negative frames to wait before ending a speech segment.
  * frameSamples: number - the size of a frame in samples - 1536 by default and probably should not be changed.
  * preSpeechPadFrames: number - number of audio frames to prepend to a speech segment.
  * minSpeechFrames: number - minimum number of speech-positive frames for a speech segment.

Your code:

For the initialize method, the arguments are as follows:

  * modelPath: The path to the Silero VAD onnx model.
  * sampleRate: The sample rate of the audio file you want to detect.
  * frameSize: The size of the segment to detect (Silero VAD is trained with 30ms).
  * threshold
  * minSilenceDurationMs: After it becomes silent, this duration will be included in the detection segment.
  * speechPadMs: Currently not in use.

About resetState: Since Silero VAD is an RNN, the model has a state. Calling resetState will reset the model's state.
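
For context, here is a minimal sketch of how I currently understand these pieces fitting together at runtime. The values are placeholders, and the import path and the frame stream are my own assumptions, not something from the plugin docs:

```dart
import 'dart:typed_data';

import 'package:flutter_silero_vad/flutter_silero_vad.dart'; // import path assumed

Future<void> runVad(String modelPath, Stream<Float32List> frames) async {
  final vad = FlutterSileroVad();

  // Initialize once with the arguments listed above (placeholder values).
  await vad.initialize(
    modelPath: modelPath,
    sampleRate: 16000,
    frameSize: 30, // unit asked about below
    threshold: 0.5,
    minSilenceDurationMs: 300,
    speechPadMs: 0, // currently not in use
  );

  // Feed one frame of float samples at a time; predict reports whether the
  // frame contains speech.
  await for (final frame in frames) {
    final isSpeech = await vad.predict(frame);
    if (isSpeech == true) {
      // speech detected in this frame
    }
  }

  // Silero VAD is an RNN, so reset the model state between sessions.
  vad.resetState();
}
```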

Question:

  1. Is each frameSize in 100ms for your code? What are the units?
  2. Is there a way I can "create" similar parameters using your code? Thank you so much again!

Let me know if you have a Patreon! I would love to donate some coffee!

Greetings from Seoul! KJ

char5742 commented 9 months ago

Hello, @Med!

I'm glad you found the "How it works" tutorial helpful! Regarding your questions about transitioning your parameters/settings to the Silero VAD code:

  1. frameSize: In this code, the frameSize parameter is the duration of the audio window the model analyzes at once, and its unit is milliseconds (ms). For example, the C++ sample uses a frameSize of 64. This value is a trade-off: increasing it can improve accuracy but also lengthens processing time. In the documentation you linked, frameSamples is 1536 samples, which at a 16 kHz sample rate corresponds to 96 ms (see the sketch after this list).

  2. Creating similar parameters: positiveSpeechThreshold corresponds to threshold. negativeSpeechThreshold is not exposed; it is hardcoded to 0.15 less than threshold. redemptionFrames corresponds to minSilenceDurationMs (note the unit change from frames to milliseconds). frameSamples corresponds to frameSize (samples versus milliseconds, so convert using the sample rate). Parameters like preSpeechPadFrames and minSpeechFrames are not handled directly in this library; to incorporate such features, you would need to implement additional logic outside the current library functions. A rough mapping is sketched below.
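
To make the mapping concrete, here is a minimal Dart sketch of deriving the initialize arguments from ricky0123/vad-style settings. The numeric values are examples only, and the import path and helper function are assumptions rather than part of either library:

```dart
import 'package:flutter_silero_vad/flutter_silero_vad.dart'; // import path assumed

Future<void> configureVad(String modelPath) async {
  // ricky0123/vad-style settings (example values only).
  const positiveSpeechThreshold = 0.5;
  const redemptionFrames = 8;
  const frameSamples = 1536; // frame length in samples
  const sampleRate = 16000; // Hz

  // flutter_silero_vad takes the frame size in milliseconds,
  // so convert samples -> ms: 1536 / 16000 s = 96 ms.
  final frameSizeMs = frameSamples * 1000 ~/ sampleRate; // 96

  // redemptionFrames counts frames, while minSilenceDurationMs is a duration,
  // so multiply by the frame duration: 8 * 96 ms = 768 ms.
  final minSilenceDurationMs = redemptionFrames * frameSizeMs; // 768

  final vad = FlutterSileroVad();
  await vad.initialize(
    modelPath: modelPath,
    sampleRate: sampleRate,
    frameSize: frameSizeMs,
    threshold: positiveSpeechThreshold, // plays the role of positiveSpeechThreshold
    minSilenceDurationMs: minSilenceDurationMs,
    speechPadMs: 0, // currently unused
  );
  // negativeSpeechThreshold has no argument here; it is derived internally
  // as threshold - 0.15.
}
```

preSpeechPadFrames and minSpeechFrames would still need your own buffering logic around predict, as noted above.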

I hope this helps you configure your Silero VAD setup accordingly.

I'm glad to hear that you would consider supporting my work in such a generous way. However, I don't have a Patreon account at the moment, so please just take my assistance as a goodwill gesture. Your appreciation is more than enough for me!

Let me know if you have any further questions or need additional assistance!

MediGenie commented 9 months ago

wow, super. thank you so much for your detailed feedback. I am extremely grateful.

I was wondering if your library has a way to detect or cancel audio output from the device. There is a feedback loop where the sound of someone speaking from the phone's speaker feeds back into the VAD.

thank you so much! KJ

char5742 commented 8 months ago

The feedback loop issue you mentioned can indeed be addressed with echo cancellation technology. This technology is used to prevent echoes and feedback caused by sound from the speakers entering the microphone. I have a repository named audio_streamer that includes samples of implementing echo cancellation. This might be helpful in resolving your issue.
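
For iOS specifically, one common way to get the platform's acoustic echo cancellation from Flutter is to configure the audio session for voice chat. Below is a minimal sketch using the audio_session package (the same configuration shows up in the recorder code shared later in this thread); treat it as an illustration rather than the exact implementation used in audio_streamer:

```dart
import 'package:audio_session/audio_session.dart';

/// Configures the audio session so the OS applies echo cancellation:
/// voiceChat mode on iOS, voiceCommunication usage on Android.
Future<void> enableEchoCancellation() async {
  final session = await AudioSession.instance;
  await session.configure(AudioSessionConfiguration(
    avAudioSessionCategory: AVAudioSessionCategory.playAndRecord,
    // voiceChat activates Apple's built-in echo cancellation.
    avAudioSessionMode: AVAudioSessionMode.voiceChat,
    androidAudioAttributes: const AndroidAudioAttributes(
      contentType: AndroidAudioContentType.speech,
      usage: AndroidAudioUsage.voiceCommunication,
    ),
    androidAudioFocusGainType: AndroidAudioFocusGainType.gain,
  ));
}
```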

I wanted to take a moment to express my heartfelt gratitude for your support through GitHub Sponsors.

MediGenie commented 8 months ago

Hey, thank you so much for this. I am testing the basic app, but the demo app is unable to detect the sound... (maybe the minimum threshold decibel is set too high? or something with the audio input, you think?) It seems like the Flutter VAD was already using an audio streamer of some sort, and what I did was replace it with the one you shared with me 4 days ago. Can you share some tips or help me debug what might be the issue? Thank you.

char5742 commented 8 months ago

I apologize, but I have not been able to reproduce the issue on an iPhone 12 with iOS 17.2.1 and a Pixel 6a with Android 14.

Could you please share more details about your environment?

MediGenie commented 8 months ago

Hello! I am running a Galaxy Note 8 on Android 9, and an iPhone 8 Plus on iOS 16.7.5. I am sending you recorder.dart below.

```dart
final recorder = AudioStreamer.instance;
final vad = FlutterSileroVad();
Future<String> get modelPath async =>
    '${(await getApplicationSupportDirectory()).path}/silero_vad.onnx';
final sampleRate = 16000;
final frameSize = 40; // 80ms

/// Bits per sample
final int bitsPerSample = 16;

/// Number of channels
final int numChannels = 1;

bool isInited = false;

/// Stores the immediately preceding audio data
final lastAudioData = <int>[];

/// Used to save the voice data a short while after speech stops
DateTime? lastActiveTime;
final processedAudioStreamController = StreamController<List<int>>();
StreamSubscription<List<int>>? recordingDataSubscription;
StreamSubscription<List<int>>? processedAudioSubscription;

AudioPlayer audioPlayer = AudioPlayer();
bool isLoading = false;
bool isPlaying = false;
bool isThinking = false;
AppState _currentStatus = AppState.standby;
String _responseText = ''; // Stores the server response text

AppState get currentStatus => _currentStatus;
String get responseText => _responseText;

final frameBuffer = <int>[];

Future init() async {
  var status = await Permission.microphone.request();
  if (status != PermissionStatus.granted) {
    throw Exception('Microphone permission not granted');
  }

  isInited = true;
}

Future record(StreamController<List<int>> controller, [bool echoCancellation = true]) async {
  assert(isInited);

final session = await AudioSession.instance;
await session.configure(AudioSessionConfiguration(
  avAudioSessionCategory: AVAudioSessionCategory.playAndRecord,
  avAudioSessionCategoryOptions: AVAudioSessionCategoryOptions.allowBluetooth | AVAudioSessionCategoryOptions.defaultToSpeaker,
  // Setting the iOS session mode to voiceChat enables echo cancellation.
  avAudioSessionMode: echoCancellation ? AVAudioSessionMode.voiceChat : AVAudioSessionMode.defaultMode,
  avAudioSessionRouteSharingPolicy: AVAudioSessionRouteSharingPolicy.defaultPolicy,
  avAudioSessionSetActiveOptions: AVAudioSessionSetActiveOptions.none,
  androidAudioAttributes: const AndroidAudioAttributes(
    contentType: AndroidAudioContentType.speech,
    flags: AndroidAudioFlags.none,
    usage: AndroidAudioUsage.voiceCommunication,
  ),
  androidAudioFocusGainType: AndroidAudioFocusGainType.gain,
  androidWillPauseWhenDucked: true,
));

await recorder.startRecording(echoCancellation ? 7 : 0);
await onnxModelToLocal();
await vad.initialize(
  modelPath: await modelPath,
  sampleRate: sampleRate,
  frameSize: frameSize,
  threshold: 0.2,
  minSilenceDurationMs: 500,
  speechPadMs: 0,
);

// Cancel any existing subscriptions.
await recordingDataSubscription?.cancel();
await processedAudioSubscription?.cancel();

recordingDataSubscription = recorder.audioStream.listen((buffer) async {
  //debugPrint('buffer length: ${buffer.length}');
  final data = _transformBuffer(buffer);
  if (data.isEmpty || isThinking) return;
  frameBuffer.addAll(buffer);
  while (frameBuffer.length >= frameSize * 2 * sampleRate ~/ 1000) {
    final b = frameBuffer.take(frameSize * 2 * sampleRate ~/ 1000).toList();
    frameBuffer.removeRange(0, frameSize * 2 * sampleRate ~/ 1000);
    await _handleProcessedAudio(b);
  }
  controller.add(data);
});

processedAudioSubscription = processedAudioStreamController.stream.listen((buffer) async {
  if (isPlaying || isThinking) return;
  String outputPath = '${(await getApplicationDocumentsDirectory()).path}/output.wav';
  double duration = saveAsWav(buffer, outputPath);
  debugPrint('duration == $duration');
  if (duration < 0.4) return;
  debugPrint('saved == $outputPath');
  _currentStatus = AppState.thinking;
  isThinking = true;
  String responseAudio = await _uploadFile(outputPath, "");
  if (isLoading) {
    isLoading = false;
    _currentStatus = AppState.speaking;
    await playAudio(responseAudio);
  }
});

}

Future stopRecorder() async {
  await recorder.startRecording();
  if (recordingDataSubscription != null) {
    await recordingDataSubscription?.cancel();
    recordingDataSubscription = null;
    await processedAudioSubscription?.cancel();
    processedAudioSubscription = null;
  }
}

Int16List _transformBuffer(List<int> buffer) {
  final bytes = Uint8List.fromList(buffer);
  return Int16List.view(bytes.buffer);
}

void printVolume(List<int> data) {
  // PCM data is 16-bit (2 bytes), so process it in 2-byte units.
  double sum = 0;
  for (var i = 0; i < data.length; i += 2) {
    final int16 = data[i] + (data[i + 1] << 8); // 16-bit PCM sample
    final double sample = int16 / (1 << 15); // normalize to the -1..1 range
    sum += sample * sample; // sum of squares
  }

  final double rms = sqrt(sum / (data.length / 2)); // RMS
  final double volume = 20 * log(rms) / ln10; // convert to decibels

  debugPrint('Volume: $volume dB');
}

static const threshold = 900; // This threshold needs tuning depending on the voice level
static const bufferTimeInMilliseconds = 700;
final audioDataBuffer = <int>[];

Future _handleProcessedAudio(List<int> buffer) async {
  final transformedBuffer = _transformBuffer(buffer);
  final transformedBufferFloat =
      transformedBuffer.map((e) => e / 32768).toList();

final isActivated = await vad.predict(Float32List.fromList(transformedBufferFloat));
//debugPrint(isActivated.toString());
if (isActivated == true) {
  if (!isPlaying) {
    _currentStatus = AppState.listening;
  }
  lastActiveTime = DateTime.now();
  audioDataBuffer.addAll(lastAudioData);
  lastAudioData.clear();
  audioDataBuffer.addAll(buffer);
  if (isPlaying) {
    isPlaying = false;
    audioPlayer.stop();
  }
} else if (lastActiveTime != null) {
  audioDataBuffer.addAll(buffer);
  debugPrint(DateTime.now().difference(lastActiveTime!).toString());
  // Save the voice data once a certain amount of time has passed
  if (DateTime.now().difference(lastActiveTime!) > const Duration(milliseconds: bufferTimeInMilliseconds)) {
    processedAudioStreamController.add([...audioDataBuffer]);
    audioDataBuffer.clear();
    lastActiveTime = null;
  }
} else {
  // No speech detected
  lastAudioData.addAll(buffer);
  // Keep 5 seconds' worth of data
  final threshold = sampleRate * 500 ~/ 1000;
  if (lastAudioData.length > threshold) {
    lastAudioData.removeRange(0, lastAudioData.length - threshold);
  }
}

}
```
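
For reference on the arithmetic in the recording loop above: frameSize is in milliseconds and the stream carries 16-bit PCM (2 bytes per sample), so the frame length in bytes works out as below, using the values from this snippet:

```dart
void main() {
  const sampleRate = 16000; // samples per second
  const frameSize = 40; // milliseconds per frame

  final samplesPerFrame = frameSize * sampleRate ~/ 1000; // 40 ms -> 640 samples
  final bytesPerFrame = samplesPerFrame * 2; // 16-bit PCM -> 1280 bytes

  print('$samplesPerFrame samples, $bytesPerFrame bytes per frame');
}
```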

I am using this recorder.dart in the following widget.

```dart
Widget build(BuildContext context, WidgetRef ref) {
  final recorderServiceState = ref.watch(recorderServiceProvider);
  final appState = useState(AppState.standby);
  final responseText = useState("");

String lottieFile = 'assets/lottie/lottie1.json';
String statusText = "I am waiting for you to speak...";

_currentStatus = recorderServiceState.currentStatus;

switch (_currentStatus) {
  case AppState.standby:
    lottieFile = 'assets/lottie/lottie1.json';
    statusText = "I am waiting for you to speak...";
    break;
  case AppState.listening:
    lottieFile = 'assets/lottie/lottie2.json';
    statusText = "I'm listening";
    break;
  case AppState.thinking:
    lottieFile = 'assets/lottie/lottie3.json';
    statusText = "I'm thinking";
    break;
  case AppState.speaking:
    lottieFile = 'assets/lottie/lottie4.json';
    statusText = 'I am speaking';
    break;
}

//debugPrint(recorderServiceState.currentStatus.toString());

final controller = useStreamController<List<int>>();
final spots = useState<List<int>>([]);
useOnAppLifecycleStateChange((beforeState, currState) {
  if (currState == AppLifecycleState.resumed) {
    ref.read(recorderServiceProvider).record(controller);
  } else if (currState == AppLifecycleState.paused) {
    ref.read(recorderServiceProvider).stopRecorder();
  }
});
useEffect(() {
  // Fetch the token, then initialize the recorder and start recording
  Future<void> initializeAndStartRecording() async {
    if (SharedPreferencesManager.getString(TOKEN) == null) {
      // Generate a UUID
      var uuid = const Uuid();
      final String identifier = uuid.v4();
      var token = await NetworkManager().fetchToken(identifier);
      if (token != null) {
        debugPrint("Network Token: $token");
        await SharedPreferencesManager.setString(TOKEN, token);
        try {
          // Initialize recorderService and start recording
          await ref.read(recorderServiceProvider).init();
          debugPrint("Recorder initialized");
          await ref.read(recorderServiceProvider).record(controller);
          debugPrint("Recorder started");
        } catch (e) {
          debugPrint("Error initializing or starting the recorder: $e");
        }
      } else {
        debugPrint("Token fetching failed");
        return;
      }
    } else {
      debugPrint("Token: ${SharedPreferencesManager.getString(TOKEN)!}");
      NetworkManager().setBearerToken(SharedPreferencesManager.getString(TOKEN)!);
      try {
        // Initialize recorderService and start recording
        await ref.read(recorderServiceProvider).init();
        debugPrint("Recorder initialized");
        await ref.read(recorderServiceProvider).record(controller);
        debugPrint("Recorder started");
      } catch (e) {
        debugPrint("Error initializing or starting the recorder: $e");
      }
    }
  }

  // Run the function
  initializeAndStartRecording();

  // Set up a listener on the controller's stream
  final subscription = controller.stream.listen((event) {
    final buffer = event.toList();
    spots.value = buffer;
  });

  // Set up a listener
  final listener = ref.listen(recorderServiceProvider, (_, state) {
    appState.value = state.currentStatus;
    responseText.value = state.responseText; // assign an empty string if null
  });

  // Return a cleanup function to run when the component is unmounted
  return () {
    subscription.cancel(); // cancel the stream subscription
    listener; // cancel the listener subscription
  };
}, []); // The dependency array is empty, so this runs only once, when the component is mounted.
```

Thank you so much!

usilitel commented 3 months ago

@char5742 can you please give an example of how to correctly read a .wav file? I cannot get it working. If threshold <= 0.9, vad.predict always returns true; if threshold >= 0.95, vad.predict always returns false.