char5742 / flutter_silero_vad

This is an unofficial plugin for calling the Silero VAD ONNX model in Flutter.

Using Flutter_Silero_VAD #5

Open MediGenie opened 9 months ago

MediGenie commented 9 months ago

Hello! thanks for the plugin!

I was wondering how hard it would be to create an API like this using Flutter: https://github.com/ricky0123/vad. Would it work out of the box? Sorry if this question is basic.

thank you! KJ

char5742 commented 9 months ago

Thank you for your interest in flutter_silero_vad!

I have updated the README.md to include a description of how it works. If there are any points that are unclear, please feel free to ask further questions.

MediGenie commented 9 months ago

Thank you! 🙏🏻

MediGenie commented 9 months ago

Hey @char5742!

Once again, thank you for the helpful "How it works" tutorial. I'm referencing it because I have been using the web VAD and am now trying to carry my parameters/settings over to your code (https://wiki.vad.ricky0123.com/docs/user/algorithm#configuration).

Parameters using ricky0123/vad:

  * positiveSpeechThreshold: number - determines the threshold over which a probability is considered to indicate the presence of speech.
  * negativeSpeechThreshold: number - determines the threshold under which a probability is considered to indicate the absence of speech.
  * redemptionFrames: number - number of speech-negative frames to wait before ending a speech segment.
  * frameSamples: number - the size of a frame in samples - 1536 by default and probably should not be changed.
  * preSpeechPadFrames: number - number of audio frames to prepend to a speech segment.
  * minSpeechFrames: number - minimum number of speech-positive frames for a speech segment.

Your code:

For the initialize method, the arguments are as follows:

  * modelPath: The path to the Silero VAD onnx model.
  * sampleRate: The sample rate of the audio file you want to detect.
  * frameSize: The size of the segment to detect (Silero VAD is trained with 30ms).
  * threshold
  * minSilenceDurationMs: After it becomes silent, this duration will be included in the detection segment.
  * speechPadMs: Currently not in use.

About resetState: Since Silero VAD is an RNN, the model has a state. Calling resetState will reset the model's state.
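
For context, here is a minimal sketch of how I currently understand these pieces fitting together at runtime. The values are placeholders, and the import path and the frame stream are my own assumptions, not something from the plugin docs:

```dart
import 'dart:typed_data';

import 'package:flutter_silero_vad/flutter_silero_vad.dart'; // import path assumed

Future<void> runVad(String modelPath, Stream<Float32List> frames) async {
  final vad = FlutterSileroVad();

  // Initialize once with the arguments listed above (placeholder values).
  await vad.initialize(
    modelPath: modelPath,
    sampleRate: 16000,
    frameSize: 30, // unit asked about below
    threshold: 0.5,
    minSilenceDurationMs: 300,
    speechPadMs: 0, // currently not in use
  );

  // Feed one frame of float samples at a time; predict reports whether the
  // frame contains speech.
  await for (final frame in frames) {
    final isSpeech = await vad.predict(frame);
    if (isSpeech == true) {
      // speech detected in this frame
    }
  }

  // Silero VAD is an RNN, so reset the model state between sessions.
  vad.resetState();
}
```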

Question:

  1. Is each frameSize in 100ms for your code? What are the units?
  2. Is there a way I can "create" similar parameters using your code? Thank you so much again!

Let me know if you have a Patreon! I would love to donate some coffee!

Greetings from Seoul! KJ

char5742 commented 9 months ago

Hello, @Med!

I'm glad you found the "How it works" tutorial helpful! Regarding your questions about transitioning your parameters/settings to the Silero VAD code:

  1. frameSize: In this code, the frameSize parameter is the duration of the audio window the model analyzes at once, and its unit is milliseconds (ms). For example, the C++ sample uses a frameSize of 64. This value is a trade-off: increasing it can improve accuracy but also lengthens processing time. In the documentation you linked, frameSamples is 1536 samples, which at a 16 kHz sample rate corresponds to 96 ms (see the sketch after this list).

  2. Creating similar parameters: positiveSpeechThreshold corresponds to threshold. negativeSpeechThreshold is not exposed; it is hardcoded to 0.15 less than threshold. redemptionFrames corresponds to minSilenceDurationMs (note the unit change from frames to milliseconds). frameSamples corresponds to frameSize (samples versus milliseconds, so convert using the sample rate). Parameters like preSpeechPadFrames and minSpeechFrames are not handled directly in this library; to incorporate such features, you would need to implement additional logic outside the current library functions. A rough mapping is sketched below.
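
To make the mapping concrete, here is a minimal Dart sketch of deriving the initialize arguments from ricky0123/vad-style settings. The numeric values are examples only, and the import path and helper function are assumptions rather than part of either library:

```dart
import 'package:flutter_silero_vad/flutter_silero_vad.dart'; // import path assumed

Future<void> configureVad(String modelPath) async {
  // ricky0123/vad-style settings (example values only).
  const positiveSpeechThreshold = 0.5;
  const redemptionFrames = 8;
  const frameSamples = 1536; // frame length in samples
  const sampleRate = 16000; // Hz

  // flutter_silero_vad takes the frame size in milliseconds,
  // so convert samples -> ms: 1536 / 16000 s = 96 ms.
  final frameSizeMs = frameSamples * 1000 ~/ sampleRate; // 96

  // redemptionFrames counts frames, while minSilenceDurationMs is a duration,
  // so multiply by the frame duration: 8 * 96 ms = 768 ms.
  final minSilenceDurationMs = redemptionFrames * frameSizeMs; // 768

  final vad = FlutterSileroVad();
  await vad.initialize(
    modelPath: modelPath,
    sampleRate: sampleRate,
    frameSize: frameSizeMs,
    threshold: positiveSpeechThreshold, // plays the role of positiveSpeechThreshold
    minSilenceDurationMs: minSilenceDurationMs,
    speechPadMs: 0, // currently unused
  );
  // negativeSpeechThreshold has no argument here; it is derived internally
  // as threshold - 0.15.
}
```

preSpeechPadFrames and minSpeechFrames would still need your own buffering logic around predict, as noted above.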

I hope this helps you configure your Silero VAD setup accordingly.

I'm glad to hear that you would consider supporting my work in such a generous way. However, I don't have a Patreon account at the moment, so please just take my assistance as a goodwill gesture. Your appreciation is more than enough for me!

Let me know if you have any further questions or need additional assistance!

MediGenie commented 9 months ago

wow, super. thank you so much for your detailed feedback. I am extremely grateful.

I was wondering if your library has a way to detect or cancel audio output from the device. There is a feedback loop where the sound of someone speaking from the phone's speaker feeds back into the VAD.

thank you so much! KJ

char5742 commented 8 months ago

The feedback loop issue you mentioned can indeed be addressed with echo cancellation technology. This technology is used to prevent echoes and feedback caused by sound from the speakers entering the microphone. I have a repository named audio_streamer that includes samples of implementing echo cancellation. This might be helpful in resolving your issue.
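
For iOS specifically, one common way to get the platform's acoustic echo cancellation from Flutter is to configure the audio session for voice chat. Below is a minimal sketch using the audio_session package (the same configuration shows up in the recorder code shared later in this thread); treat it as an illustration rather than the exact implementation used in audio_streamer:

```dart
import 'package:audio_session/audio_session.dart';

/// Configures the audio session so the OS applies echo cancellation:
/// voiceChat mode on iOS, voiceCommunication usage on Android.
Future<void> enableEchoCancellation() async {
  final session = await AudioSession.instance;
  await session.configure(AudioSessionConfiguration(
    avAudioSessionCategory: AVAudioSessionCategory.playAndRecord,
    // voiceChat activates Apple's built-in echo cancellation.
    avAudioSessionMode: AVAudioSessionMode.voiceChat,
    androidAudioAttributes: const AndroidAudioAttributes(
      contentType: AndroidAudioContentType.speech,
      usage: AndroidAudioUsage.voiceCommunication,
    ),
    androidAudioFocusGainType: AndroidAudioFocusGainType.gain,
  ));
}
```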

I wanted to take a moment to express my heartfelt gratitude for your support through GitHub Sponsors.

MediGenie commented 8 months ago

Hey, thank you so much for this. I am testing the basic app, but the demo app is unable to detect the sound... (maybe the minimum threshold decibel is set too high? or something with the audio input, you think?) It seems like the Flutter VAD was already using an audio streamer of some sort, and what I did was replace it with the one you shared with me 4 days ago. Can you share some tips or help me debug what might be the issue? Thank you.

char5742 commented 8 months ago

I apologize, but I have not been able to reproduce the issue on an iPhone 12 with iOS 17.2.1 and a Pixel 6a with Android 14.

Could you please share more details about your environment?

MediGenie commented 8 months ago

Hello! I am running a Galaxy Note 8 on Android 9, and an iPhone 8 Plus on iOS 16.7.5. I am sending you recorder.dart below.

```dart
final recorder = AudioStreamer.instance;
final vad = FlutterSileroVad();
Future<String> get modelPath async =>
    '${(await getApplicationSupportDirectory()).path}/silero_vad.onnx';
final sampleRate = 16000;
final frameSize = 40; // 80ms

/// Bits per sample
final int bitsPerSample = 16;

/// Number of channels
final int numChannels = 1;

bool isInited = false;

/// Stores the immediately preceding audio data
final lastAudioData = <int>[];

/// Used to save the voice data a short while after speech stops
DateTime? lastActiveTime;
final processedAudioStreamController = StreamController<List<int>>();
StreamSubscription<List<int>>? recordingDataSubscription;
StreamSubscription<List<int>>? processedAudioSubscription;

AudioPlayer audioPlayer = AudioPlayer();
bool isLoading = false;
bool isPlaying = false;
bool isThinking = false;
AppState _currentStatus = AppState.standby;
String _responseText = ''; // Stores the server response text

AppState get currentStatus => _currentStatus;
String get responseText => _responseText;

final frameBuffer = <int>[];

Future init() async {
  var status = await Permission.microphone.request();
  if (status != PermissionStatus.granted) {
    throw Exception('Microphone permission not granted');
  }

  isInited = true;
}

Future record(StreamController<List<int>> controller, [bool echoCancellation = true]) async {
  assert(isInited);

final session = await AudioSession.instance;
await session.configure(AudioSessionConfiguration(
  avAudioSessionCategory: AVAudioSessionCategory.playAndRecord,
  avAudioSessionCategoryOptions: AVAudioSessionCategoryOptions.allowBluetooth | AVAudioSessionCategoryOptions.defaultToSpeaker,
  // Setting the iOS session mode to voiceChat enables echo cancellation.
  avAudioSessionMode: echoCancellation ? AVAudioSessionMode.voiceChat : AVAudioSessionMode.defaultMode,
  avAudioSessionRouteSharingPolicy: AVAudioSessionRouteSharingPolicy.defaultPolicy,
  avAudioSessionSetActiveOptions: AVAudioSessionSetActiveOptions.none,
  androidAudioAttributes: const AndroidAudioAttributes(
    contentType: AndroidAudioContentType.speech,
    flags: AndroidAudioFlags.none,
    usage: AndroidAudioUsage.voiceCommunication,
  ),
  androidAudioFocusGainType: AndroidAudioFocusGainType.gain,
  androidWillPauseWhenDucked: true,
));

await recorder.startRecording(echoCancellation ? 7 : 0);
await onnxModelToLocal();
await vad.initialize(
  modelPath: await modelPath,
  sampleRate: sampleRate,
  frameSize: frameSize,
  threshold: 0.2,
  minSilenceDurationMs: 500,
  speechPadMs: 0,
);

// Cancel any existing subscriptions.
await recordingDataSubscription?.cancel();
await processedAudioSubscription?.cancel();

recordingDataSubscription = recorder.audioStream.listen((buffer) async {
  //debugPrint('buffer length: ${buffer.length}');
  final data = _transformBuffer(buffer);
  if (data.isEmpty || isThinking) return;
  frameBuffer.addAll(buffer);
  while (frameBuffer.length >= frameSize * 2 * sampleRate ~/ 1000) {
    final b = frameBuffer.take(frameSize * 2 * sampleRate ~/ 1000).toList();
    frameBuffer.removeRange(0, frameSize * 2 * sampleRate ~/ 1000);
    await _handleProcessedAudio(b);
  }
  controller.add(data);
});

processedAudioSubscription = processedAudioStreamController.stream.listen((buffer) async {
  if (isPlaying || isThinking) return;
  String outputPath = '${(await getApplicationDocumentsDirectory()).path}/output.wav';
  double duration = saveAsWav(buffer, outputPath);
  debugPrint('duration == $duration');
  if (duration < 0.4) return;
  debugPrint('saved == $outputPath');
  _currentStatus = AppState.thinking;
  isThinking = true;
  String responseAudio = await _uploadFile(outputPath, "");
  if (isLoading) {
    isLoading = false;
    _currentStatus = AppState.speaking;
    await playAudio(responseAudio);
  }
});

}

Future stopRecorder() async {
  await recorder.startRecording();
  if (recordingDataSubscription != null) {
    await recordingDataSubscription?.cancel();
    recordingDataSubscription = null;
    await processedAudioSubscription?.cancel();
    processedAudioSubscription = null;
  }
}

Int16List _transformBuffer(List<int> buffer) {
  final bytes = Uint8List.fromList(buffer);
  return Int16List.view(bytes.buffer);
}

void printVolume(List<int> data) {
  // PCM data is 16-bit (2 bytes), so process it in 2-byte units.
  double sum = 0;
  for (var i = 0; i < data.length; i += 2) {
    final int16 = data[i] + (data[i + 1] << 8); // 16-bit PCM sample
    final double sample = int16 / (1 << 15); // normalize to the -1..1 range
    sum += sample * sample; // sum of squares
  }

  final double rms = sqrt(sum / (data.length / 2)); // RMS
  final double volume = 20 * log(rms) / ln10; // convert to decibels

  debugPrint('Volume: $volume dB');
}

static const threshold = 900; // This threshold needs tuning depending on the voice level
static const bufferTimeInMilliseconds = 700;
final audioDataBuffer = <int>[];

Future _handleProcessedAudio(List<int> buffer) async {
  final transformedBuffer = _transformBuffer(buffer);
  final transformedBufferFloat =
      transformedBuffer.map((e) => e / 32768).toList();

final isActivated = await vad.predict(Float32List.fromList(transformedBufferFloat));
//debugPrint(isActivated.toString());
if (isActivated == true) {
  if (!isPlaying) {
    _currentStatus = AppState.listening;
  }
  lastActiveTime = DateTime.now();
  audioDataBuffer.addAll(lastAudioData);
  lastAudioData.clear();
  audioDataBuffer.addAll(buffer);
  if (isPlaying) {
    isPlaying = false;
    audioPlayer.stop();
  }
} else if (lastActiveTime != null) {
  audioDataBuffer.addAll(buffer);
  debugPrint(DateTime.now().difference(lastActiveTime!).toString());
  // Save the voice data once a certain amount of time has passed
  if (DateTime.now().difference(lastActiveTime!) > const Duration(milliseconds: bufferTimeInMilliseconds)) {
    processedAudioStreamController.add([...audioDataBuffer]);
    audioDataBuffer.clear();
    lastActiveTime = null;
  }
} else {
  // No speech detected
  lastAudioData.addAll(buffer);
  // Keep 5 seconds' worth of data
  final threshold = sampleRate * 500 ~/ 1000;
  if (lastAudioData.length > threshold) {
    lastAudioData.removeRange(0, lastAudioData.length - threshold);
  }
}

}
```
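
For reference on the arithmetic in the recording loop above: frameSize is in milliseconds and the stream carries 16-bit PCM (2 bytes per sample), so the frame length in bytes works out as below, using the values from this snippet:

```dart
void main() {
  const sampleRate = 16000; // samples per second
  const frameSize = 40; // milliseconds per frame

  final samplesPerFrame = frameSize * sampleRate ~/ 1000; // 40 ms -> 640 samples
  final bytesPerFrame = samplesPerFrame * 2; // 16-bit PCM -> 1280 bytes

  print('$samplesPerFrame samples, $bytesPerFrame bytes per frame');
}
```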

I am using this recorder.dart in the following widget.

```dart
Widget build(BuildContext context, WidgetRef ref) {
  final recorderServiceState = ref.watch(recorderServiceProvider);
  final appState = useState(AppState.standby);
  final responseText = useState("");

String lottieFile = 'assets/lottie/lottie1.json';
String statusText = "I am waiting for you to speak...";

_currentStatus = recorderServiceState.currentStatus;

switch (_currentStatus) {
  case AppState.standby:
    lottieFile = 'assets/lottie/lottie1.json';
    statusText = "I am waiting for you to speak...";
    break;
  case AppState.listening:
    lottieFile = 'assets/lottie/lottie2.json';
    statusText = "I'm listening";
    break;
  case AppState.thinking:
    lottieFile = 'assets/lottie/lottie3.json';
    statusText = "I'm thinking";
    break;
  case AppState.speaking:
    lottieFile = 'assets/lottie/lottie4.json';
    statusText = 'I am speaking';
    break;
}

//debugPrint(recorderServiceState.currentStatus.toString());

final controller = useStreamController<List<int>>();
final spots = useState<List<int>>([]);
useOnAppLifecycleStateChange((beforeState, currState) {
  if (currState == AppLifecycleState.resumed) {
    ref.read(recorderServiceProvider).record(controller);
  } else if (currState == AppLifecycleState.paused) {
    ref.read(recorderServiceProvider).stopRecorder();
  }
});
useEffect(() {
  // Fetch the token, then initialize the recorder and start recording
  Future<void> initializeAndStartRecording() async {
    if (SharedPreferencesManager.getString(TOKEN) == null) {
      // Generate a UUID
      var uuid = const Uuid();
      final String identifier = uuid.v4();
      var token = await NetworkManager().fetchToken(identifier);
      if (token != null) {
        debugPrint("Network Token: $token");
        await SharedPreferencesManager.setString(TOKEN, token);
        try {
          // Initialize recorderService and start recording
          await ref.read(recorderServiceProvider).init();
          debugPrint("Recorder initialized");
          await ref.read(recorderServiceProvider).record(controller);
          debugPrint("Recorder started");
        } catch (e) {
          debugPrint("Error initializing or starting the recorder: $e");
        }
      } else {
        debugPrint("Token fetching failed");
        return;
      }
    } else {
      debugPrint("Token: ${SharedPreferencesManager.getString(TOKEN)!}");
      NetworkManager().setBearerToken(SharedPreferencesManager.getString(TOKEN)!);
      try {
        // Initialize recorderService and start recording
        await ref.read(recorderServiceProvider).init();
        debugPrint("Recorder initialized");
        await ref.read(recorderServiceProvider).record(controller);
        debugPrint("Recorder started");
      } catch (e) {
        debugPrint("Error initializing or starting the recorder: $e");
      }
    }
  }

  // Run the function
  initializeAndStartRecording();

  // Set up a listener on the controller's stream
  final subscription = controller.stream.listen((event) {
    final buffer = event.toList();
    spots.value = buffer;
  });

  // Set up a listener
  final listener = ref.listen(recorderServiceProvider, (_, state) {
    appState.value = state.currentStatus;
    responseText.value = state.responseText; // assign an empty string if null
  });

  // Return a cleanup function to run when the component is unmounted
  return () {
    subscription.cancel(); // cancel the stream subscription
    listener; // cancel the listener subscription
  };
}, []); // The dependency array is empty, so this runs only once, when the component is mounted.
```

Thank you so much!

usilitel commented 3 months ago

@char5742 can you please give an example of how to correctly read a .wav file? I cannot get it working. If threshold <= 0.9, vad.predict always returns true; if threshold >= 0.95, vad.predict always returns false.