Caldarie / flutter_tflite_audio

Audio classification Tflite package for flutter (iOS & Android). Can support Google Teachable Machine models
MIT License
63 stars, 24 forks

It always matches the sound to the first category listed #10

Closed: eyalis closed this issue 2 years ago

eyalis commented 3 years ago

When I try it, it always interprets the sound as the first category listed in the file

Caldarie commented 3 years ago

Hi @eyalis,

To better assist with your problem, can you provide more information?

  1. Are you using an emulator or a device?
  2. Are you running the app on iOS or Android?
  3. Are you using your own model or the example provided?
  4. What is your bufferSize?

eyalis commented 3 years ago

Thank you @Caldarie for your quick answer, here it goes:

Describe the problem: When I try the example it always matches the first word of the category file.

  1. Are you using an emulator or a device?: I'm using a device, a Samsung Galaxy S7 (the app requested permissions and I granted them).
  2. Are you running the app on iOS or Android?: I'm running it on Android.
  3. Are you using your own model or the example provided?: I used the example provided first and got this problem. Then I replaced the example model with my own model and got the same problem.
  4. What is your bufferSize?: 11050 (I'm using a Teachable Machine model).

Thanks for your help!

Caldarie commented 3 years ago

Hmm, in that case:

  1. Have you tried adjusting your bufferSize? For example, 22050, 5000, etc.
  2. Can you tell me your sample rate?
  3. Do you still get the same result if you run the “decodedWav” model in the example?

eyalis commented 3 years ago

Thanks, so:

  1. Have you tried adjusting your bufferSize? For example, 22050, 5000, etc.: Yes, I tried 22050, 5000, 7000, 30000, etc.

  2. Can you tell me your sample rate?: I'm using the one that comes with the example, which is 44100.

  3. Do you still get the same result if you run the “decodedWav” model in the example?: No, that one works pretty well!

Caldarie commented 3 years ago

Looking at your answers, we can narrow it down to one possibility.

I think the most likely reason you’re always getting the first category is that your model isn’t trained well enough to distinguish your desired sound from other sounds or background noise.

Make sure that you increase the number of training examples and remove any prolonged silences. (The goal is to make each sound as distinct as possible.) Also, for your background noise class, you can add variations such as white noise, ambient noise, etc. Try not to mix your voice into the background noise.

As for why the GTM model in the example is not working for you: to be honest, it was hastily put together with minimal training examples, so unless you say the words very distinctly, it will not register.

eyalis commented 3 years ago

Thank you for your answer. My model works well in Teachable Machine (I have tested it with the tool they provide on the platform), so I think it may be another problem, maybe related to my device configuration or something like that.

Thank you anyway. I'll keep trying, and if I find out what the problem/solution was, I'll let you know!

Caldarie commented 3 years ago

Ah, I see. In that case, maybe you can tinker with the sample rate. I heard that some people had better results with 16000.

Caldarie commented 3 years ago

Hi @eyalis,

I've updated the package (0.1.7) so that you can now adjust the model's detection threshold. I believe this should help fix the issue you are having. For example:

If your model is frequently matching the first category, simply adjust detectionThreshold to a lower value.

Example of how to implement this:

final String model = 'assets/google_teach_machine_model.tflite';
final String label = 'assets/google_teach_machine_label.txt';
final String inputType = 'rawAudio';
final int sampleRate = 44100;
final int recordingLength = 44032;
final int bufferSize = 22050;
final int numOfInferences = 1;
final double detectionThreshold = 0.4; // Adjust this value here

Let me know if this fixes your problem.

eyalis commented 3 years ago

Hi @Caldarie, thank you so much. I'll try it ASAP and let you know!

cmalbuquerque commented 3 years ago

@Caldarie the same is happening to me. The inference always returns the first category 😕. I tested with both versions 0.6.2 and 0.7.1.

Caldarie commented 3 years ago

Hi @cmalbuquerque,

For testing purposes, can you tell me if you still see the same problem after lowering your detectionThreshold to maybe 0?

cmalbuquerque commented 3 years ago

@Caldarie yes, I am using the following settings:

result = TfliteAudio.startAudioRecognition(
  numOfInferences: 1,
  inputType: 'rawAudio',
  sampleRate: 16000,
  recordingLength: 44032,
  bufferSize: 2000,
  detectionThreshold: 0,
);

Both my sample rate and my bufferSize are low, but I need these values in order to increase the recording time. I will try another model with more distinct sounds and then leave feedback here.

Caldarie commented 3 years ago

@cmalbuquerque

No problem at all. Let me know the results of your other model.

Just be aware that the detection threshold ignores any prediction whose probability doesn’t exceed the set value. For example:

If your detection threshold is set to 0.5 (50%) and your model predicts the audio label “yes” at 0.4 (40%) and “no” at 0.1 (10%), the prediction will be ignored. That is what produces the problem where the result always matches the first label.

As mentioned earlier in this issue (#10) to @eyalis, it’s highly probable that the model isn’t trained well enough, and you might need to train it on more data to rectify this.

Alternatively, you can lower your detection threshold, but I don’t advise going any lower than 0.30, as you would be sacrificing your model's performance.
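To make the threshold behaviour described above concrete, here is a minimal sketch in plain Dart. It is not the package's actual code: `pickLabel` is a hypothetical helper, the scores are illustrative numbers, and falling back to the first label when nothing passes the threshold is an assumption based only on the symptom reported in this thread.

```dart
/// Hypothetical helper, not part of flutter_tflite_audio.
/// Returns the best-scoring label if its score exceeds [threshold];
/// otherwise falls back to the first label (assumed behaviour).
String pickLabel(List<String> labels, List<double> scores, double threshold) {
  var bestIndex = 0;
  for (var i = 1; i < scores.length; i++) {
    if (scores[i] > scores[bestIndex]) bestIndex = i;
  }
  if (scores[bestIndex] > threshold) return labels[bestIndex];
  // Nothing was confident enough: the prediction is ignored.
  return labels.first;
}

void main() {
  // Illustrative labels and scores, not taken from a real model.
  final labels = ['background', 'yes', 'no'];
  final scores = [0.30, 0.45, 0.25];

  print(pickLabel(labels, scores, 0.5)); // background
  print(pickLabel(labels, scores, 0.4)); // yes
}
```

With the threshold at 0.5 the best score (0.45) is rejected and the first label comes back; lowering it to 0.4 lets "yes" through.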

cmalbuquerque commented 3 years ago

@Caldarie, many thanks for the help and the detailed explanation! I used another model with detectionThreshold: 0.4 and it works perfectly 😁

Caldarie commented 3 years ago

@cmalbuquerque

I’m very glad to hear that! 😁

Caldarie commented 3 years ago

It has come to my attention that this problem may also be due to bufferSize.

It's possible that a higher bufferSize doesn't allow your device enough time to record, hence the problem where it always matches the first category. To fix this, simply adjust the bufferSize to a lower value.

Likewise, if your bufferSize is too low, the recording length will be too long and your model may register the input as background noise. In that case, simply adjust the bufferSize to a higher value.
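When tuning these values, it can help to work out what they mean in seconds and in buffer reads. The sketch below is plain arithmetic in Dart using the settings quoted in this thread. The helper names are hypothetical, and it assumes that the recording is filled by reading up to bufferSize samples at a time, which is not confirmed here.

```dart
/// Seconds of audio covered by [recordingLength] samples at [sampleRate] Hz.
double recordingSeconds(int recordingLength, int sampleRate) =>
    recordingLength / sampleRate;

/// Number of reads needed to collect [recordingLength] samples when each
/// read delivers up to [bufferSize] samples (integer division, rounded up).
int buffersNeeded(int recordingLength, int bufferSize) =>
    (recordingLength + bufferSize - 1) ~/ bufferSize;

void main() {
  // recordingLength 44032 at 44100 Hz: just under one second of audio.
  print(recordingSeconds(44032, 44100));
  // The same recordingLength at 16000 Hz covers 2.752 seconds.
  print(recordingSeconds(44032, 16000));

  print(buffersNeeded(44032, 22050)); // 2
  print(buffersNeeded(44032, 2000)); // 23
}
```

Under that assumption, lowering the sample rate stretches the same recordingLength over more time, and a small bufferSize means many more reads per inference.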

amardeep-singh97 commented 3 years ago

@Caldarie Thank you for this awesome package. I am facing the same issue. I tried all of the above solutions and none of them worked. I think I might have made a mistake while training the model on GTM. Can you suggest what the "delay" and "duration" values should be for each class while training?

Caldarie commented 3 years ago

Hi @amardeep-singh97,

To assist you with your problem, would it be possible for you to share your model?

amardeep-singh97 commented 3 years ago

@Caldarie Yes, here you go. With this model we should get class 2 when saying "Help". Attachment: converted_tflite (1).zip

Caldarie commented 3 years ago

Hi @amardeep-singh97

I ran through your example, and it seems that you were right about your model's accuracy (even after tuning the parameters). However, I have noticed that your model does at times pick up the sound after a slight delay, but performs poorly when I speak too quickly or too slowly.

  1. In that case, may I ask how many samples you trained it on?
  2. Have you considered training on examples where you speak at different points in the recording length? For example, "help" at around the 1.50-second or 0.25-second mark.
  3. You may consider adding background noise to the 'help' samples to help your model differentiate them from 'just' background noise.

Let me know how this works for you.

amardeep-singh97 commented 3 years ago

@Caldarie I think so too. I trained it with 60 background noise samples and 40 "help" samples. The sample size is 1 second on GTM. Can you suggest improvements?

Caldarie commented 3 years ago

@amardeep-singh97

Let me run some tests with the source code. As mentioned above, it seems that many people are experiencing the same problem as you.

Oddly enough, this seems to only be a problem with GTM models.

amardeep-singh97 commented 3 years ago

@Caldarie Okay. Thank you for your support. Let us know what you find.

Caldarie commented 3 years ago

Hi @amardeep-singh97,

I have tested my own GTM model provided in this repository, and had much better results with the following values:

final double detectionThreshold = 0.3;
final int averageWindowDuration = 0; 
final int minimumTimeBetweenSamples = 0;
final int suppressionTime = 0;
final int minimumCount = 0;

Unfortunately, I haven't had much success with the model you provided. On average, it seems to assign only around 20% confidence to the voice input "help". I think the most likely reason your model has difficulty telling "help" apart from background noise is that the spoken word fills only a small part of the recording length, leaving a lot of empty space. For example:

[Screenshot: audio input takes up a small portion of the recording length (Screen Shot 2021-07-17 at 19 39 09)]

[Screenshot: background noise (Screen Shot 2021-07-17 at 19 39 14)]

As you can see, the first picture is very similar to the second, apart from the small portion of audio input. Have you tried trimming this space?

amardeep-singh97 commented 3 years ago

@Caldarie I am training a new model with more variation in the data. I will let you know if I face any issues. I'll also try the arguments from your solution.

Caldarie commented 3 years ago

@amardeep-singh97 keep us updated! :)

fabian-rump commented 3 years ago

I have the same problem but only on one device. I created a model with Google Teachable Machine and tested this on two devices:

Samsung Galaxy S9 Plus: The first label is always detected here. The logged raw scores are: [NaN, NaN, NaN]

Samsung Galaxy S20: The detection works perfectly here.

Both were tested under the same conditions.

Any suggestions?

Caldarie commented 3 years ago

Hi @fabian-rump,

That’s a very peculiar problem.

Does the S20 output NaNs from time to time?

As for the S9, does it always output NaNs, or does it output a score from time to time?

fabian-rump commented 3 years ago

The S20 outputs a NaN in one of hundreds of cases. The S9 always outputs NaN. I haven't yet been able to get a score on the S9.
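One way all-NaN scores could lead to "always the first label" is a plain argmax loop: every comparison involving NaN is false, so the running best index never moves from 0. The Dart sketch below illustrates that general floating-point behaviour with hypothetical helpers (`naiveArgmax`, `hasNaN`). Whether this is what happened inside the package is not confirmed in this thread.

```dart
/// Naive argmax: index of the largest score, starting from index 0.
int naiveArgmax(List<double> scores) {
  var best = 0;
  for (var i = 1; i < scores.length; i++) {
    // Any comparison involving NaN is false, so `best` stays at 0
    // when every score is NaN.
    if (scores[i] > scores[best]) best = i;
  }
  return best;
}

/// True when at least one score is NaN, meaning the result is unreliable.
bool hasNaN(List<double> scores) => scores.any((s) => s.isNaN);

void main() {
  final s9Scores = [double.nan, double.nan, double.nan];
  print(naiveArgmax(s9Scores)); // 0
  print(hasNaN(s9Scores)); // true
}
```

Checking for NaN before trusting the label at least separates "the model chose the first category" from "the model produced no usable scores".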

Caldarie commented 2 years ago

Hey guys, I have found the cause and fixed the problem. It seems it was right under my nose. I will now close the issue.

fradricast commented 1 year ago

How can I get the raw score and the recognition result from TfliteAudio at the same time?