espressif / esp-tflite-micro

TensorFlow Lite Micro for Espressif Chipsets
Apache License 2.0

Sound detection - Is this possible? (TFMIC-16) #74

Open gamename opened 9 months ago

gamename commented 9 months ago

Hi,

I'm using an esp32-s3-eye v2.2. It has 8MB each of flash and PSRAM. Is it possible to use yamnet.tflite on an esp32-s3-eye v2.2 for sound identification? The yamnet.tflite file is about 3.9M in size.

The chip has an SD card slot, so I can use it to load the model file (i.e., no need to convert it to a .cc file with xxd).

Thoughts?

vikramdattu commented 9 months ago

@gamename going by the size of the model, I believe it is already quantised. If not, I would suggest you quantise it to int8 weights. That will reduce the model size to about 1/4th. Have you tried it with .cc first? If size is your concern, converting to .cc doesn't really increase the size of the model once the file gets embedded into the application: the .cc file looks larger as text, but the array it compiles into is still the same size as the model.
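To illustrate that last point, here is a small sketch (not this repo's tooling) that emulates an `xxd -i`-style conversion on a fake byte buffer: the generated C source text is several times larger than the binary, but the array it declares still holds exactly one byte per model byte, so the compiled footprint is unchanged.

```python
# Sketch: emulate an `xxd -i`-style conversion on a fake "model" buffer.
# The byte values here are made up; a real model would come from a .tflite file.
model = bytes(range(256)) * 16  # pretend 4 KiB model

# xxd-style C source: "0x00, 0x01, ..." textual representation of each byte
c_text = ", ".join(f"0x{b:02x}" for b in model)
c_source = (
    f"unsigned char g_model[] = {{{c_text}}};\n"
    f"unsigned int g_model_len = {len(model)};\n"
)

print(len(model))                   # bytes the compiled array will occupy
print(len(c_source))                # bytes of .cc source text (much larger)
print(len(c_source) / len(model))   # text is roughly 6x larger, compiled size is 1x
```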

About the SD card: unfortunately, I have not tried this approach myself, but it's definitely worth a try IMO. When loading from the SD card, however, it makes sense not to convert to .cc, as you suggest.

Let me know how it goes. If you need further help or want me to try, do let me know.

gamename commented 9 months ago

@vikramdattu

What process did you use to build the yes_micro_features_data.cc file? I'm not referring to the xxd conversion. I'm referring to everything up to that. :)

The reason I ask is the C array in yes_micro_features_data.cc is tiny. I would like to replicate that size for my cat meow identification too.

Thanks -T

vikramdattu commented 9 months ago

Hi @gamename, this is test data and I took it long back from Google's tflite-micro. Currently, the feature generation happens via a different model in this file, and the features are then fed to the detection model.

The tools here can help you train your own model, evaluate it, and convert it.
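For context on what a feature-generation model produces: the micro_speech frontend turns one second of 16 kHz audio into a small spectrogram (49 frames in the stock example). A rough numpy sketch of the framing and FFT step is below; the real frontend additionally applies a mel filterbank, log scaling, and int8 quantisation, which are omitted here.

```python
import numpy as np

# Sketch of spectrogram framing, loosely following the micro_speech
# frontend: 30 ms windows with a 20 ms stride over 1 s of 16 kHz audio.
SAMPLE_RATE = 16000
WINDOW = 480   # 30 ms at 16 kHz
STRIDE = 320   # 20 ms at 16 kHz

def spectrogram(audio: np.ndarray) -> np.ndarray:
    """Return one FFT-magnitude row per frame; shape (frames, WINDOW // 2 + 1)."""
    n_frames = 1 + (len(audio) - WINDOW) // STRIDE
    frames = np.stack(
        [audio[i * STRIDE:i * STRIDE + WINDOW] for i in range(n_frames)]
    )
    frames = frames * np.hanning(WINDOW)        # taper each window
    return np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrum

one_second = np.random.default_rng(0).standard_normal(SAMPLE_RATE)
feats = spectrogram(one_second)
print(feats.shape)  # (49, 241): 49 frames, matching micro_speech's frame count
```

Nothing in this step knows anything about speech: any one-second clip, "meow" included, goes through the same transform.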

gamename commented 9 months ago

> Hi @gamename this is test data and I had taken it long back from googles's tflite-micro. Currently, the feature generation happens via a different model in this file and the features are then fed to detection model.
>
> The tools here can help you train your own model, evaluate it and convert it.

Thank you, sir.

gamename commented 8 months ago

@vikramdattu

Is your pre-processor model taken from here?

The reason I ask is that the pre-processor should work for "meow" as well as human speech. It just generates spectrograms, which is payload-agnostic (i.e., it makes a spectrogram of a sound and doesn't care what the sound is). Correct?

Thanks, -T

vikramdattu commented 8 months ago

@gamename that's right, the model is taken from that particular location.

gamename commented 8 months ago

> @gamename that's right, the model is taken from that particular location.

Perfect. Thanks.

gamename commented 8 months ago

@vikramdattu

For the micro_speech example, what is the purpose of having yes_micro_features_data.cc/h and no_micro_features_data.cc/h in the directory? Are they there for reference? They don't seem to be used, or am I missing something?

vikramdattu commented 8 months ago

@gamename you are correct. Those were added for testing in the early days and are not currently used. You may ignore them.

gamename commented 8 months ago

> @gamename you are correct. Those were there from old days added for testing and are not used currently. You may ignore those.

Thanks!

gamename commented 8 months ago

@vikramdattu

This concerns building the actual model. I am using a script here that is just a compilation of the steps outlined here.

Here is what my input dir with samples looks like:

```
tree ./samples
./samples
├── _background_noise_
│   ├── README.md
│   ├── doing_the_dishes.wav
│   ├── dude_miaowing.wav
│   ├── exercise_bike.wav
│   ├── pink_noise.wav
│   ├── running_tap.wav
│   └── white_noise.wav
└── meow
    ├── cat0001.wav
    ├── cat0002.wav
    ...
```

(there are 77 total cat .wav files)

I'm confused about what needs to be in there. Do I need to add a silence and unknown subdir (with contents) as well?

Thanks -T
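For what it's worth, assuming the training tooling follows the TensorFlow speech_commands pipeline that the micro_speech docs are based on: you normally do not create silence or unknown directories yourself. Silence clips are cut from `_background_noise_`, and any word directory not listed in the wanted words is sampled as "unknown". A sketch of how that pipeline derives its label list (names here are illustrative, mirroring `input_data.prepare_words_list` in TensorFlow's examples, not this repo's code):

```python
# Sketch of label derivation in a speech_commands-style pipeline.
# Silence and unknown are synthesized by the pipeline itself; you only
# provide directories for the wanted words plus _background_noise_.
SILENCE_LABEL = "_silence_"
UNKNOWN_LABEL = "_unknown_"

def prepare_words_list(wanted_words):
    """Silence and unknown are always prepended; no folders are made for them."""
    return [SILENCE_LABEL, UNKNOWN_LABEL] + list(wanted_words)

print(prepare_words_list(["meow"]))       # ['_silence_', '_unknown_', 'meow']
print(prepare_words_list(["yes", "no"]))  # ['_silence_', '_unknown_', 'yes', 'no']
```

One caveat: with a single wanted word and no other word folders, the "unknown" bucket has nothing to draw from, so you may need extra non-meow sound folders for it to sample.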

gamename commented 8 months ago

@vikramdattu

Another question. :)

Looking at this construct:


```c++
constexpr int kCategoryCount = 4;
constexpr const char* kCategoryLabels[kCategoryCount] = {
    "silence",
    "unknown",
    "yes",
    "no",
};
```

...how do you know what the order of the labels ("silence", "unknown", etc.) should be? How is that set?

vikramdattu commented 8 months ago

Hello TennisSmith,

That completely depends on the model trained. It cannot be inferred from the model what the categories are; only the number of categories can be known, from the output tensor size.

Thanks, Vikram
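In other words, the label order is whatever index order the trained model emits; the C++ side simply indexes `kCategoryLabels` with the argmax of the output tensor. A minimal Python equivalent of that lookup (labels copied from the micro_speech example, scores invented for illustration):

```python
import numpy as np

# The detection model outputs one score per category; the label is a
# lookup by index, so the label array must match the training order.
kCategoryLabels = ["silence", "unknown", "yes", "no"]  # from micro_speech

scores = np.array([0.05, 0.10, 0.80, 0.05])  # example output tensor values
top = int(np.argmax(scores))
print(kCategoryLabels[top])  # yes
```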

On 14-Mar-2024, at 3:17 AM, Tennis Smith @.***> wrote:

> ...how do you know what the order of the labels ("silence", "unknown", etc.) should be? How is that set?

gamename commented 8 months ago

> That completely depends on the model trained. It cannot be inferred from the model what categories are. Only the number of categories can be known from output tensor size.

That's not quite what I am asking. :)

My question is this: how do I know the order of the labels as they are used in Python after the model has been created?
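If the training used the speech_commands-style tooling, the order is the words list used at training time (`_silence_`, `_unknown_`, then the wanted words in the order given). When in doubt, you can also determine the order empirically: run clips whose class you know through the .tflite interpreter and see which output index responds. A sketch of that technique is below; `run_model` is a hypothetical stand-in for invoking `tf.lite.Interpreter` on one clip, faked here so the idea can be demonstrated.

```python
import numpy as np

def run_model(clip_class):
    # Stub: a real version would call interpreter.invoke() and return the
    # output tensor. This fake model's index order is [silence, unknown, meow],
    # chosen arbitrarily to demonstrate the probing technique.
    order = ["silence", "unknown", "meow"]
    scores = np.full(len(order), 0.05)
    scores[order.index(clip_class)] = 0.9
    return scores

# Feed known clips and record which index lights up for each class.
for known in ["silence", "meow"]:
    print(known, "->", int(np.argmax(run_model(known))))
```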