dscripka / openWakeWord

An open-source audio wake word (or phrase) detection framework with a focus on performance and simplicity.
Apache License 2.0

Poor Performance with automatic_model_training.ipynb #110

Status: Open. Opened by lea-xtend-ai 9 months ago.

lea-xtend-ai commented 9 months ago

Hi,

I tried to use automatic_model_training.ipynb but encountered significant issues, resulting in a model that does not work effectively at all.

Configuration YAML Used

{'augmentation_batch_size': 16,
 'augmentation_rounds': 1,
 'background_paths': ['./audioset_16k', './fma'],
 'background_paths_duplication_rate': [1],
 'batch_n_per_class': {'ACAV100M_sample': 1024,
  'adversarial_negative': 50,
  'positive': 50},
 'custom_negative_phrases': [],
 'false_positive_validation_data_path': 'validation_set_features.npy',
 'feature_data_files': {'ACAV100M_sample': 'openwakeword_features_ACAV100M_2000_hrs_16bit.npy'},
 'layer_size': 32,
 'max_negative_weight': 1500,
 'model_name': 'alice',
 'model_type': 'dnn',
 'n_samples': 10000,
 'n_samples_val': 2000,
 'output_dir': './my_custom_model',
 'piper_sample_generator_path': './piper-sample-generator',
 'rir_paths': ['./mit_rirs'],
 'steps': 50000,
 'target_accuracy': 0.7,
 'target_false_positives_per_hour': 0.2,
 'target_phrase': ['alice'],
 'target_recall': 0.5,
 'tts_batch_size': 50}
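
(For reference, the dict above is presumably just the notebook's printed view of its settings; a minimal sketch of writing such a dict back out to a YAML config file with PyYAML, where the file name alice_config.yaml is only a placeholder, would be:)

import yaml

config = {
    'model_name': 'alice',
    'target_phrase': ['alice'],
    'n_samples': 10000,
    # ... remaining keys exactly as printed above ...
}

# 'alice_config.yaml' is a placeholder; point the notebook at whatever path it expects
with open('alice_config.yaml', 'w') as f:
    yaml.safe_dump(config, f)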

Output Results:

Final Model Accuracy: 0.6267499923706055
Final Model Recall: 0.25699999928474426
Final Model False Positives per Hour: 1.0619468688964844

I tested using the detect_from_microphone script.
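
For context, the detection loop in that script boils down to roughly the following sketch; the model path, frame size, and 0.5 threshold are illustrative placeholders rather than the exact script contents.

import numpy as np
import pyaudio
from openwakeword.model import Model

# Load the custom model produced by training (path is a placeholder)
oww = Model(wakeword_models=["./my_custom_model/alice.onnx"])

CHUNK = 1280  # 80 ms of 16 kHz, 16-bit mono audio per prediction
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                 input=True, frames_per_buffer=CHUNK)

while True:
    frame = np.frombuffer(stream.read(CHUNK), dtype=np.int16)
    scores = oww.predict(frame)  # dict of {model_name: score} for each loaded model
    if scores["alice"] > 0.5:    # illustrative threshold
        print("Wakeword detected!")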

On the other hand, I successfully trained a model using training_models.ipynb, with synthetic_speech_dataset_generation producing 10,000 samples (5,000 per TTS model), and the results were fine even without downloading the full negative dataset.

Questions

  1. What might be causing the disparity in performance between the two notebooks?
  2. For training_models.ipynb:
     2.1. Do I need to download the entire negative dataset, and if so, what is the total size in GB?
     2.2. Is generating 50,000 positive samples for each model necessary?
  3. Any additional advice or constructive feedback would be greatly appreciated.

Thank you!

lea-xtend-ai commented 9 months ago

I changed n_samples to 100,000 and it was even worse:

{'augmentation_batch_size': 16,
 'augmentation_rounds': 1,
 'background_paths': ['./audioset_16k', './fma'],
 'background_paths_duplication_rate': [1],
 'batch_n_per_class': {'ACAV100M_sample': 1024,
  'adversarial_negative': 50,
  'positive': 50},
 'custom_negative_phrases': [],
 'false_positive_validation_data_path': 'validation_set_features.npy',
 'feature_data_files': {'ACAV100M_sample': 'openwakeword_features_ACAV100M_2000_hrs_16bit.npy'},
 'layer_size': 32,
 'max_negative_weight': 1500,
 'model_name': 'alice',
 'model_type': 'dnn',
 'n_samples': 100000,
 'n_samples_val': 2000,
 'output_dir': './alice',
 'piper_sample_generator_path': './piper-sample-generator',
 'rir_paths': ['./mit_rirs'],
 'steps': 50000,
 'target_accuracy': 0.7,
 'target_false_positives_per_hour': 0.2,
 'target_phrase': ['alice'],
 'target_recall': 0.5,
 'tts_batch_size': 50}
dscripka commented 9 months ago

Those numbers seem reasonable based on my past experience, so it's odd that the model isn't working well for you in practice. Can you share the trained model file, and/or the notebook you used to train the model that performs well?

As for the other questions:

1) The automatic model training notebook tries to simplify the training process by using reasonable defaults and automatically setting hyperparameters. From my somewhat limited testing this works well most of the time, but it's not surprising that in some cases a more manual process can produce a better model.

2) In general, more negative data is better, but there are diminishing returns. In my own testing I have terabytes of negative data for different experiments, but I doubt all of it is needed. As for positive examples, usually between 20,000 and 50,000 is sufficient, but sometimes more can help (it depends on the model, from what I've seen).

lea-xtend-ai commented 9 months ago

I utilized the training_models.ipynb notebook from the repository and made modifications to filter positive clips. Here's the specific change I made:

import openwakeword.data

positive_clips, durations = openwakeword.data.filter_audio_paths(
    [
        "pos_data/alice/VITS/",
    ],
    min_length_secs = 0.5, # minimum clip length in seconds
    max_length_secs = 2.9, # maximum clip length in seconds
    duration_method = "header" # use the file header to calculate duration
)

print(f"{len(positive_clips)} positive clips after filtering, representing ~{sum(durations)//3600} hours")

The output shows that there are 48796 positive clips after filtering, representing approximately 8.0 hours.

This is my second experiment with 50,000 generated examples. The first experiment, which involved 10,000 examples from VITS and WAVEGLOW, worked fine.

Unfortunately, I cannot attach the model file due to GitHub's limitations on file types. Is it possible for me to send it via email instead?

Thank you for your assistance!

sanjuktasr commented 9 months ago

@lea-xtend-ai Did you use training_models.ipynb exactly as it is, even the same model as mentioned there? Also, what do you mean by not downloading the full negative data?

dscripka commented 9 months ago

> Unfortunately, I cannot attach the model file due to GitHub's limitations on file types. Is it possible for me to send it via email instead?

@lea-xtend-ai can you put the model file into an archive (e.g., zip or tar)? That might help you attach it to this issue. See more details here: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/attaching-files
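
A minimal way to create such an archive in Python, assuming the model file is named alice_v1.onnx (adjust to your actual file name):

import zipfile

# Wrap the .onnx model in a zip archive so GitHub accepts it as an attachment
with zipfile.ZipFile("alice_v1.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("alice_v1.onnx")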

lea-xtend-ai commented 9 months ago

I have a model that I trained using the training_models notebook, and it's working more or less fine. However, I've noticed a few false positives, especially with words containing the "al" sound, where the trailing "s" seems to matter less for the model to recognize "Alice."

The model file is named alice_v5.zip

Interestingly, I observed that the model created by automatic_model_training.ipynb is much lighter, weighing only 206 KB, whereas the one from training_models is 351 KB.

[image: screenshot comparing the two model files and their sizes]

thank you again!

twitchyliquid64 commented 8 months ago

I've also observed worse-than-expected performance with some words, most recently 'apartment'. I can upload an example if that would help as well.

dscripka commented 8 months ago

@lea-xtend-ai testing the alice_v1.onnx model myself, I broadly agree with you. It performs reasonably well, but does have some false positives, especially for words that are very similar to "alice" (e.g., "malice", "chalice", "callus", etc.).

This makes sense to a certain extent, as the training_models notebook does not include adversarial speech, while the automatic model training notebook does. That is, the automatic process attempts to find words that sound similar to the target wakeword and includes those in the training data. However, because this process is automated, it doesn't always work as expected.

If you want to explore the automatic training process further, there is an option in the YAML config file (custom_negative_phrases) that allows you to specify adversarial negative phrases. This can greatly improve performance in cases where you know that certain words/phrases lead to false activations.
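
For example, in the config shown earlier in this thread, that would mean replacing the empty custom_negative_phrases list with something like the following (the specific words are only illustrations based on the false activations mentioned above):

'custom_negative_phrases': ['malice', 'chalice', 'callus'],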

dscripka commented 8 months ago

> I've also observed worse-than-expected performance with some words, most recently 'apartment'. I can upload an example if that would help as well.

@twitchyliquid64 if you are noticing too many false positives with the "apartment" wakeword, I would recommend the same approaches mentioned above.

lea-xtend-ai commented 8 months ago

@dscripka Yes, I agree with you, but I haven't been able to get the "automatic model training notebook" to work properly, regardless of the parameters I choose.