chrispla / mir_ref

A Representation Evaluation Framework for Music Information Retrieval tasks

Add MFCC Feature Extraction with Flattened Embedding Structure #2

Open JoaoSartoreto opened 1 month ago

JoaoSartoreto commented 1 month ago

Pull Request Documentation for MFCC Extraction

Objective

I am working to integrate MFCC extraction into the training pipeline, ensuring it aligns with the structure expected by the model. Training currently suffers from high memory consumption and model errors, which suggests the structure of the saved MFCC features may not match the model's expectations.

This pull request outlines the MFCC extraction method implemented, along with several adjustments I’ve tried to align with the model’s requirements. My hope is to receive feedback on the correct structural format for these embeddings.

Implemented MFCC Extraction Code

The implemented extraction code captures MFCCs, deltas, and delta-deltas, ensuring a consistent frame length of 500, in line with the model’s expected dimensions. Key steps include:

  1. Audio Loading: Each audio file is loaded at 22,050 Hz.
  2. MFCC Extraction: We extract 13 MFCC coefficients.
  3. Delta Calculations: First and second-order deltas are computed.
  4. Feature Combination: MFCCs, deltas, and delta-deltas are stacked into a (39, 500) feature matrix.
  5. Padding or Truncation: Padding is added or data truncated as needed to reach 500 frames.
  6. Flattening and Saving: Features are flattened to a (19500,) shape before saving to .npy files. It is worth mentioning that if flattening is not applied, training fails with the error:

    FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated. [[{{node PyFunc}}]]
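
The steps above can be sketched as follows. This is a minimal, dependency-free sketch: in the real pipeline the MFCCs come from `librosa.feature.mfcc(y=y, sr=22050, n_mfcc=13)` and the deltas from `librosa.feature.delta`; here a random matrix and `np.gradient` stand in so the example is self-contained.

```python
import numpy as np

def pad_or_truncate(feats, target_frames=500):
    """Zero-pad or truncate along the time axis to exactly target_frames."""
    n_frames = feats.shape[1]
    if n_frames < target_frames:
        return np.pad(feats, ((0, 0), (0, target_frames - n_frames)))
    return feats[:, :target_frames]

def build_feature_vector(mfcc, target_frames=500):
    # mfcc: (13, n_frames). np.gradient is a stand-in for
    # librosa.feature.delta to keep the sketch runnable without librosa.
    delta = np.gradient(mfcc, axis=1)            # first-order deltas
    delta2 = np.gradient(delta, axis=1)          # second-order deltas
    stacked = np.vstack([mfcc, delta, delta2])   # (39, n_frames)
    fixed = pad_or_truncate(stacked, target_frames)  # (39, 500)
    return fixed.flatten()                       # (19500,) saved to .npy

vec = build_feature_vector(np.random.randn(13, 431))
print(vec.shape)  # (19500,)
```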


Approaches Tried

Here are the main approaches I’ve attempted, each aiming to align the MFCC structure with the model’s expectations.

1. Flattened Structure of (19500,)

In the first approach, I saved the MFCC features as a flattened array of size (19500,), assuming the model would reshape this automatically to the original matrix dimensions of (39, 500).
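Flattening is only losslessly invertible if the consumer reshapes with the same shape and memory order; whether the model does this internally is exactly what I am unsure about. A small numpy illustration of the assumption:

```python
import numpy as np

feats = np.arange(39 * 500, dtype=np.float32).reshape(39, 500)
flat = feats.flatten()               # row-major (C order): shape (19500,)
restored = flat.reshape(39, 500)     # exact inverse with matching shape/order
print(np.array_equal(feats, restored))  # True

# Reshaping with the transposed shape silently scrambles the data instead
# of transposing it:
wrong = flat.reshape(500, 39)
print(np.array_equal(feats.T, wrong))   # False
```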


2. Data Transposition Before Flattening

Based on my advisor's suggestion, I also tried transposing the (39, 500) matrix to (500, 39) before flattening, in hopes of aligning the data with any specific format requirements within the model.
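Transposing before flattening changes the element order from coefficient-major to time-major, so this only helps if the model expects all coefficients of one frame to be contiguous. A toy example with a hypothetical 2-coefficient, 3-frame matrix shows the difference:

```python
import numpy as np

feats = np.array([[0, 1, 2],
                  [3, 4, 5]])        # 2 "coefficients" x 3 "frames"
coeff_major = feats.flatten()        # each coefficient's frames contiguous
time_major = feats.T.flatten()       # each frame's coefficients contiguous
print(coeff_major.tolist())  # [0, 1, 2, 3, 4, 5]
print(time_major.tolist())   # [0, 3, 1, 4, 2, 5]
```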


3. StandardScaler for Consistent Normalization

To ensure a uniform scale across all audio files in the dataset, I applied StandardScaler to the feature matrix before flattening, normalizing each feature to zero mean and unit standard deviation. The intent was to remove scaling inconsistencies caused by differing amplitude levels across the original audio files, which could affect model processing.
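The per-coefficient standardization can be expressed directly in numpy, which also makes the orientation explicit (StandardScaler standardizes columns, so applying it to a (39, 500) matrix versus its transpose normalizes different axes). This sketch standardizes each of the 39 coefficients over time:

```python
import numpy as np

def standardize(feats, axis=1, eps=1e-8):
    # Per-coefficient standardization over the time axis, equivalent to
    # sklearn's StandardScaler applied to the transposed (frames, coeffs)
    # matrix; eps guards against division by zero for constant rows.
    mean = feats.mean(axis=axis, keepdims=True)
    std = feats.std(axis=axis, keepdims=True)
    return (feats - mean) / (std + eps)

x = standardize(np.random.randn(39, 500) * 5.0 + 3.0)
print(np.allclose(x.mean(axis=1), 0.0, atol=1e-6))  # True
```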


4. Consistency with VGG and CLMR Feature Structures

Since the model processes embeddings from VGG and CLMR without excessive memory usage, I hypothesized that shaping the flattened MFCC array to resemble the formats those extractors produce might help. The goal was to let the model handle the MFCC data as efficiently as it handles the other feature types, by matching its internal data handling expectations.
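One low-effort diagnostic for this comparison (a hypothetical helper, not part of mir_ref) is to summarize the shape, dtype, and byte size of a saved MFCC `.npy` file against those produced by VGG and CLMR. Dtype alone can matter for memory: a float64 array occupies twice the space of float32.

```python
import numpy as np

def describe_embedding(arr):
    # Hypothetical inspection helper: summarize an embedding array so it
    # can be compared against what other extractors (e.g. VGG, CLMR) save.
    return {"shape": arr.shape, "dtype": str(arr.dtype), "nbytes": arr.nbytes}

mfcc_vec = np.zeros(19500, dtype=np.float64)
print(describe_embedding(mfcc_vec))
print(describe_embedding(mfcc_vec.astype(np.float32)))  # half the memory
```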


Next Steps and Feedback Needed

Each of these approaches resulted in high memory consumption and training errors, likely due to a mismatch in the structural format of the MFCC embeddings. Without flattening, the model fails to train and returns the error:

FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated. [[{{node PyFunc}}]]

Could you please take a look at the code provided and offer feedback on the specific structure the training model expects for these features?

Thank you very much for your time and guidance.