I am trying to train a CNN on the GTZAN dataset. The accuracy is around 50%, which seems very low. I have a question: the mel spectrogram of each audio sample has dimensions 96x2584. Should I use the whole spectrogram as one "image" for the CNN, or do I need to divide the audio file into smaller chunks (of, say, 2048 samples) and run the CNN on each chunk?
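To make the second option concrete, here is a minimal sketch of cutting one 96x2584 spectrogram into fixed-width windows along the time axis. It slices spectrogram frames rather than raw audio samples. The helper name `split_spectrogram`, the window width of 128 frames, and the hop of 64 frames are all assumptions chosen for illustration, and the random array only stands in for a real mel spectrogram.

```python
import numpy as np


def split_spectrogram(mel, segment_frames=128, hop_frames=64):
    """Cut a (n_mels, n_frames) array into overlapping windows along time.

    Hypothetical helper: the window and hop sizes are illustrative defaults.
    Trailing frames that do not fill a whole window are dropped.
    """
    n_mels, n_frames = mel.shape
    starts = range(0, n_frames - segment_frames + 1, hop_frames)
    # Result has shape (n_segments, n_mels, segment_frames).
    return np.stack([mel[:, s:s + segment_frames] for s in starts])


# Placeholder data with the shape from the question (not real audio features).
mel = np.random.rand(96, 2584).astype(np.float32)

segments = split_spectrogram(mel)
print(segments.shape)  # (39, 96, 128)
```

With this layout, each segment would carry the genre label of the clip it came from, so one clip yields several smaller training "images" instead of a single 96x2584 one.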