Hguimaraes / gtzan.keras

[REPO] Music Genre classification on GTZAN dataset using CNNs
MIT License
201 stars, 57 forks

Train/Test split creation remark #10

Closed (lvaleriu closed 4 years ago)

lvaleriu commented 5 years ago

Shouldn't you split the file list into train and test before extracting the samples? You are doing the opposite: extracting all the samples/features and then splitting. This way, some portions of song 1 can end up in the train split and other portions of song 1 in the test split. This can lead to an invalid validation score, and it doesn't seem right to me.

I've tried splitting the file list as I suggested, and the model still learns, but the val_loss and val_acc are much lower than what you obtain.
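
A minimal sketch of the file-level split described above, using toy data. It is not the code from this repo: `split_files`, `extract_windows`, `build_split`, and the `songs` dictionary are hypothetical names used only for illustration. The point is that the file list is divided first and windows are extracted afterwards, so no song contributes windows to both splits.

```python
import random


def split_files(files, test_fraction=0.2, seed=0):
    """Shuffle the file list and split it BEFORE any windows are extracted."""
    rng = random.Random(seed)
    shuffled = list(files)
    rng.shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    # Return (train_files, test_files)
    return shuffled[n_test:], shuffled[:n_test]


def extract_windows(signal, window, hop):
    """Cut one song into fixed-size, possibly overlapping windows."""
    return [signal[i:i + window]
            for i in range(0, len(signal) - window + 1, hop)]


def build_split(songs, files, window, hop):
    """Extract windows only from the given files.

    songs: dict mapping file name -> sequence of samples
           (a stand-in for whatever the real audio loader returns).
    """
    X, sources = [], []
    for name in files:
        for w in extract_windows(songs[name], window, hop):
            X.append(w)
            sources.append(name)  # remember which song each window came from
    return X, sources


# Toy data: 10 fake "songs" of 100 samples each (assumption, not GTZAN audio).
songs = {f"song_{i}.wav": list(range(i * 100, i * 100 + 100)) for i in range(10)}

train_files, test_files = split_files(sorted(songs), test_fraction=0.2, seed=42)
X_train, src_train = build_split(songs, train_files, window=20, hop=10)
X_test, src_test = build_split(songs, test_files, window=20, hop=10)

# No song appears in both splits, so windows cannot leak across them.
assert set(src_train).isdisjoint(src_test)
```

On a real dataset the split would probably also be stratified by genre so that each class appears in both splits; that step is left out here to keep the sketch short.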

What are your thoughts about this?

Hguimaraes commented 5 years ago

@lvaleriu thanks for the comment. You are totally right; this is a known data leakage issue. With a correct split method, the accuracy decreases by ~10-15%.

A few days after I released this source code on GitHub I realized that, but I haven't had time to fix it yet. A lot of this project needs to be rebuilt. I plan to rework everything soon, but I'm keeping this repo on GitHub to give people an idea of where to start, not for full use in a production environment. The approach using classical ML methods does not have this problem (I think). If you need to use this repo ASAP, I suggest you take the classical approach or rework the deep learning version (and probably use a 1D CNN). The lack of data in the GTZAN dataset is a challenge. In my thesis I also used data from Spotify and was able to retrain the right way.

Cheers,

Hguimaraes commented 4 years ago

Hi @lvaleriu,

Fixed in commit e3f5eba. I updated the packages, the architecture, and a lot of conceptual details that were wrong, but the performance is similar. Please check it out and contact me if something goes wrong.

Cheers,