RockStarCoders / alienMarkovNetworks

Using MRFs and CRFs for computer vision problems.

Select parameters for oversegmentation and classifier by optimising validation set error #27

Closed jsherrah closed 10 years ago

jsherrah commented 11 years ago

trainClassifier.py already does the grid search for the classifier parameters. Need a script that searches over the two superpixel (SLIC) parameters outside of this (a rough sketch of this outer loop is below the list).

  1. For P1,P2 (superpixel params) over a range of combinations:
    1. Run createFeatures.py to generate the feature set for the training set.
    2. Run trainClassifier.py to grid-search for the best classifier given these features.
    3. Run classifyAllImages.sh to generate class prediction images for the validation set.
    4. Run evalPredictions.py on the classifier results to get the validation set error.
  2. Select the optimal P1,P2 giving minimum validation set error.
  3. This gives us the best classifier for the training (and validation) data.
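A rough sketch of the outer loop using subprocess. The command-line flags, file-naming convention and the idea that evalPredictions.py writes its error to a text file are all assumptions for illustration, not the scripts' actual interfaces:

```python
import itertools
import subprocess

# Hypothetical search ranges for the two SLIC parameters.
NUM_SUPERPIXELS = [400, 1000]
COMPACTNESS = [10.0, 15.0, 30.0]

best = None
for n, c in itertools.product(NUM_SUPERPIXELS, COMPACTNESS):
    tag = 'slic-%d-%06.2f' % (n, c)
    # 1. Generate features for the training set (flag names are assumptions).
    subprocess.check_call(['python', 'createFeatures.py', '--nbSuperPixels', str(n),
                           '--superPixelCompactness', str(c),
                           'msrcTraining_%s_ftrs.pkl' % tag])
    # 2. Grid-search for the best classifier on these features.
    subprocess.check_call(['python', 'trainClassifier.py',
                           'msrcTraining_%s_ftrs.pkl' % tag,
                           '--outfile', 'classifier_%s.pkl' % tag])
    # 3. Predict on the validation images.
    subprocess.check_call(['./classifyAllImages.sh', 'classifier_%s.pkl' % tag])
    # 4. Score the predictions; assume the eval script writes its error to a text file.
    subprocess.check_call(['python', 'evalPredictions.py', '--outfile', 'valError_%s.txt' % tag])
    err = float(open('valError_%s.txt' % tag).read())
    if best is None or err < best[0]:
        best = (err, n, c)

print('Best validation error %.4f with %d superpixels, compactness %.1f' % best)
```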
jsherrah commented 11 years ago

Cross-validation takes over my machine and gives it a nipple cripple. Better to run it in a VM; then if I want to reboot I don't lose the current progress.

I've added the directory 'vagrant'. If you cd there, you should be able to do "vagrant up" and it will set up a machine fer ye. If you "vagrant ssh" and then "cd /vagrant", you will see the repo there.

jsherrah commented 11 years ago

Need to work out the training-validation-test split now. Then, need to be able to incorporate the filenames into createFeatures.py.
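A possible sketch of that split, assuming the MSRC images sit in one directory and the scripts can take a plain file list; the directory path, proportions and output filenames here are all assumptions:

```python
import glob
import random

# Gather the MSRC image filenames (directory path is an assumption).
allFiles = sorted(glob.glob('MSRC_ObjCategImageDatabase_v2/Images/*.bmp'))

random.seed(42)   # fixed seed so the split is reproducible
random.shuffle(allFiles)

nTrain = int(0.6 * len(allFiles))
nVal   = int(0.2 * len(allFiles))

trainFiles = allFiles[:nTrain]
valFiles   = allFiles[nTrain:nTrain + nVal]
testFiles  = allFiles[nTrain + nVal:]

# One filename per line, so createFeatures.py could take a file list as input.
for name, files in [('train.txt', trainFiles),
                    ('validation.txt', valFiles),
                    ('test.txt', testFiles)]:
    with open(name, 'w') as f:
        f.write('\n'.join(files))
```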

jsherrah commented 11 years ago

Now running the oversegmentation search in the VM.

jsherrah commented 11 years ago
| Data set | Training error | Validation error |
| --- | --- | --- |
| msrcTraining_slic-1000-010.00_ftrs.pkl | 0.778127566384 | 0.648663419271 |
| msrcTraining_slic-1000-015.00_ftrs.pkl | 0.780170581154 | 0.649129264533 |
| msrcTraining_slic-1000-030.00_ftrs.pkl | 0.784338941161 | 0.651351412074 |
| msrcTraining_slic-400-010.00_ftrs.pkl | 0.764672590413 | 0.661881977671 |
| msrcTraining_slic-400-015.00_ftrs.pkl | 0.769890358962 | 0.660832313341 |
| msrcTraining_slic-400-030.00_ftrs.pkl | 0.776099102583 | 0.656877897991 |

Best results are for 400 superpixels and a compactness of 10. Since these are at the end of the range, it is possible the best params are lower still... but since Shotton et al. used the same params, use these.
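For reference, a minimal oversegmentation sketch with those values, assuming scikit-image's SLIC is what sits under the repo's superpixel code (the filename is a placeholder):

```python
from skimage import io
from skimage.segmentation import slic

img = io.imread('someMsrcImage.bmp')   # placeholder filename
# Roughly 400 superpixels with compactness 10, as selected above.
labels = slic(img, n_segments=400, compactness=10.0)
```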

jsherrah commented 11 years ago

After a day of classifier grid search, I got an 'out of memory' crash! Bugger. I've added a number-of-jobs parameter; setting it to 2 will halve the memory usage. I've also cut the number of parameter values, so the grid search is smaller.
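A minimal sketch of what that jobs parameter trades off, assuming the grid search is scikit-learn's GridSearchCV; the grid shown here is illustrative, not the one actually used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative, smaller grid than the real search.
paramGrid = {'n_estimators': [150, 500], 'max_depth': [15, 50]}

# n_jobs controls how many fits run in parallel; fewer parallel fits
# means lower peak memory at the cost of a longer wall-clock time.
search = GridSearchCV(RandomForestClassifier(), paramGrid, cv=2, n_jobs=2)
```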

jsherrah commented 10 years ago

Starting to return to the project. This was the next thing to do. It keeps running out of memory.

jsherrah commented 10 years ago

Run completed with 2-fold cross-validation:

Grid search gave these parameters: max_features: 75, min_samples_split: 10, n_estimators: 500, max_depth: 50, min_samples_leaf: 5

Training randyforest classifier on 70676 examples with grid search param values...
Introducing Britain's hottest rock performer, Randy Forest! done.
Training set accuracy (frac correct) = 0.959887373366
Test set accuracy (frac correct) = 0.681275917065
Output written to file classifier_msrc_rf_400-10_grid.pkl

I will try again with more folds.

jsherrah commented 10 years ago

Another run, using 4-fold cross-validation, finished:

Restarting trainClassifier.py with arguments: --outfile classifier_msrc_rf_400-10_grid.pkl --type=randyforest --paramSearchFolds=4 --ftrsTest=./vagrant/features/msrcValidation_slic-400-010.00_ftrs.pkl --labsTest=./vagrant/features/msrcValidation_slic-400-010.00_labs.pkl ./vagrant/features/msrcTraining_slic-400-010.00_ftrs.pkl ./vagrant/features/msrcTraining_slic-400-010.00_labs.pkl --nbJobs=6 --rf_max_features=75 --rf_n_estimators=500 --rf_max_depth=50 --rf_min_samples_leaf=5 --rf_min_samples_split=10

Randyforest parameter search grid: {'n_estimators': [50, 150, 500], 'max_features': [5, 75, 'auto'], 'min_samples_split': [10, 100], 'max_depth': [15, 50, None], 'min_samples_leaf': [5, 20, 100]} Fitting 4 folds for each of 162 candidates, totalling 648 fits

Done. Grid search gave these parameters: max_features: 75, min_samples_split: 10, n_estimators: 500, max_depth: None, min_samples_leaf: 5

Training randyforest classifier on 70676 examples with grid search param values...
Introducing Britain's hottest rock performer, Randy Forest! done.
Training set accuracy (frac correct) = 0.960028864112
Test set accuracy (frac correct) = 0.681722488038
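Assuming the "randyforest" classifier wraps scikit-learn's RandomForestClassifier, the winning parameters map onto it roughly like this (variable names are hypothetical):

```python
from sklearn.ensemble import RandomForestClassifier

# Parameters reported by the 4-fold grid search above.
clf = RandomForestClassifier(n_estimators=500,
                             max_features=75,
                             max_depth=None,
                             min_samples_split=10,
                             min_samples_leaf=5,
                             n_jobs=6)
# clf.fit(trainFeatures, trainLabels)   # 70676 training examples in the run above
```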

But it nearly used all 16 GB of RAM.

jsherrah commented 10 years ago

Those two runs gave pretty much the same best parameters (except max_depth). It's concerning that the training and validation errors are so different.

jsherrah commented 10 years ago

Closing because it's solved for now, but it has created a new issue about why the training and validation set errors are so different.