Closed: jsherrah closed this issue 11 years ago.
Cross-validation takes over my machine and gives it a nipple cripple. Better to run it in a VM; then if I want to reboot I don't lose the current progress.
I've added the directory 'vagrant'. If you cd there, you should be able to run "vagrant up" and it will set up a machine fer ye. Then "vagrant ssh" followed by "cd /vagrant" and you will see the repo there.
Need to work out the training-validation-test split now. Then, need to be able to incorporate the filenames into createFeatures.py.
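A minimal sketch of how that split could work, assuming a flat list of MSRC image filenames; the function name, split fractions and seed are my own placeholders, not anything from the repo:

```python
import random

def split_filenames(filenames, trainFrac=0.6, valFrac=0.2, seed=42):
    # Sort then shuffle deterministically so the split is reproducible.
    names = sorted(filenames)
    random.Random(seed).shuffle(names)
    nTrain = int(trainFrac * len(names))
    nVal = int(valFrac * len(names))
    # Whatever remains after train and validation becomes the test set.
    return names[:nTrain], names[nTrain:nTrain + nVal], names[nTrain + nVal:]
```

createFeatures.py could then be run once per subset with the relevant filename list.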
Now running the oversegmentation search in the VM. Results:
| Data set | Training error | Validation error |
|---|---|---|
| msrcTraining_slic-1000-010.00_ftrs.pkl | 0.778127566384 | 0.648663419271 |
| msrcTraining_slic-1000-015.00_ftrs.pkl | 0.780170581154 | 0.649129264533 |
| msrcTraining_slic-1000-030.00_ftrs.pkl | 0.784338941161 | 0.651351412074 |
| msrcTraining_slic-400-010.00_ftrs.pkl | 0.764672590413 | 0.661881977671 |
| msrcTraining_slic-400-015.00_ftrs.pkl | 0.769890358962 | 0.660832313341 |
| msrcTraining_slic-400-030.00_ftrs.pkl | 0.776099102583 | 0.656877897991 |
Best results are for 400 superpixels, compactness of 10. Since these values are at the edge of the searched range, it is possible the best params lie beyond it... but since Shotton et al. used the same params, use these.
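For reference, producing that oversegmentation with scikit-image's SLIC looks roughly like the sketch below; the image path is a placeholder, and the repo's own feature code may wrap this differently:

```python
from skimage import io, img_as_float
from skimage.segmentation import slic

img = img_as_float(io.imread('1_1_s.bmp'))  # placeholder MSRC image path
# The parameters settled on above: ~400 superpixels, compactness 10.
labels = slic(img, n_segments=400, compactness=10.0)
```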
After a day of classifier grid search, I got an 'out of memory' crash! Bugger. I've added a number-of-jobs parameter; setting it to 2 will halve the memory usage. I've also reduced the number of param values, so the grid search is smaller.
Starting to return to the project. This was the next thing to do. It keeps running out of memory.
Run completed with 2-fold cross-validation:
Grid search gave these parameters: max_features: 75, min_samples_split: 10, n_estimators: 500, max_depth: 50, min_samples_leaf: 5.
Training randyforest classifier on 70676 examples with grid search param values... Introducing Britain's hottest rock performer, Randy Forest! Done. Training set accuracy (frac correct) = 0.959887373366. Test set accuracy (frac correct) = 0.681275917065. Output written to file classifier_msrc_rf_400-10_grid.pkl.
I will try again with more folds.
Another run using 4-fold cross-validation has finished:
Restarting trainClassifier.py with arguments: --outfile classifier_msrc_rf_400-10_grid.pkl --type=randyforest --paramSearchFolds=4 --ftrsTest=./vagrant/features/msrcValidation_slic-400-010.00_ftrs.pkl --labsTest=./vagrant/features/msrcValidation_slic-400-010.00_labs.pkl ./vagrant/features/msrcTraining_slic-400-010.00_ftrs.pkl ./vagrant/features/msrcTraining_slic-400-010.00_labs.pkl --nbJobs=6 --rf_max_features=75 --rf_n_estimators=500 --rf_max_depth=50 --rf_min_samples_leaf=5 --rf_min_samples_split=10
Randyforest parameter search grid: {'n_estimators': [50, 150, 500], 'max_features': [5, 75, 'auto'], 'min_samples_split': [10, 100], 'max_depth': [15, 50, None], 'min_samples_leaf': [5, 20, 100]} Fitting 4 folds for each of 162 candidates, totalling 648 fits
Done. Grid search gave these parameters: max_features: 75, min_samples_split: 10, n_estimators: 500, max_depth: None, min_samples_leaf: 5.
Training randyforest classifier on 70676 examples with grid search param values... Introducing Britain's hottest rock performer, Randy Forest! Done. Training set accuracy (frac correct) = 0.960028864112. Test set accuracy (frac correct) = 0.681722488038.
But it nearly used all 16 GB of RAM.
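For anyone reproducing this outside trainClassifier.py, the search is roughly equivalent to the scikit-learn sketch below; random data stands in for the pickled features, and note that n_jobs (the --nbJobs flag above) trades speed for memory, since each parallel job fits its own forest. Newer scikit-learn spells max_features 'sqrt' where the log above says 'auto'.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-ins for the pickled feature/label arrays from the *_ftrs.pkl files.
X = np.random.rand(1000, 100)
y = np.random.randint(0, 21, size=1000)  # 21 MSRC classes

paramGrid = {
    'n_estimators': [50, 150, 500],
    'max_features': [5, 75, 'sqrt'],  # 'auto' in the log; newer sklearn uses 'sqrt'
    'min_samples_split': [10, 100],
    'max_depth': [15, 50, None],
    'min_samples_leaf': [5, 20, 100],
}

# Each parallel job holds its own forest, so n_jobs directly scales peak memory.
search = GridSearchCV(RandomForestClassifier(), paramGrid, cv=4, n_jobs=2)
search.fit(X, y)
print(search.best_params_)
```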
Those two runs gave pretty much the same best parameters (except max_depth). It is concerning that the training and validation accuracies are so different (0.96 vs 0.68).
Closing because it's solved for now, but I've created a new issue about why the training and validation set errors are so different.
trainClassifier.py already does the grid search for the classifier parameters. Need a script that searches over the two superpixel (SLIC) parameters outside of this; a sketch is below.
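A sketch of what that outer search could look like; run_pipeline is a hypothetical helper (not in the repo) that would create features for the given SLIC parameters, run the classifier training, and return validation accuracy:

```python
import itertools

def search_slic_params(run_pipeline,
                       segmentCounts=(400, 1000),
                       compactnesses=(10.0, 15.0, 30.0)):
    # Evaluate every (n_segments, compactness) pair and keep the best one.
    results = {}
    for nSeg, comp in itertools.product(segmentCounts, compactnesses):
        results[(nSeg, comp)] = run_pipeline(nSeg, comp)
    best = max(results, key=results.get)
    return best, results
```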