Feature and Model Selection

As mentioned, our original data holds half a million observations with a few dozen features, most of them categorical, so careful feature selection and model selection were extremely important, especially because model training consumed a significant amount of computational resources.
Since we could not efficiently use automated feature selection such as RFE or FFE (due to time and resource constraints), we performed manual feature selection. Drawing on our intuition about the target domain and some practical experience, we pruned the feature list to the 12 we considered most important:
10 categorical features:

- manufacturer (brand)
- transmission type
- fuel type
- paint color
- number of cylinders
- drive type (AWD / FWD / RWD)
- size
- condition
- title_status
- state

2 continuous features:

- year
- odometer
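As an illustration of how these 12 features could be prepared for modeling, the sketch below one-hot encodes the categorical features and passes the continuous ones through. The column names are assumptions for this example, not necessarily the names used in our dataset.

```python
# Sketch: encode the 10 categorical and 2 continuous features with
# scikit-learn. Column names below are assumed, not from the report.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

categorical = ["manufacturer", "transmission", "fuel", "paint_color",
               "cylinders", "drive", "size", "condition",
               "title_status", "state"]
continuous = ["year", "odometer"]

preprocess = ColumnTransformer([
    # handle_unknown="ignore" keeps unseen category levels from crashing
    # prediction on the validation / test splits
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", "passthrough", continuous),
])
```

Such a transformer can then be placed in front of any of the candidate models in a single `Pipeline`.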
Based on our EDA and assumptions, we picked a number of models to fit to our training data. Since training and validation consumed substantial resources, we performed them on gradually increasing subsets of the training data. The results below are sorted by validation score (ascending):
| Model | Training Score | Validation Score |
| --- | --- | --- |
| Linear Regression | 0.555803 | 0.526354 |
| Stochastic Gradient Descent | 0.550439 | 0.528612 |
| kNN | 0.638008 | 0.626848 |
| Random Forests | 0.964447 | 0.734342 |
| Gradient Boosted Trees | 0.803595 | 0.736818 |
| Support Vector Machines | 0.840271 | 0.813724 |
For each model, we performed a 5-fold cross-validated grid search over a range of the most important model-specific hyper-parameters.
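For the SVM, for instance, such a search might look like the sketch below. The parameter grid is an assumption for illustration; the report does not list the exact ranges we searched.

```python
# Sketch: 5-fold cross-validated grid search for an RBF-kernel SVM
# regressor. The grid values here are illustrative only.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=12, noise=5.0,
                       random_state=0)

grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": ["scale", 0.01]},
    cv=5,            # 5-fold cross-validation
    scoring="r2",
)
grid.fit(X, y)
```

After fitting, `grid.best_params_` and `grid.best_score_` give the winning combination and its cross-validated score.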
Interpretation and Critique of the Results
Since our features are mostly categorical with numerous levels, we expected the data to be hardly linearly separable, so the best models would likely be kNN, ensemble methods (such as random forests and boosted trees), and an SVM with an RBF kernel. The results lined up with our expectations. We believe the RBF SVM worked best because our data is highly clustered, which is where this model shines. Random forests and boosted trees were expected to work well for the same reason, but the SVM still came out ahead. Another encouraging sign was that the SVM did not overfit the training set badly, which justified further testing.
Since the SVM showed the best results from the very beginning, we performed a thorough adaptive grid search on a smaller subset of the data and then validated the optimal parameters on a larger subset of 200,000 observations (a 4-hour run), resulting in 81.3% accuracy on the validation data. Finally, we ran the model on the test data of more than 40,000 observations, which confirmed the model with an even better accuracy of 81.5%. The resulting metrics:
| Metric | Value |
| --- | --- |
| Accuracy | 0.815724 |
| RMSE | 4366.43 |
| MAE | 2691.71 |
| Average Price | 13819.99 |
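For reference, RMSE and MAE as reported above can be computed with scikit-learn; the price arrays below are toy placeholders, not our actual predictions.

```python
# Sketch: computing RMSE and MAE for price predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy values; in the project these would be the SVM outputs on test data.
y_true = np.array([12000.0, 8500.0, 21000.0, 15500.0])
y_pred = np.array([11200.0, 9100.0, 19800.0, 16400.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes outliers
mae = mean_absolute_error(y_true, y_pred)           # robust to outliers
```

Because RMSE squares the errors before averaging, it always meets or exceeds MAE, and the gap between them grows with the variance of the errors.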
Although we achieved a respectable accuracy of 81.5%, the other metrics are also informative. For example, an RMSE almost twice the MAE suggests that a good number of observations have large errors (the more the RMSE exceeds the MAE, the higher the error variance). This is something we may want to improve by finding features and clusters in the data space that introduce extra variance into the predictions. For example, a model predicting the price of a clean car may differ greatly from one predicting the price of a salvage (damaged / total loss) car. Addressing this requires deeper domain expertise, and we plan to explore it further.
We may also want to use a different scoring function for our model, e.g. a custom implementation of the MSE of the relative error, since prices in the original dataset have high variance.
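One possible version of such a scorer, built with `make_scorer`, is sketched below; the exact definition is still an open question, so treat this as one candidate rather than the final choice.

```python
# Sketch: MSE of the *relative* error, so a $2,000 miss on a $4,000 car
# costs more than the same miss on a $40,000 car.
import numpy as np
from sklearn.metrics import make_scorer

def mse_of_relative_error(y_true, y_pred):
    rel = (y_pred - y_true) / y_true   # assumes prices are strictly > 0
    return np.mean(rel ** 2)

# greater_is_better=False: grid search should minimize this loss.
relative_mse_scorer = make_scorer(mse_of_relative_error,
                                  greater_is_better=False)
```

The resulting `relative_mse_scorer` can be passed directly as the `scoring` argument of `GridSearchCV` or `cross_val_score`.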
Lastly, due to time and resource limitations we trained the model on only half of the training data, so we should run it on all of the training data and see how that changes the model. So far, the score has only improved as we increased the sample size.
Open Goals
Our open goals by the end of this project are:
- train our model on all available training data (expected run time ~16 hrs)
- perform thorough automated feature selection to identify features and / or data-space clusters that degrade the model
- try other custom scoring functions, possibly reflecting the MSE of the relative error
- create a command-line tool for the end user to interactively request vehicle details and output the expected price with a precision interval
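The planned command-line tool could start as simple as the sketch below. The prompts, the MAE-based interval, and the commented-out model file are placeholders, not the final design.

```python
# Sketch of the planned interactive CLI for price prediction.
def prompt_vehicle():
    """Ask the user for the 12 selected features, one prompt per field."""
    fields = ["manufacturer", "transmission", "fuel", "paint_color",
              "cylinders", "drive", "size", "condition",
              "title_status", "state", "year", "odometer"]
    return {f: input(f"{f}: ") for f in fields}

def report_price(predicted, mae):
    """Use MAE as a rough +/- precision interval around the prediction."""
    return (f"Expected price: ${predicted:,.0f} "
            f"(roughly ${predicted - mae:,.0f} to ${predicted + mae:,.0f})")

# Usage (interactive session):
#   vehicle = prompt_vehicle()
#   model = joblib.load("svm_price_model.joblib")  # hypothetical artifact
#   predicted = model.predict(to_frame(vehicle))[0]
#   print(report_price(predicted, mae=2691.71))
```

A proper precision interval would need per-cluster error estimates; MAE is just the cheapest first approximation.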