There seems to be no linear relationship between any feature and the occupancy, so this approach is discarded.
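A quick way to sanity-check this is to look at the correlation of each feature with the target. This is only a sketch; it assumes `df` is the preprocessed DataFrame with an `occupancy` column:

```python
import pandas as pd

# assumption: df is the preprocessed DataFrame and 'occupancy' is the target column
correlations = df.corr()['occupancy'].drop('occupancy')
# sort by absolute strength; values near 0 indicate no linear relationship
print(correlations.abs().sort_values(ascending=False))
```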
Decision trees are mainly used for classification. There is a regression variant as well, but multivariate regression is not supported, so decision trees and random forests are discarded too.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold

## preprocess the df as with the other algorithms, one-hot encode categorical variables, extract X and y

model = GradientBoostingRegressor()
# evaluate with repeated 10-fold cross-validation
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
# cross_val_score returns negated MAE, so flip the sign when reporting
print('MAE: %.3f (%.3f)' % (-np.mean(n_scores), np.std(n_scores)))
```
```python
# fit on the full dataset and measure the error on the training set itself
model.fit(X, y)
yhats = model.predict(X)
error = np.sum(np.abs(y - yhats))
print(f"Error={error / X.shape[0]}")
```
>> Error=11.795373060231238
```python
import matplotlib.pyplot as plt

# plot true vs. predicted occupancy for the first 100 examples
fig = plt.figure()
ax = fig.gca()
ax.set_xticks(np.arange(0, 100, 10))
ax.set_yticks(np.arange(0, 100, 10))
plt.plot(y[:100])
plt.plot(yhats[:100])
plt.legend(['true', 'predicted'], loc='upper left')
plt.grid()
plt.show()
```
With the data as of now, this model produces an average error of about 11 points per example. The plot shows the true and the predicted output for the first 100 data points: there is a prediction error at every point, which makes this model not very usable.
```python
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

# hold out 10% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True)
model = XGBRegressor(objective='reg:squarederror')
model.fit(X_train, y_train)
yhats = model.predict(X_test)
error = np.sum(np.abs(y_test - yhats))
# normalize by the number of test examples, not training examples
print(f"Error={error / X_test.shape[0]}")
```
With this model the average error per point is 3.2, much better than gradient boosting. After splitting into train and test data, the error was even lower: around 1.05 on average across different splits.
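The "average across different splits" can be checked with a small loop. This is only a sketch; the number of repeats and the seeds are assumptions, not the original setup:

```python
# repeat the split/fit/evaluate cycle with different random seeds
errors = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=seed)
    m = XGBRegressor(objective='reg:squarederror')
    m.fit(X_tr, y_tr)
    preds = m.predict(X_te)
    errors.append(np.mean(np.abs(y_te - preds)))
print(f"Mean error over splits: {np.mean(errors):.2f}")
```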
```python
from lightgbm import LGBMRegressor

# same split, fit and evaluation as with XGBoost above
model = LGBMRegressor()
model.fit(X_train, y_train)
```
LightGBM achieves an average error per point of 7.6, more than double that of XGBoost, though not as bad as gradient boosting.
```python
from catboost import CatBoostClassifier

# didn't one-hot encode the categorical variables;
# CatBoost handles them natively via cat_features
model = CatBoostClassifier(
    custom_loss=['Accuracy'],
    random_seed=42,
    logging_level='Silent'
)
model.fit(
    X, y,
    cat_features=categorical_features_indices,
    logging_level='Verbose',  # overrides the constructor's 'Silent' to get text output
    plot=True
)
```
Training took 2.5 h to complete and the average error per example was 21.6. One-hot encoding the categorical variables might speed up training and improve the error, but the training still takes too long for such a small dataset.
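A sketch of that one-hot variant, reusing `categorical_features_indices` from above to find the column names (this assumes `X` is a DataFrame; it is not code from the original run):

```python
import pandas as pd

# hypothetical variant: one-hot encode the categoricals up front so that
# CatBoost no longer needs cat_features
categorical_cols = list(X.columns[categorical_features_indices])
X_ohe = pd.get_dummies(X, columns=categorical_cols)
model = CatBoostClassifier(custom_loss=['Accuracy'], random_seed=42, logging_level='Silent')
model.fit(X_ohe, y)
```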
TODO

- [x] Added some notes about gradient boosting, XGBoost and LightGBM
- [x] Added the experiments documentation to the wiki
- [ ] Classical ML
- [ ] Boosting algorithms?
- [ ] Other algorithms

Each method might need its own preprocessing!