anebz / boulder

Occupancy tracker for bouldering gyms
http://boulder.anebz.eu

Train classical ML models #6

Closed anebz closed 3 years ago

anebz commented 3 years ago

Classical ML

- Boosting algorithms?
- Other algorithms

Each method might need its own preprocessing!
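For reference, one possible preprocessing along these lines, as a minimal sketch: it assumes the raw data sits in a pandas DataFrame df with an occupancy target column and a couple of categorical columns (the column names here are made up).

import pandas as pd

# one-hot encode the categorical columns (column names are assumptions)
df = pd.get_dummies(df, columns=['gym', 'weekday'])

# extract the feature matrix and the occupancy target
y = df['occupancy'].values
X = df.drop(columns=['occupancy']).values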

anebz commented 3 years ago

Classic ML algorithms summary

Multivariate linear regression

There seems to be no linearity between any feature and the occupancy, so this approach is discarded.
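A quick way to check this is to look at the correlation of each feature with the target; a minimal sketch, assuming the preprocessed df from the sketch further up:

# absolute Pearson correlation of each feature with the occupancy target;
# values close to 0 suggest no linear relationship
corr = df.corr()['occupancy'].drop('occupancy')
print(corr.abs().sort_values(ascending=False))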

Decision tree

Decision trees are mainly used for classification. There is a regression variant as well, but it does not support the multivariate setup needed here, so decision trees, and with them random forests, are discarded.

Boosting algorithms

Gradient boosting

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold

# preprocess the df as with the other algorithms, one-hot encode categorical variables, extract X and y

model = GradientBoostingRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# scikit-learn returns the negated MAE, so flip the sign when reporting
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
print('MAE: %.3f (%.3f)' % (-np.mean(n_scores), np.std(n_scores)))
model.fit(X, y)

# predict on the training set and compute the average absolute error per example
yhats = model.predict(X)
error = sum(abs(y - yhats))
print(f"Error={error/X.shape[0]}")
>> Error=11.795373060231238

# plot
fig = plt.figure()
ax = fig.gca()
ax.set_xticks(np.arange(0, 100, 10))
ax.set_yticks(np.arange(0, 100, 10))
plt.plot(y[:100])
plt.plot(yhats[:100])
plt.legend(['true', 'predicted'], loc='upper left')
plt.grid()
plt.show()

With the data as of now, this model makes an average error of about 11 occupancy points per example. The plot shows the true and the predicted output for the first 100 data points; there is a noticeable prediction error at every point, which makes this model not very usable.

[image: true vs. predicted occupancy for the first 100 data points, gradient boosting]

XGBoost

from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True)
model = XGBRegressor(objective='reg:squarederror')
model.fit(X_train, y_train)

# average absolute error per test example
yhats = model.predict(X_test)
error = sum(abs(y_test - yhats))
print(f"Error={error/X_test.shape[0]}")

With this model, the average error per point is 3.2, much better than gradient boosting.

[image: true vs. predicted occupancy, XGBoost]

After splitting into train and test data, the error was even lower: about 1.05 on average over different random splits.
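A minimal sketch of how such an average over several random splits can be computed (the number of repetitions here is arbitrary):

import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# repeat the train/test split a few times and average the test MAE
errors = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=seed)
    model = XGBRegressor(objective='reg:squarederror')
    model.fit(X_train, y_train)
    errors.append(mean_absolute_error(y_test, model.predict(X_test)))

print(f"Mean error over {len(errors)} splits: {np.mean(errors):.2f}")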

LightGBM

from lightgbm import LGBMRegressor
model = LGBMRegressor()
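A minimal sketch of evaluating it the same way as the XGBoost model, assuming the X_train/X_test split from before:

from lightgbm import LGBMRegressor
from sklearn.metrics import mean_absolute_error

# fit on the training split and report the mean absolute error on the test split
model = LGBMRegressor()
model.fit(X_train, y_train)
print(f"Error={mean_absolute_error(y_test, model.predict(X_test)):.2f}")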

LightGBM achieves an average error per point of 7.6, more than double that of XGBoost but not as bad as gradient boosting.

[image: true vs. predicted occupancy, LightGBM]

CatBoost

from catboost import CatBoostClassifier

# the categorical variables were not one-hot encoded here;
# CatBoost handles them natively via cat_features
model = CatBoostClassifier(
    custom_loss=['Accuracy'],
    random_seed=42,
    logging_level='Silent'
)

model.fit(
    X, y,
    cat_features=categorical_features_indices,  # indices of the categorical columns in X
    # logging_level='Verbose',  # uncomment for text output
    plot=True
)

The training took 2.5 h to complete and the average error per example was 21.6. One-hot encoding the categorical variables might speed up training and improve the error, but the training is still far too long for such a small dataset.
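If that is tried, a minimal sketch of the one-hot encoded variant (assuming the one-hot encoded X and y from the preprocessing sketch at the top; without native categorical columns, cat_features is simply dropped):

from catboost import CatBoostClassifier

# same model as above, but on one-hot encoded features, so no cat_features argument
model = CatBoostClassifier(custom_loss=['Accuracy'], random_seed=42, logging_level='Silent')
model.fit(X, y)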

Time series approach

TODO

anebz commented 3 years ago

Added some notes about gradient boosting, XGBoost and LightGBM

anebz commented 3 years ago

Added the experiments documentation to the wiki