Henrike-Schwenn / Predicting_bike_rental_demand

My first AI project, part of my take on the amazing online course "Introduction to Machine Learning for Coders" taught by Jeremy Howard. I will be contributing to the Kaggle competition "Bike Sharing Demand", aiming to predict bike rental demand depending on the weather.
Train Set #23

Closed Henrike-Schwenn closed 2 years ago

Henrike-Schwenn commented 3 years ago

To Do

Henrike-Schwenn commented 2 years ago

Document efficiency criteria

Runtime

Use Python's time module: the time.process_time_ns() method
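A minimal sketch of how the runtime could be measured with time.process_time_ns(). The helper name cpu_time_ns and the summed range are illustrative assumptions, not part of the project code.

```python
import time


def cpu_time_ns(func, *args, **kwargs):
    """Run func and return (result, CPU time in nanoseconds used by this process)."""
    # process_time_ns() has an undefined reference point, so only the
    # difference between two readings is meaningful.
    start = time.process_time_ns()
    result = func(*args, **kwargs)
    end = time.process_time_ns()
    return result, end - start


# Hypothetical workload standing in for fit() and predict()
total, runtime_ns = cpu_time_ns(sum, range(1_000_000))
print(f"CPU time: {runtime_ns} ns ({runtime_ns / 1e9:.3f} s)")
```

Note that process_time_ns() counts CPU time only, so time spent sleeping or waiting for disk is not included. For wall-clock time, time.perf_counter_ns() is the usual choice.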

Required computing capacity

How to get current CPU and RAM usage in Python?

Sample size

Henrike-Schwenn commented 2 years ago

Random Forest Regressor

Function

Fit method

My vague idea of what fit does:

Fitting and predicting: estimator basics

Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. Each estimator can be fitted to some data using its fit method.

Here is a simple example where we fit a RandomForestClassifier to some very basic data:

>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier(random_state=0)
>>> X = [[ 1,  2,  3],  # 2 samples, 3 features
...      [11, 12, 13]]
>>> y = [0, 1]  # classes of each sample
>>> clf.fit(X, y)
RandomForestClassifier(random_state=0)

The fit method generally accepts 2 inputs:

The samples matrix (or design matrix) X. The size of X is typically (n_samples, n_features), which means that samples are represented as rows and features are represented as columns.

The target values y, which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, y does not need to be specified. y is usually a 1d array where the i-th entry corresponds to the target of the i-th sample (row) of X.

Both X and y are usually expected to be numpy arrays or equivalent array-like data types, though some estimators work with other formats such as sparse matrices.

Once the estimator is fitted, it can be used for predicting target values of new data. You don’t need to re-train the estimator:

>>> clf.predict(X)  # predict classes of the training data
array([0, 1])
>>> clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
array([0, 1])
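The quoted example uses a classifier, while this project needs a regressor. The same fit/predict pattern might look like this for RandomForestRegressor. The data is made up for illustration and the column meanings are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up data: 200 samples, 2 features (think "temp" and "humidity")
rng = np.random.default_rng(0)
X = rng.uniform(0, 40, size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=200)  # real-valued target

reg = RandomForestRegressor(n_estimators=20, random_state=0)
reg.fit(X[:150], y[:150])     # fit on the first 150 rows
pred = reg.predict(X[150:])   # predict the remaining 50 rows

print(pred.shape)
```

The prediction is a 1d array with one value per row of the input, which is the shape a metric function expects for y_pred.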

print_score( )

An Introduction to Random Forest using the fastai Library (Machine Learning for Programmers – Part 1)

In order to compare the score against the train and test sets, the below function returns the RMSE value and score for both datasets.

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)
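print_score depends on an rmse helper and on the global variables X_train, y_train, X_valid and y_valid, so it cannot be called on its own. A self-contained sketch could look like the following. The name score_report, the explicit arguments and the toy data are assumptions made for this example.

```python
import math

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rmse(predictions, targets):
    """Root mean squared error between two 1d sequences."""
    diff = np.asarray(predictions) - np.asarray(targets)
    return math.sqrt((diff ** 2).mean())


def score_report(m, X_train, y_train, X_valid, y_valid):
    """Return [train RMSE, valid RMSE, train R^2, valid R^2, (OOB score)]."""
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, "oob_score_"):  # only set when oob_score=True was requested
        res.append(m.oob_score_)
    return res


# Toy data for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 40, size=(300, 2))
y = np.log1p(5 * X[:, 0] + rng.uniform(0, 10, size=300))
X_train, X_valid = X[:200], X[200:]
y_train, y_valid = y[:200], y[200:]

model = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=0)
model.fit(X_train, y_train)
report = score_report(model, X_train, y_train, X_valid, y_valid)
print(report)
```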

Henrike-Schwenn commented 2 years ago

Calculate score

AttributeError: module 'fastai' has no attribute 'print_score'

It seems like print_score is outdated.

What do I need?

MSLE
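For reference, the mean squared logarithmic error as defined in the scikit-learn documentation for mean_squared_log_error, with the root taken for the competition metric. The symbols n, y_i and \hat{y}_i are the usual notation for sample count, true value and predicted value.

```latex
\mathrm{MSLE}(y, \hat{y}) = \frac{1}{n} \sum_{i=0}^{n-1}
    \bigl( \ln(1 + y_i) - \ln(1 + \hat{y}_i) \bigr)^2
\qquad
\mathrm{RMSLE}(y, \hat{y}) = \sqrt{\mathrm{MSLE}(y, \hat{y})}
```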

RF1.fit(X_train, Y_train)
x1=RF1.predict(X_validate)
print(x1)

[6.42740482 6.42740482 6.42740482 ... 6.42814087 6.42814087 6.42814087]

My first AI ever!!!

Flowchart_Prediction.odg

RF1 = sklearn.ensemble.RandomForestRegressor()
# Run fit method on training set
RF1.fit(X_train, Y_train)
# Predict y variable in validation set.
x1=RF1.predict(X_validate)
x2=Y_validate
score=math.sqrt(sklearn.metrics.mean_squared_log_error(x2, x1))
print(score)

ValueError: y_true and y_pred have different number of output (2!=1)

But they both do have the same number of elements, as they are supposed to:

print(x1)
[6.43182586 6.43182586 6.43182586 ... 6.43283502 6.43283502 6.43283502]
print(len(x1))
5443
print(len(x2))
5443
Henrike-Schwenn commented 2 years ago

Fix value error

Takeaway message: Look at your data!

Possible solutions

y_true=Y_validate.to_numpy()
type(y_true)
<class 'numpy.ndarray'>

Conversion has worked, but it has no effect on the error.

x1 and x2 are different types

type(x1)
<class 'numpy.ndarray'>
type(x2)
<class 'pandas.core.frame.DataFrame'>

What the code outputs

Description

I'm trying to run sklearn.metrics.mean_squared_log_error on two arrays y_true and y_pred. y_true is the column I removed from the dataframe beforehand, y_pred is the column I predicted using sklearn.ensemble.RandomForestRegressor().

# Read csv formatted validateFirstCycle and define as DataFrame X_validate. This is the validation set.
X_validate = pandas.read_csv(
    "C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/validateFirstCycle.csv")
# Read y_name_validation.csv and define as Y_validate. This is the dependent variable of the validation set, the column to be predicted.
Y_validate = pandas.read_csv(
    "C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/y_name_validation.csv")

# Create sklearn.ensemble.RandomForestRegressor() object
RF1 = sklearn.ensemble.RandomForestRegressor()
# Run fit method on training set
RF1.fit(X_train, Y_train)
# Visualize RF1.fit using export_graphviz
# Extract a single tree from the forest
estimator=RF1.estimators_[5]
RF1_sample=sklearn.tree.export_graphviz(estimator,
                                 out_file="C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie"
                                                 "/Predicting_Bike_Rental_Demand/FirstCycle/RF1.dot")

# Predict y variable in validation set.
y_pred=RF1.predict(X_validate)
y_true=Y_validate.to_numpy()

score=sklearn.metrics.mean_squared_log_error(y_true, y_pred)
print(score)

Running the code produces the following error message:

ValueError: y_true and y_pred have different number of output (2!=1)

According to this page, Fix Exception, this error is caused by y_true and y_pred having a different number of outputs. I made sure both are of the same data type, numpy.ndarray, and have the same number of elements.

Actually looking at the arrays has made it clear: y_true has two columns (an index column and rent_count), whereas y_pred is one-dimensional. That is what "2!=1" in the error message refers to.

print(y_true)
[[5.44300000e+03 4.26267988e+00]
 [5.44400000e+03 4.18965474e+00]
 [5.44500000e+03 3.36729583e+00]
 ...
 [1.08830000e+04 5.12396398e+00]
 [1.08840000e+04 4.85981240e+00]
 [1.08850000e+04 4.47733681e+00]]
print(y_pred)
[6.4309396  6.4309396  6.4309396  ... 6.43084709 6.43084709 6.43084709]

However, they both contain float64 numbers.

y_true.dtype
dtype('float64')
y_pred.dtype
dtype('float64')

The questions are:

y_true is Y_validate turned into an array. Y_validate is the column "rent_count" extracted from the dataframe dfFirstCycle_validate and saved into an array.

print(Y_validate)
      Unnamed: 0  rent_count
0           5443    4.262680
1           5444    4.189655
2           5445    3.367296
3           5446    3.663562
4           5447    2.484907
          ...         ...
5438       10881    5.817111
5439       10882    5.484797
5440       10883    5.123964
5441       10884    4.859812
5442       10885    4.477337
[5443 rows x 2 columns]
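The shape mismatch can be reproduced in a few lines. This is a sketch using the first three rows printed above; the inline CSV text and the y_pred values are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# A CSV saved with its index has an unnamed first column,
# which pandas reads back as "Unnamed: 0".
csv_text = ",rent_count\n5443,4.262680\n5444,4.189655\n5445,3.367296\n"

Y_validate = pd.read_csv(io.StringIO(csv_text))
y_pred = np.array([4.1, 4.2, 3.5])  # hypothetical predictions

# Converting the whole DataFrame keeps both columns: 2 outputs vs. 1
y_true_wrong = Y_validate.to_numpy()
print(y_true_wrong.shape, y_pred.shape)

# Fix 1: select only the target column before converting
y_true = Y_validate["rent_count"].to_numpy()

# Fix 2: tell read_csv that the first column is the index
Y_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(y_true.shape, list(Y_indexed.columns))
```

len() only counts rows, which is why both arrays appeared to have the same number of elements. Comparing .shape shows the extra column immediately.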
Henrike-Schwenn commented 2 years ago

Visualization

Visualizing data:

Visualizing RF decision tree:
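One way to visualize a single tree of the forest is sklearn.tree.export_graphviz with out_file=None, which returns the DOT source as a string instead of writing a file. The toy data, the feature names and max_depth=3 are assumptions for this sketch.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

# Toy regression data standing in for the bike rental features
X, y = make_regression(n_samples=100, n_features=4, random_state=0)
feature_names = ["season", "windspeed", "humidity", "temp"]

forest = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# Extract a single tree from the forest and export it as DOT text
dot_text = export_graphviz(
    forest.estimators_[5],
    out_file=None,               # return a string instead of writing a file
    feature_names=feature_names,
    filled=True,
)
print(dot_text[:60])
```

The resulting text can be rendered with Graphviz, for example by saving it to a .dot file and running the dot command line tool on it.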

Henrike-Schwenn commented 2 years ago

Visualize X_train and Y_train

Plot needs to display multiple x columns

What might work:
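One option is a row of subplots, one per x column, all sharing the rent_count axis. A minimal sketch with made-up data, assuming the column names used elsewhere in this project:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for X_train
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "season": rng.integers(1, 5, size=100),
    "windspeed": rng.uniform(0, 40, size=100),
    "humidity": rng.uniform(20, 100, size=100),
    "temp": rng.uniform(0, 35, size=100),
    "rent_count": rng.uniform(0, 7, size=100),
})

features = ["season", "windspeed", "humidity", "temp"]
fig, axes = plt.subplots(1, len(features), figsize=(16, 4), sharey=True)
for ax, column in zip(axes, features):
    # Draw each feature against the target on its own axis
    df.plot.scatter(x=column, y="rent_count", ax=ax, title=column)
fig.tight_layout()
```

Passing ax= to DataFrame.plot.scatter draws into an existing axis instead of opening a new figure for every column.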

Henrike-Schwenn commented 2 years ago

Create Dashboard

Henrike-Schwenn commented 2 years ago

Return parameters

Syntax error

  File "<input>", line 78
    print("Score RMSLE:", scoreRMSLE)
    ^
SyntaxError: invalid syntax
example="ggdr"
print("Example:", example)
Example: ggdr

But there's apparently nothing wrong with the syntax of the print command.

# Evaluate prediction using root mean squared log error
scoreRMSLE=math.sqrt(sklearn.metrics.mean_squared_log_error(y_true, y_pred))
Backend TkAgg is interactive backend. Turning interactive mode on.
print("Score RMSLE:", scoreRMSLE)
Score RMSLE: 0.35089926356805246

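My first guess was that, in script mode, Python runs the print commands before everything else, so they don't work because the variables haven't been defined yet. That cannot be the cause, though: a SyntaxError is raised while the code is parsed, before any line is executed.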

Henrike-Schwenn commented 2 years ago

Try out code selections

# Train Random Forest Regressor on dataframe trainingFirstCycle
import os
import psutil
import math
import time
import matplotlib.pyplot
import pandas
import sklearn
import fastai
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.tree import export_graphviz
import plotly.express
import feather
import unittest
# Get the sum of the system and user CPU time of the current process in nanoseconds
# The reference point of this clock is undefined; only the difference between two readings is meaningful
# First reading, at the beginning of the process
start = time.process_time_ns()
# Read feather formatted trainingFirstCycle and define as array X_train. This is the training set
X_train = pandas.read_feather(
    "C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/trainingFirstCycle.feather")
# Define X_train.rent_count as dependent variable Y_train. This is the dependent variable of the training set, the column to be predicted.
Y_train = X_train.rent_count
# Visualize X_train and Y_train in a scatterplot
# Some examples
Season=X_train.plot.scatter(x="season", y="rent_count")
Windspeed=X_train.plot.scatter(x="windspeed", y="rent_count")
Humidity=X_train.plot.scatter(x="humidity", y="rent_count")
Temperature=X_train.plot.scatter(x="temp", y="rent_count")
# Read csv formatted validateFirstCycle and define as DataFrame X_validate. This is the validation set.
X_validate = pandas.read_csv(
    "C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/validateFirstCycle.csv")
# Read y_name_validation.csv and define as Y_validate. This is the dependent variable of the validation set, the column to be predicted.
Y_validate = pandas.read_csv(
    "C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/y_name_validation.csv")
# Create sklearn.ensemble.RandomForestRegressor() object
RF1 = sklearn.ensemble.RandomForestRegressor()
# Run fit method on training set
RF1.fit(X_train, Y_train)
# Visualize RF1.fit using export_graphviz
# https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
# Extract a single tree from the forest
estimator=RF1.estimators_[5]
RF1_sample=sklearn.tree.export_graphviz(estimator,
                                        out_file="C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie"
                                                 "/Predicting_Bike_Rental_Demand/FirstCycle/RF1.dot")
# Predict y variable in validation set.
y_pred=RF1.predict(X_validate)
# TODO Visualize prediction?
y_true=Y_validate.rent_count.to_numpy()
# Evaluate prediction using root mean squared log error
scoreRMSLE=math.sqrt(sklearn.metrics.mean_squared_log_error(y_true, y_pred))
# TODO Visualize score
# DataFrame.plot.scatter()
# matplotlib.scatter(x,y)
# Return further parameters
# Second CPU time reading in nanoseconds, at the end of the process
end = time.process_time_ns()
runtime = end-start
#Code runs fine up to this point.
# Return further parameters
# Get CPU usage
# How much cpu is used in total numbers
cpu_load = psutil.getloadavg()
# How many percent of the cpu are being used. Multiply by 100 in order to get a percent instead of a decimal.
cpu_usage = (cpu_load(os.cpu_count()) * 100
             # Nanoseconds since the epoch
             # At the end of the process
             end = time.process_time_ns()
runtime = end-start
  File "<input>", line 75
    end = time.process_time_ns()
    ^
SyntaxError: invalid syntax
# Get CPU usage
# How much cpu is used in total numbers
#cpu_load = psutil.getloadavg()
# How many percent of the cpu are being used. Multiply by 100 in order to get a percent instead of a decimal.
#cpu_usage = (cpu_load(os.cpu_count()) * 100
# Second CPU time reading in nanoseconds, at the end of the process
end = time.process_time_ns()
runtime = end-start
print(runtime)
6812500000

Putting the CPU usage statements before end = time.process_time_ns() produces a syntax error.

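In fact, it's the cpu_usage = cpu_load(os.cpu_count()) * 100 statement that seems to cause a runtime error or is just unbearably slow. psutil.getloadavg() returns a tuple of three load averages, so cpu_load cannot be called like a function. The statement doesn't work as written, so let's remove it.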
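A sketch of what the CPU statement was presumably meant to compute, following the pattern from the psutil documentation: divide the load average by the number of CPUs and multiply by 100. The variable names are assumptions.

```python
import os

import psutil

# getloadavg() returns a tuple: load over the last 1, 5 and 15 minutes
load_1, load_5, load_15 = psutil.getloadavg()

# os.cpu_count() can return None, so fall back to 1
cpu_count = os.cpu_count() or 1

# Load per CPU as a percentage
cpu_usage = load_1 / cpu_count * 100

# virtual_memory().percent is the same value as virtual_memory()[2]
ram_percent = psutil.virtual_memory().percent

print("CPU load (1 min):", round(cpu_usage, 1), "%")
print("RAM memory used:", ram_percent, "%")
```

On Windows, psutil emulates the load average and the first call returns zeros until a few seconds have passed, so psutil.cpu_percent(interval=1) may be the more practical reading there.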

    print("RAM memory used:", psutil.virtual_memory()[2]), "%")
                                                              ^
SyntaxError: unmatched ')'

Removing the CPU statements has solved the print("Score RMSLE:", scoreRMSLE) syntax error. Apparently, (cpu_load(os.cpu_count()) * 100 was missing a closing bracket, so all of the following code was parsed as part of this statement.
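The effect can be reproduced without running anything by handing two small sources to the built-in compile(). The source strings below are simplified stand-ins for the real script.

```python
# The unclosed "(" makes the parser treat the next line as part of the expression
broken = "cpu_usage = (load * 100\nprint('Score RMSLE:', 0.35)\n"
fixed = "cpu_usage = (load * 100)\nprint('Score RMSLE:', 0.35)\n"


def syntax_error_of(source):
    """Return the SyntaxError raised while parsing source, or None if it parses."""
    try:
        compile(source, "<example>", "exec")  # parses only, executes nothing
    except SyntaxError as err:
        return err
    return None


broken_error = syntax_error_of(broken)
fixed_error = syntax_error_of(fixed)
print("broken:", broken_error)
print("fixed:", fixed_error)
```

Older Python versions report "invalid syntax" on the line after the unclosed bracket, which is why the error pointed at the innocent print statement. Newer versions report that the bracket was never closed and point at the line that opened it.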

Syntax error unmatched ')' fixed by deleting a spare ).