Train Set - Githubissues

Henrike-Schwenn commented 3 years ago

To Do

[x] Create training file
[x] Make a list of parameters to be measured
[x] Also save validation set into file
[x] Read files and define variables
[x] Set up object RandomForestRegressor( ) and its methods fit( ), and print_score( )
[x] Set up an alternative for print_score
[x] Fix value error
[x] Fix output problem: Execute print statements in console/move them to their respective definitions?
[x] Check: After which line does the execution begin to slow down?
[x] Return parameters
[x] Select presentation format for GitHub
[x] Move all visualisation statements to console?
[x] Sketch out how the functions work: RandomForestRegressor( ), fit( ), and print_score( )

Henrike-Schwenn commented 2 years ago

Document efficiency criteria

Runtime

Use the time module in Python: time.process_time_ns() function

Return -- Runtime
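
A minimal sketch of how the runtime measurement could look. The function busy_work is a hypothetical stand-in for the training step and is not part of the project code.

```python
import time


def busy_work(n):
    """Hypothetical stand-in for the training step: sums the first n squares."""
    return sum(i * i for i in range(n))


# process_time_ns() returns the CPU time (system + user) of the current
# process in nanoseconds. Its reference point is undefined, so only the
# difference between two readings is meaningful.
start = time.process_time_ns()
result = busy_work(200_000)
end = time.process_time_ns()

runtime_ns = end - start
runtime_s = runtime_ns / 1e9  # convert nanoseconds to seconds
print("Runtime:", runtime_ns, "ns =", runtime_s, "s")
```

Because process_time_ns() counts CPU time only, time spent sleeping or waiting for the disk is not included. For wall-clock time, time.perf_counter_ns() would be the counterpart.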

Required computing capacity

How to get current CPU and RAM usage in Python?
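
A possible way to read CPU and RAM usage with psutil, as a sketch only. The sampling interval of 0.1 seconds and the variable names are assumptions chosen for illustration.

```python
import os

import psutil

# CPU utilisation of the whole system in percent, sampled over 0.1 s.
cpu_percent = psutil.cpu_percent(interval=0.1)

# Share of physical RAM currently in use, in percent.
ram_percent = psutil.virtual_memory().percent

# Resident memory of the current Python process, in megabytes.
process = psutil.Process(os.getpid())
rss_mb = process.memory_info().rss / (1024 * 1024)

print("CPU used:", cpu_percent, "%")
print("RAM used:", ram_percent, "%")
print("Memory of this process:", round(rss_mb, 1), "MB")
```

The first two values describe the whole machine, the third only this process, which is probably the more useful figure when comparing training runs.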

Sample size

Henrike-Schwenn commented 2 years ago

Random Forest Regressor

Function

sklearn.ensemble.RandomForestRegressor

Fit method

My vague idea of what fit does:

It's an ML algorithm that searches for patterns in a data set
Searches for correlations between independent and dependent variables

Fitting and predicting: estimator basics

Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. Each estimator can be fitted to some data using its fit method.

Here is a simple example where we fit a RandomForestClassifier to some very basic data:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]  # classes of each sample
clf.fit(X, y)
RandomForestClassifier(random_state=0)

The fit method generally accepts 2 inputs:
The samples matrix (or design matrix) X. The size of X is typically (n_samples, n_features), which means that samples are represented as rows and features are represented as columns.

The target values y, which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, y does not need to be specified. y is usually a 1d array where the i-th entry corresponds to the target of the i-th sample (row) of X.
Both X and y are usually expected to be numpy arrays or equivalent array-like data types, though some estimators work with other formats such as sparse matrices.

Once the estimator is fitted, it can be used for predicting target values of new data. You don’t need to re-train the estimator:

clf.predict(X)  # predict classes of the training data
array([0, 1])
clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
array([0, 1])

print_score( )

An Introduction to Random Forest using the fastai Library (Machine Learning for Programmers – Part 1)

In order to compare the score against the train and test sets, the below function returns the RMSE value and score for both datasets.

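# rmse() is assumed to be a helper defined elsewhere in the article
def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    # oob_score_ only exists if the model was created with oob_score=True
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)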

Henrike-Schwenn commented 2 years ago

Calculate score

AttributeError: module 'fastai' has no attribute 'print_score'

It seems like print_score is outdated.

What do I need?

Access X_validate and Y_validate.
Take the results of RF1.fit and use them to predict y for X_validate.
Calculate RMSLE between predicted y and actual y.
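
The three steps could be sketched as follows. The data is synthetic and merely stands in for the bike rental set; the split sizes, n_estimators and random_state are assumptions made for the sake of a small, reproducible example.

```python
import math

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error

# Synthetic, non-negative data standing in for the real data set (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] * 2 + X[:, 1] + rng.uniform(0, 1, size=200)

# Step 1: keep a validation part that the model never sees during fit.
X_train, X_validate = X[:150], X[150:]
Y_train, Y_validate = y[:150], y[150:]

# Step 2: fit on the training part, predict y for the validation part.
RF1 = RandomForestRegressor(n_estimators=20, random_state=0)
RF1.fit(X_train, Y_train)
y_pred = RF1.predict(X_validate)

# Step 3: RMSLE is the square root of the mean squared log error.
score = math.sqrt(mean_squared_log_error(Y_validate, y_pred))

# The same value computed by hand, to show what the metric does.
manual = math.sqrt(np.mean((np.log1p(Y_validate) - np.log1p(y_pred)) ** 2))
print(score, manual)
```

mean_squared_log_error only accepts non-negative values, which is why the synthetic targets are kept at or above zero.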

MSLE

RF1.fit(X_train, Y_train)
x1=RF1.predict(X_validate)
print(x1)

[6.42740482 6.42740482 6.42740482 ... 6.42814087 6.42814087 6.42814087]

My first AI ever!!!

Flowchart_Prediction.odg

RF1 = sklearn.ensemble.RandomForestRegressor()
# Run fit method on training set
RF1.fit(X_train, Y_train)
# Predict y variable in validation set.
x1=RF1.predict(X_validate)
x2=Y_validate
score=math.sqrt(sklearn.metrics.mean_squared_log_error(x2, x1))
print(score)

ValueError: y_true and y_pred have different number of output (2!=1)

But they both do have the same number of elements, as they are supposed to:

print(x1)
[6.43182586 6.43182586 6.43182586 ... 6.43283502 6.43283502 6.43283502]
print(len(x1))
5443
print(len(x2))
5443

Henrike-Schwenn commented 2 years ago

Fix value error

Takeaway message: Look at your data!

Possible solutions

Fix number labels
Fix Exception
y_true: Convert dataframe to array (How to Convert Pandas DataFrame to NumPy Array)

y_true=Y_validate.to_numpy()
type(y_true)
<class 'numpy.ndarray'>

The conversion has worked, but it has no effect on the error.

x1 and x2 are different types

type(x1)
<class 'numpy.ndarray'>
type(x2)
<class 'pandas.core.frame.DataFrame'>

[x] Check definition of "output"

What a piece of code gives out

Description

I'm trying to run sklearn.metrics.mean_squared_log_error on two arrays y_true and y_pred. y_true is the column I removed from the dataframe beforehand, y_pred is the column I predicted using sklearn.ensemble.RandomForestRegressor().

# Read csv formatted validateFirstCycle and define as dataframe X_validate. This is the validation set.
X_validate = pandas.read_csv(
    "C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/validateFirstCycle.csv")
# Read y_name_validation.csv and define as Y_validate. This is the dependent variable of the validation set, the column to be predicted.
Y_validate = pandas.read_csv(
    "C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/y_name_validation.csv")

# Create sklearn.ensemble.RandomForestRegressor() object
RF1 = sklearn.ensemble.RandomForestRegressor()
# Run fit method on training set
RF1.fit(X_train, Y_train)
# Visualize RF1.fit using export_graphviz
# Extract a single tree from the forest
estimator=RF1.estimators_[5]
RF1_sample=sklearn.tree.export_graphviz(estimator,
                                 out_file="C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie"
                                                 "/Predicting_Bike_Rental_Demand/FirstCycle/RF1.dot")

# Predict y variable in validation set.
y_pred=RF1.predict(X_validate)
y_true=Y_validate.to_numpy()

score=sklearn.metrics.mean_squared_log_error(y_true, y_pred)
print(score)

Running the code produces the following error message:

ValueError: y_true and y_pred have different number of output (2!=1)

According to this page, Fix Exception, this error is caused by y_true and y_pred having a different number of labels. I made sure both are of the same data type, numpy.ndarray, and have the same number of elements.

Actually looking at the arrays has made it clear: y_true has two values per row, whereas y_pred has only one.

print(y_true)
[[5.44300000e+03 4.26267988e+00]
 [5.44400000e+03 4.18965474e+00]
 [5.44500000e+03 3.36729583e+00]
 ...
 [1.08830000e+04 5.12396398e+00]
 [1.08840000e+04 4.85981240e+00]
 [1.08850000e+04 4.47733681e+00]]
print(y_pred)
[6.4309396  6.4309396  6.4309396  ... 6.43084709 6.43084709 6.43084709]

However, they both contain float64 numbers.

y_true.dtype
dtype('float64')
y_pred.dtype
dtype('float64')
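
The dtype and len() agree, but the shape does not: len() only counts rows. A small sketch with made-up values that mimic the two arrays:

```python
import numpy as np

# Made-up stand-ins: the predictions are one-dimensional, whereas the
# "true" values carry an extra id column next to the target.
y_pred = np.array([6.43, 6.43, 6.43])
y_true = np.array([[5443, 4.26], [5444, 4.19], [5445, 3.37]])

print(len(y_pred), len(y_true))    # both report the number of rows
print(y_pred.shape, y_true.shape)  # the shapes reveal the extra column

# Selecting only the target column makes the shapes agree.
y_true_fixed = y_true[:, 1]
print(y_true_fixed.shape)
```

Comparing .shape instead of len() would have pointed at the extra column straight away, which matches the "2 != 1" in the error message.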

The questions are:

What are these: 5.44300000e+03?
Where do they come from?: In addition to the actual elements of rent_count, Y_validate also contains the id numbers from the original dataframe dfFirstCycle.

How do I get rid of them?:

y_true=Y_validate.rent_count.to_numpy()
print(y_true)
[4.26267988 4.18965474 3.36729583 ... 5.12396398 4.8598124  4.47733681]

y_true is Y_validate turned into an array. Y_validate is the column "rent_count" extracted from the dataframe dfFirstCycle_validate and saved into an array.

print(Y_validate)
      Unnamed: 0  rent_count
0           5443    4.262680
1           5444    4.189655
2           5445    3.367296
3           5446    3.663562
4           5447    2.484907
          ...         ...
5438       10881    5.817111
5439       10882    5.484797
5440       10883    5.123964
5441       10884    4.859812
5442       10885    4.477337
[5443 rows x 2 columns]
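
A column called "Unnamed: 0" typically appears when a dataframe is written with to_csv() including its index and then read back without saying so. Whether that is how y_name_validation.csv was produced is an assumption; the sketch below only reproduces the effect with three made-up rows and shows two ways around it.

```python
import io

import pandas as pd

df = pd.DataFrame({"rent_count": [4.262680, 4.189655, 3.367296]},
                  index=[5443, 5444, 5445])

# to_csv() writes the index as a first column without a header by default.
csv_text = df.to_csv()

# Reading it back naively turns that index into a column "Unnamed: 0".
naive = pd.read_csv(io.StringIO(csv_text))
print(list(naive.columns))

# Option 1: tell read_csv that the first column is the index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(with_index.columns))

# Option 2: do not write the index in the first place.
no_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(list(no_index.columns))
```

Either option leaves rent_count as the only data column, so converting the frame to an array yields one value per row.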

Henrike-Schwenn commented 2 years ago

Visualization

Visualizing data:

Visualizing RF decision tree:

How to Visualize a Decision Tree from a Random Forest in Python using Scikit-Learn
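
A rough sketch of the idea: pick one tree out of the fitted forest and export it in Graphviz DOT format. The data, the feature names feature_a and feature_b, and the model settings are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

# Synthetic data with two features (assumption, not the project data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = X[:, 0] + 0.5 * X[:, 1]

forest = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# estimators_ holds the individual fitted trees; pick one of them.
tree = forest.estimators_[5]

# With out_file=None the DOT source is returned as a string instead of
# being written to disk.
dot_source = export_graphviz(tree, out_file=None,
                             feature_names=["feature_a", "feature_b"],
                             filled=True, rounded=True)
print(dot_source[:60])
```

When out_file is a path, export_graphviz writes the file and returns None. The resulting .dot file can then be rendered with the Graphviz command line tool, for example dot -Tpng RF1.dot -o RF1.png.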

Henrike-Schwenn commented 2 years ago

Visualize X_train and Y_train

Plot needs to display multiple x columns

What might work:
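
One option, sketched under assumptions: draw one scatter plot per x column in a grid of subplots that share the y axis. The rows below are made up and only stand in for X_train; the Agg backend and the in-memory buffer are chosen so the example runs without opening a window or writing a file.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for X_train (assumption).
df = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "windspeed": [0.0, 6.0, 12.0, 19.0],
    "humidity": [81, 75, 60, 44],
    "temp": [9.8, 14.8, 22.1, 28.7],
    "rent_count": [2.8, 3.7, 4.9, 5.6],
})

features = ["season", "windspeed", "humidity", "temp"]

# One subplot per feature, all sharing the rent_count axis.
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True)
for ax, feature in zip(axes.flat, features):
    df.plot.scatter(x=feature, y="rent_count", ax=ax)
    ax.set_title(feature)
fig.tight_layout()

# Render to an in-memory PNG; fig.savefig("overview.png") would write a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```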

Henrike-Schwenn commented 2 years ago

Create Dashboard

Library: Dash. Builds HTML dashboards.
This is How I Create Dazzling Dashboards Purely in Python

Henrike-Schwenn commented 2 years ago

Return parameters

Syntax error

  File "<input>", line 78
    print("Score RMSLE:", scoreRMSLE)
    ^
SyntaxError: invalid syntax

example="ggdr"
print("Example:", example)
Example: ggdr

But there's apparently nothing wrong with the syntax of the print command.

# Evaluate prediction using root mean squared log error
scoreRMSLE=math.sqrt(sklearn.metrics.mean_squared_log_error(y_true, y_pred))
Backend TkAgg is interactive backend. Turning interactive mode on.
print("Score RMSLE:", scoreRMSLE)
Score RMSLE: 0.35089926356805246

Apparently, in script mode, Python wants to run the print commands before everything else, so they don't work, because the variables haven't been defined yet.

Henrike-Schwenn commented 2 years ago

Try out code selections

# Train Random Forest Regressor on dataframe trainingFirstCycle
import os
import psutil
import math
import time
import matplotlib.pyplot
import pandas
import sklearn
import fastai
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.tree import export_graphviz
import plotly.express
import feather
import unittest
# Get the sum of the system and user CPU time of the current process in nanoseconds
# The reference point of the returned value is undefined, so only the difference between two calls is meaningful
# At the beginning of the process
start = time.process_time_ns()
# Read feather formatted trainingFirstCycle and define as array X_train. This is the training set
X_train = pandas.read_feather(
    "C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/trainingFirstCycle.feather")
# Define X_train.rent_count as dependent variable Y_train. This is the dependent variable of the training set, the column to be predicted.
Y_train = X_train.rent_count
# Visualize X_train and Y_train in a scatterplot
# Some examples
Season=X_train.plot.scatter(x="season", y="rent_count")
Windspeed=X_train.plot.scatter(x="windspeed", y="rent_count")
Humidity=X_train.plot.scatter(x="humidity", y="rent_count")
Temperature=X_train.plot.scatter(x="temp", y="rent_count")
# Read csv formatted validateFirstCycle and define as dataframe X_validate. This is the validation set.
X_validate = pandas.read_csv(
    "C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/validateFirstCycle.csv")
# Read y_name_validation.csv and define as Y_validate. This is the dependent variable of the validation set, the column to be predicted.
Y_validate = pandas.read_csv(
    "C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/y_name_validation.csv")
# Create sklearn.ensemble.RandomForestRegressor() object
RF1 = sklearn.ensemble.RandomForestRegressor()
# Run fit method on training set
RF1.fit(X_train, Y_train)
# Visualize RF1.fit using export_graphviz
# https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
# Extract a single tree from the forest
estimator=RF1.estimators_[5]
RF1_sample=sklearn.tree.export_graphviz(estimator,
                                        out_file="C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie"
                                                 "/Predicting_Bike_Rental_Demand/FirstCycle/RF1.dot")
# Predict y variable in validation set.
y_pred=RF1.predict(X_validate)
# TODO Visualize prediction?
y_true=Y_validate.rent_count.to_numpy()
# Evaluate prediction using root mean squared log error
scoreRMSLE=math.sqrt(sklearn.metrics.mean_squared_log_error(y_true, y_pred))
# TODO Visualize score
# DataFrame.plot.scatter()
# matplotlib.scatter(x,y)
# Return further parameters
# CPU time of the current process in nanoseconds
# At the end of the process
end = time.process_time_ns()
runtime = end-start
#Code runs fine up to this point.

# Return further parameters
# Get CPU usage
# How much cpu is used in total numbers
cpu_load = psutil.getloadavg()
# How many percent of the cpu are being used. Multiply by 100 in order to get a percent instead of a decimal.
cpu_usage = (cpu_load(os.cpu_count()) * 100
             # Nanoseconds since the epoch
             # At the end of the process
             end = time.process_time_ns()
runtime = end-start
  File "<input>", line 75
    end = time.process_time_ns()
    ^
SyntaxError: invalid syntax

# Get CPU usage
# How much cpu is used in total numbers
#cpu_load = psutil.getloadavg()
# How many percent of the cpu are being used. Multiply by 100 in order to get a percent instead of a decimal.
#cpu_usage = (cpu_load(os.cpu_count()) * 100
# CPU time of the current process in nanoseconds
# At the end of the process
end = time.process_time_ns()
runtime = end-start
print(runtime)
6812500000

Putting the CPU usage statements before end = time.process_time_ns() produces a syntax error.

In fact, it's the cpu_usage = cpu_load(os.cpu_count()) * 100 that seems to cause a runtime error or is just unbearably slow. Therefore, it's not effective, so let's remove it.
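
A likely reason for the runtime error, sketched with made-up numbers: psutil.getloadavg() returns a tuple of three load averages, and a tuple cannot be called like a function. The values in cpu_load below are hypothetical stand-ins for that return value.

```python
import os

# psutil.getloadavg() returns a tuple of three floats: the average system
# load over the last 1, 5 and 15 minutes. These values are made up.
cpu_load = (1.2, 0.9, 0.7)

# A tuple cannot be called, so cpu_load(...) raises a TypeError.
raised = False
try:
    cpu_load(os.cpu_count())
except TypeError as error:
    raised = True
    print("TypeError:", error)

# Divide each load value by the number of logical CPUs instead and
# multiply by 100 to get a percentage.
cpu_count = os.cpu_count() or 1
cpu_usage = [load / cpu_count * 100 for load in cpu_load]
print(cpu_usage)
```

With the real library, the list comprehension would iterate over psutil.getloadavg() directly.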

    print("RAM memory used:", psutil.virtual_memory()[2]), "%")
                                                              ^
SyntaxError: unmatched ')'

Removing the CPU statements has solved the print("Score RMSLE:", scoreRMSLE) syntax error. Apparently, (cpu_load(os.cpu_count()) * 100 was missing a closing parenthesis, so all the code that followed was parsed as part of this statement.

The unmatched ')' syntax error was fixed by deleting the spare ).
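
The effect can be reproduced in a few lines. This is only a sketch with simplified stand-in statements; the helper syntax_error_of is hypothetical and exists purely for the demonstration.

```python
# Source text with an opening parenthesis that is never closed.
broken = (
    "cpu_usage = (10 * 100\n"
    "end = 5\n"
    "print(end)\n"
)

# The same statements with the parenthesis closed.
fixed = (
    "cpu_usage = (10 * 100)\n"
    "end = 5\n"
    "print(end)\n"
)


def syntax_error_of(source):
    """Return the SyntaxError raised when compiling source, or None."""
    try:
        compile(source, "<input>", "exec")
    except SyntaxError as error:
        return error
    return None


error = syntax_error_of(broken)
print("broken:", error.msg, "(reported at line", str(error.lineno) + ")")
print("fixed:", syntax_error_of(fixed))
```

Inside an open parenthesis, line breaks do not end the statement, so the parser keeps reading the following lines as part of the same expression. Depending on the Python version, the error is reported either on the line with the unclosed parenthesis or on a later, perfectly valid line, which is what made the print statement look faulty.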

Henrike-Schwenn / Predicting_bike_rental_demand

Train Set #23
