Closed Henrike-Schwenn closed 2 years ago
Using the time module in Python: the time.process_time_ns() method
How to get current CPU and RAM usage in Python?
My vague idea of what fit does:
Fitting and predicting: estimator basics
Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. Each estimator can be fitted to some data using its fit method.
Here is a simple example where we fit a RandomForestClassifier to some very basic data:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]  # classes of each sample
clf.fit(X, y)
RandomForestClassifier(random_state=0)
The fit method generally accepts 2 inputs:
The samples matrix (or design matrix) X. The size of X is typically (n_samples, n_features), which means that samples are represented as rows and features are represented as columns.
The target values y, which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, y does not need to be specified. y is usually a 1d array where the i-th entry corresponds to the target of the i-th sample (row) of X.
Both X and y are usually expected to be numpy arrays or equivalent array-like data types, though some estimators work with other formats such as sparse matrices.
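As a quick check of these conventions, the sketch below builds a small X and y with NumPy and inspects their shapes. It reuses the toy numbers from the example above; nothing here comes from the bike rental data.

```python
import numpy as np

# Toy data from the example above: 2 samples, 3 features
X = np.array([[1, 2, 3],
              [11, 12, 13]])
y = np.array([0, 1])  # one target per sample

print(X.shape)  # (n_samples, n_features)
print(y.shape)  # (n_samples,)

# The i-th entry of y belongs to the i-th row of X,
# so both must have the same first dimension.
assert X.shape[0] == y.shape[0]
```

Checking .shape rather than len() shows the number of columns as well as the number of rows.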
Once the estimator is fitted, it can be used for predicting target values of new data. You don’t need to re-train the estimator:
clf.predict(X)  # predict classes of the training data
array([0, 1])
clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
array([0, 1])
In order to compare the score against the train and test sets, the below function returns the RMSE value and score for both datasets.
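def print_score(m):
    # rmse, X_train, y_train, X_valid and y_valid must already be defined
    res = [rmse(m.predict(X_train), y_train),
           rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train),
           m.score(X_valid, y_valid)]
    # oob_score_ is only set when the forest was fitted with oob_score=True
    if hasattr(m, 'oob_score_'):
        res.append(m.oob_score_)
    print(res)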
AttributeError: module 'fastai' has no attribute 'print_score'
It seems like print_score is outdated.
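As far as I can tell, print_score is not a function of the fastai package at all. It appears to be a helper that is defined by hand, so it has to be defined locally before use. Below is a minimal self-contained sketch of such a helper. The names rmse and score_report, the synthetic data from make_regression and the forest settings are my own assumptions for illustration, not part of the original snippet.

```python
import math

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def rmse(pred, actual):
    """Root mean squared error between two 1-D arrays."""
    return math.sqrt(((pred - actual) ** 2).mean())


def score_report(m, X_train, y_train, X_valid, y_valid):
    """Return [train RMSE, valid RMSE, train R^2, valid R^2, (OOB score)]."""
    res = [rmse(m.predict(X_train), y_train),
           rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train),
           m.score(X_valid, y_valid)]
    # oob_score_ only exists if the forest was built with oob_score=True
    if hasattr(m, "oob_score_"):
        res.append(m.oob_score_)
    return res


# Synthetic stand-in data, not the bike rental set
X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
X_train, X_valid = X[:150], X[150:]
y_train, y_valid = y[:150], y[150:]

m = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=0)
m.fit(X_train, y_train)
report = score_report(m, X_train, y_train, X_valid, y_valid)
print(report)
```

Passing the data sets as arguments instead of relying on global variables makes the helper reusable for other train/validation splits.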
What do I need?
RF1.fit(X_train, Y_train)
x1=RF1.predict(X_validate)
print(x1)
[6.42740482 6.42740482 6.42740482 ... 6.42814087 6.42814087 6.42814087]
My first AI ever!!!
RF1 = sklearn.ensemble.RandomForestRegressor()
# Run fit method on training set
RF1.fit(X_train, Y_train)
# Predict y variable in validation set.
x1=RF1.predict(X_validate)
x2=Y_validate
score=math.sqrt(sklearn.metrics.mean_squared_log_error(x2, x1))
print(score)
ValueError: y_true and y_pred have different number of output (2!=1)
But they both do have the same number of elements, as they are supposed to:
print(x1)
[6.43182586 6.43182586 6.43182586 ... 6.43283502 6.43283502 6.43283502]
print(len(x1))
5443
print(len(x2))
5443
Takeaway message: Look at your data!
Possible solutions
y_true=Y_validate.to_numpy()
type(y_true)
<class 'numpy.ndarray'>
The conversion worked, but it has no effect on the error.
x1 and x2 are different types
type(x1)
<class 'numpy.ndarray'>
type(x2)
<class 'pandas.core.frame.DataFrame'>
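The len() checks earlier only compare the number of rows. A small sketch with made-up values shows how a NumPy array and a pandas DataFrame can have the same length and still differ in shape. The two-column layout of the frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for x1 (predictions) and x2 (validation frame)
x1 = np.array([6.43, 6.43, 6.43])
x2 = pd.DataFrame({"Unnamed: 0": [5443, 5444, 5445],
                   "rent_count": [4.26, 4.19, 3.37]})

print(len(x1), len(x2))    # same number of rows
print(x1.shape, x2.shape)  # but a different number of columns
```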
What the code outputs
I'm trying to run sklearn.metrics.mean_squared_log_error on two arrays y_true and y_pred. y_true is the column I removed from the dataframe beforehand, y_pred is the column I predicted using sklearn.ensemble.RandomForestRegressor().
# Read the CSV file validateFirstCycle.csv into the DataFrame X_validate. This is the validation set.
X_validate = pandas.read_csv(
"C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/validateFirstCycle.csv")
# Read y_name_validation.csv and define as Y_validate. This is the dependent variable of the validation set, the column to be predicted.
Y_validate = pandas.read_csv(
"C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/y_name_validation.csv")
# Create sklearn.ensemble.RandomForestRegressor() object
RF1 = sklearn.ensemble.RandomForestRegressor()
# Run fit method on training set
RF1.fit(X_train, Y_train)
# Visualize RF1.fit using export_graphviz
# Extract a single tree from the forest
estimator=RF1.estimators_[5]
RF1_sample=sklearn.tree.export_graphviz(estimator,
out_file="C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie"
"/Predicting_Bike_Rental_Demand/FirstCycle/RF1.dot")
# Predict y variable in validation set.
y_pred=RF1.predict(X_validate)
y_true=Y_validate.to_numpy()
score=sklearn.metrics.mean_squared_log_error(y_true, y_pred)
print(score)
Running the code produces the following error message:
ValueError: y_true and y_pred have different number of output (2!=1)
According to this page, Fix Exception, this error is caused by y_true and y_pred having a different number of labels. I made sure both are of the same data type, numpy.ndarray, and have the same number of elements.
Actually looking at the arrays has made it clear: y_true has two values per row, whereas y_pred has only one.
print(y_true)
[[5.44300000e+03 4.26267988e+00]
[5.44400000e+03 4.18965474e+00]
[5.44500000e+03 3.36729583e+00]
...
[1.08830000e+04 5.12396398e+00]
[1.08840000e+04 4.85981240e+00]
[1.08850000e+04 4.47733681e+00]]
print(y_pred)
[6.4309396 6.4309396 6.4309396 ... 6.43084709 6.43084709 6.43084709]
However, they both contain float64 numbers.
y_true.dtype
dtype('float64')
y_pred.dtype
dtype('float64')
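The mismatch can be reproduced on a small scale. The sketch below uses made-up values shaped like the arrays above: a two-column y_true and a 1-D y_pred. It catches the resulting ValueError and then scores again with only the target column. Selecting column 1 is an assumption about which column holds the target.

```python
import math

import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up values shaped like the arrays above
y_true_2d = np.array([[5443.0, 4.26], [5444.0, 4.19], [5445.0, 3.37]])
y_pred = np.array([6.43, 6.43, 6.43])

try:
    mean_squared_log_error(y_true_2d, y_pred)
    error = None
except ValueError as exc:
    error = exc
print("Error:", error)

# Keeping only the target column makes both arrays 1-D
y_true = y_true_2d[:, 1]
score = math.sqrt(mean_squared_log_error(y_true, y_pred))
print("RMSLE:", score)
```

So the dtype is not the issue. The number of output columns is what the metric compares.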
The question is: where do values like 5.44300000e+03 come from?
y_true=Y_validate.rent_count.to_numpy()
print(y_true)
[4.26267988 4.18965474 3.36729583 ... 5.12396398 4.8598124 4.47733681]
y_true is Y_validate turned into an array. Y_validate is the column "rent_count" extracted from the dataframe dfFirstCycle_validate and saved into an array.
print(Y_validate)
      Unnamed: 0  rent_count
0           5443    4.262680
1           5444    4.189655
2           5445    3.367296
3           5446    3.663562
4           5447    2.484907
...          ...         ...
5438       10881    5.817111
5439       10882    5.484797
5440       10883    5.123964
5441       10884    4.859812
5442       10885    4.477337
[5443 rows x 2 columns]
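A column called "Unnamed: 0" typically appears when a CSV file was written with the DataFrame index included and then read back without declaring that column as the index. I can't tell from these notes how y_name_validation.csv was written, so the following is only a sketch of that assumed cause. It uses an in-memory buffer and made-up values, and shows two common ways around it.

```python
import io

import pandas as pd

# Hypothetical target frame, standing in for the rent_count column
frame = pd.DataFrame({"rent_count": [4.26, 4.19, 3.37]},
                     index=[5443, 5444, 5445])

# By default, to_csv() writes the index as an unnamed first column
buf = io.StringIO()
frame.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)
print(list(with_index.columns))

# Option 1: read the first column back as the index
buf.seek(0)
as_index = pd.read_csv(buf, index_col=0)
print(list(as_index.columns))

# Option 2: do not write the index in the first place
buf2 = io.StringIO()
frame.to_csv(buf2, index=False)
buf2.seek(0)
no_index = pd.read_csv(buf2)
print(list(no_index.columns))
```

Either option leaves a single rent_count column, so converting it with to_numpy() no longer drags the row numbers along.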
Visualizing data:
Visualizing RF decision tree:
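A minimal sketch of exporting one tree of a forest with export_graphviz, using synthetic data instead of the bike rental set. The feature names f0 to f3 and the forest settings are placeholders I made up. With out_file=None the function returns the DOT source as a string. With a file path, as in my script, it writes the file and returns None.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

# Synthetic stand-in data, not the bike rental set
X, y = make_regression(n_samples=100, n_features=4, random_state=0)
forest = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# Pick one tree; index 5 mirrors the script, any index < n_estimators works
tree = forest.estimators_[5]

# With out_file=None the DOT source is returned as a string
dot_source = export_graphviz(tree, out_file=None,
                             feature_names=["f0", "f1", "f2", "f3"],
                             filled=True)
print(dot_source[:60])
```

Turning the DOT source into an image needs the separate Graphviz tool, for example dot -Tpng RF1.dot -o RF1.png.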
File "<input>", line 78
print("Score RMSLE:", scoreRMSLE)
^
SyntaxError: invalid syntax
example="ggdr"
print("Example:", example)
Example: ggdr
But there's apparently nothing wrong with the syntax of the print command.
# Evaluate prediction using root mean squared log error
scoreRMSLE=math.sqrt(sklearn.metrics.mean_squared_log_error(y_true, y_pred))
Backend TkAgg is interactive backend. Turning interactive mode on.
print("Score RMSLE:", scoreRMSLE)
Score RMSLE: 0.35089926356805246
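My first guess was that, in script mode, Python runs the print commands before everything else, so they don't work because the variables haven't been defined yet. That is not how Python works: statements run in order, and the actual cause was a missing closing bracket on an earlier line.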
# Train Random Forest Regressor on dataframe trainingFirstCycle
import os
import psutil
import math
import time
import matplotlib.pyplot
import pandas
import sklearn
import fastai
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.tree import export_graphviz
import plotly.express
import feather
import unittest
# Get the sum of the system and user CPU time of the current process in nanoseconds.
# The reference point of the returned value is undefined, so only the difference between two calls is meaningful.
# First call: at the beginning of the process
start = time.process_time_ns()
# Read the feather file trainingFirstCycle.feather into the DataFrame X_train. This is the training set.
X_train = pandas.read_feather(
"C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/trainingFirstCycle.feather")
# Define X_train.rent_count as dependent variable Y_train. This is the dependent variable of the training set, the column to be predicted.
Y_train = X_train.rent_count
# Visualize X_train and Y_train in a scatterplot
# Some examples
Season=X_train.plot.scatter(x="season", y="rent_count")
Windspeed=X_train.plot.scatter(x="windspeed", y="rent_count")
Humidity=X_train.plot.scatter(x="humidity", y="rent_count")
Temperature=X_train.plot.scatter(x="temp", y="rent_count")
# Read the CSV file validateFirstCycle.csv into the DataFrame X_validate. This is the validation set.
X_validate = pandas.read_csv(
"C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/validateFirstCycle.csv")
# Read y_name_validation.csv and define as Y_validate. This is the dependent variable of the validation set, the column to be predicted.
Y_validate = pandas.read_csv(
"C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie/Predicting_Bike_Rental_Demand/FirstCycle/y_name_validation.csv")
# Create sklearn.ensemble.RandomForestRegressor() object
RF1 = sklearn.ensemble.RandomForestRegressor()
# Run fit method on training set
RF1.fit(X_train, Y_train)
# Visualize RF1.fit using export_graphviz
# https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
# Extract a single tree from the forest
estimator=RF1.estimators_[5]
RF1_sample=sklearn.tree.export_graphviz(estimator,
out_file="C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie"
"/Predicting_Bike_Rental_Demand/FirstCycle/RF1.dot")
# Predict y variable in validation set.
y_pred=RF1.predict(X_validate)
# TODO Visualize prediction?
y_true=Y_validate.rent_count.to_numpy()
# Evaluate prediction using root mean squared log error
scoreRMSLE=math.sqrt(sklearn.metrics.mean_squared_log_error(y_true, y_pred))
# TODO Visualize score
# DataFrame.plot.scatter()
# matplotlib.scatter(x,y)
# Return further parameters
# CPU time in nanoseconds at the end of the process
end = time.process_time_ns()
runtime = end-start
#Code runs fine up to this point.
# Return further parameters
# Get CPU usage
# How much cpu is used in total numbers
cpu_load = psutil.getloadavg()
# How many percent of the cpu are being used. Multiply by 100 in order to get a percent instead of a decimal.
cpu_usage = (cpu_load(os.cpu_count()) * 100
# CPU time in nanoseconds at the end of the process
end = time.process_time_ns()
runtime = end-start
File "<input>", line 75
end = time.process_time_ns()
^
SyntaxError: invalid syntax
# Get CPU usage
# How much cpu is used in total numbers
#cpu_load = psutil.getloadavg()
# How many percent of the cpu are being used. Multiply by 100 in order to get a percent instead of a decimal.
#cpu_usage = (cpu_load(os.cpu_count()) * 100
# CPU time in nanoseconds at the end of the process
end = time.process_time_ns()
runtime = end-start
print(runtime)
6812500000
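The printed number is CPU time in nanoseconds. time.process_time_ns() returns the sum of system and user CPU time of the current process and does not include time spent sleeping, so only the difference between two calls means anything. A small sketch of the same measurement pattern with a conversion to seconds follows. The workload is an arbitrary placeholder.

```python
import time

start = time.process_time_ns()
# Some CPU-bound work to measure
total = sum(i * i for i in range(200_000))
end = time.process_time_ns()

runtime_ns = end - start
runtime_s = runtime_ns / 1_000_000_000  # nanoseconds to seconds
print(f"CPU time: {runtime_ns} ns = {runtime_s:.4f} s")

# The value printed above, converted the same way
print(6812500000 / 1_000_000_000, "s")
```

So the script above used roughly 6.8 seconds of CPU time.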
Putting the CPU usage statements before end = time.process_time_ns() produces a syntax error.
In fact, it's the statement cpu_usage = cpu_load(os.cpu_count()) * 100 that seems to cause a runtime error or is just unbearably slow. It's not useful in this form, so let's remove it.
print("RAM memory used:", psutil.virtual_memory()[2]), "%")
^
SyntaxError: unmatched ')'
Removing the CPU statements has solved the print("Score RMSLE:", scoreRMSLE) syntax error. Apparently, (cpu_load(os.cpu_count()) * 100 was missing a closing bracket, so all of the following code was treated as part of this statement.
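This effect can be reproduced without running the whole script. The sketch below compiles a hypothetical two-line snippet containing the same mistake and prints where Python reports the error. Depending on the Python version, the report points either at the line after the unclosed bracket (as "invalid syntax") or, in newer versions, at the bracket itself.

```python
# Hypothetical two-line snippet with the same mistake: '(' is never closed
source = (
    "cpu_usage = (load * 100\n"
    "end = 1\n"
)

try:
    compile(source, "<example>", "exec")
    error = None
except SyntaxError as exc:
    error = exc

print(type(error).__name__, "reported at line", error.lineno)
print(error.msg)
```

When a SyntaxError points at a line that looks fine, the line before it is worth checking for an unclosed bracket.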
The syntax error unmatched ')' was fixed by deleting the spare ).
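For reference, a sketch of how the CPU and RAM statements could look with balanced brackets. psutil.getloadavg() returns a tuple of three load averages, so it has to be indexed or unpacked. Calling it like a function, as in cpu_load(os.cpu_count()), would raise a TypeError even once the bracket is fixed. Dividing the load by the CPU count is one way to express it as a percentage, and sampling cpu_percent over 0.1 seconds is an arbitrary choice.

```python
import os

import psutil

# RAM usage in percent; .percent is the same value as index [2]
ram_percent = psutil.virtual_memory().percent
print("RAM memory used:", ram_percent, "%")

# CPU utilisation in percent, sampled over a short interval
cpu_percent = psutil.cpu_percent(interval=0.1)
print("CPU used:", cpu_percent, "%")

# Load average (1, 5, 15 minutes) relative to the number of logical CPUs
load_1, load_5, load_15 = psutil.getloadavg()
cpu_count = os.cpu_count() or 1
load_percent = load_1 / cpu_count * 100
print("1-minute load:", round(load_percent, 1), "%")
```

On Windows, psutil emulates the load average and may report zeros for the first few seconds after the first call.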
To Do