Henrike-Schwenn / Predicting_bike_rental_demand

My first AI project as part of my take on the amazing online course "Introduction to Machine Learning for Coders" taught by Jeremy Howard. I will be contributing to the Kaggle competition "Bike Sharing Demand", aiming to predict bike rental demand depending on the weather.

Second Cycle #57

Closed Henrike-Schwenn closed 2 years ago

Henrike-Schwenn commented 2 years ago

To Do

Henrike-Schwenn commented 2 years ago

Tidy up the project directory


Henrike-Schwenn commented 2 years ago

Move out of the venv?


Henrike-Schwenn commented 2 years ago

Max. line length 80


How do I set the maximum line length in PyCharm?


train_path = "C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie"
             "Predicting_Bike_Rental_Demand/Datasets/train.csv"
Predicting_Bike_Rental_Demand/Datasets/train.csv"
IndentationError: unexpected indent

The preferred way of wrapping long lines is by using Python’s implied line continuation inside parentheses, brackets and braces. Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation.

PEP 8, "Maximum Line Length" (https://peps.python.org/pep-0008/#indentation)

train_path = ("C:/Users/henri/OneDrive/Dokumente/Berufseinstieg"
              "/Sprachtechnologie/Predicting_Bike_Rental_Demand/Datasets"
              "/train.csv")
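
Why the parenthesized version works can be shown with a minimal sketch. Inside parentheses, Python joins adjacent string literals into one string, so the wrapped form is identical to the single-line form. The shortened path below is only an illustration, not the real project path.

```python
# Adjacent string literals inside parentheses are joined into one string.
# The shortened path is a stand-in used only to keep the example readable.
wrapped = ("Predicting_Bike_Rental_Demand"
           "/Datasets"
           "/train.csv")

single_line = "Predicting_Bike_Rental_Demand/Datasets/train.csv"

# Both spellings produce exactly the same string.
print(wrapped == single_line)
```

Without the parentheses, the first line is already a complete statement, so the indented second line triggers the IndentationError shown above.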
Henrike-Schwenn commented 2 years ago

Relative paths for datasets


# Dataframe Training Set
# Parentheses are required so the string can continue on the next lines
train_path = ("C:/Users/henri/OneDrive/Dokumente/Berufseinstieg"
              "/Sprachtechnologie/Predicting_Bike_Rental_Demand/Datasets"
              "/train.csv")
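
One possible way to make the dataset path relative is sketched below with pathlib. It assumes that the script is started from the project root and that the data lives in a `Datasets` folder, as in the absolute path above. The helper `dataset_path` and its `project_root` parameter are hypothetical names introduced for this example.

```python
from pathlib import Path


def dataset_path(file_name, project_root=None):
    """Build a path to a file in the Datasets folder.

    project_root is a hypothetical parameter; if it is omitted, the
    current working directory is assumed to be the project root.
    """
    root = Path(project_root) if project_root is not None else Path.cwd()
    return root / "Datasets" / file_name


# Resolves against whatever directory the script is started from.
train_path = dataset_path("train.csv")
print(train_path)
```

The resulting `Path` object can be passed directly to `pandas.read_csv`, so the code no longer depends on one user's directory layout.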
Henrike-Schwenn commented 2 years ago

Consistent variable naming


Function and Variable Names

Function names should be lowercase, with words separated by underscores as necessary to improve readability.

Variable names follow the same convention as function names.

mixedCase is allowed only in contexts where that’s already the prevailing style (e.g. threading.py), to retain backwards compatibility.
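
To illustrate the convention quoted above, here is a small sketch that converts mixedCase names to lowercase with underscores. The helper `to_snake_case` is a hypothetical name made up for this example; it is neither part of PEP 8 nor of any library, and it only handles simple names.

```python
import re


def to_snake_case(name):
    """Convert a simple mixedCase name to lowercase_with_underscores.

    Hypothetical helper for illustration only.
    """
    # Insert an underscore wherever a lowercase letter or digit is
    # directly followed by an uppercase letter, then lowercase everything.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


for old_name in ["trainPath", "sampleSubmission", "train_path"]:
    print(old_name, "->", to_snake_case(old_name))
```

Names that already follow the convention, such as `train_path`, pass through unchanged.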

Henrike-Schwenn commented 2 years ago

Script "Train"

Henrike-Schwenn commented 2 years ago

Script "Test"

Instance rf_2_trained

At least Python gets that this is an instance of the class sklearn.ensemble._forest.RandomForestRegressor, even though it was defined in another file. Very nice. :)

dir(rf_2_trained)
['__abstractmethods__', '__annotations__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_n_features', '_compute_partial_dependence_recursion', '_estimator_type', '_get_param_names', '_get_tags', '_make_estimator', '_more_tags', '_repr_html_', '_repr_html_inner', '_repr_mimebundle_', '_required_parameters', '_set_oob_score', '_validate_X_predict', '_validate_data', '_validate_estimator', '_validate_y_class_weight', 'apply', 'base_estimator', 'base_estimator_', 'bootstrap', 'ccp_alpha', 'class_weight', 'criterion', 'decision_path', 'estimator_params', 'estimators_', 'feature_importances_', 'fit', 'get_params', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_features_', 'n_features_in_', 'n_jobs', 'n_outputs_', 'oob_score', 'predict', 'random_state', 'score', 'set_params', 'verbose', 'warm_start']
rf_2_trained.__class__
<class 'sklearn.ensemble._forest.RandomForestRegressor'>

Runtime error message

Log: does numpy.log(test_y_second_cycle.rent_count) lead to a division by zero?

C:\Users\henri\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\arraylike.py:364: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
print(test_y_second_cycle)
      rent_count  datetimeYear  ...  datetimeIs_year_start  datetimeElapsed
0           -inf          2011  ...                  False     1.295482e+09
1           -inf          2011  ...                  False     1.295485e+09
2           -inf          2011  ...                  False     1.295489e+09
3           -inf          2011  ...                  False     1.295492e+09
4           -inf          2011  ...                  False     1.295496e+09

That's why: all y values in the file sampleSubmission.csv are 0. That can't be right somehow.

datetime,rent_count
2011-01-20 00:00:00,0
2011-01-20 01:00:00,0
2011-01-20 02:00:00,0
2011-01-20 03:00:00,0
2011-01-20 04:00:00,0
2011-01-20 05:00:00,0
2011-01-20 06:00:00,0

That isn't the y column of the test set either, but only a template showing how the predicted y values are to be uploaded to Kaggle.
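
The warning can be reproduced with a minimal sketch: the logarithm of 0 is negative infinity, which NumPy reports as "divide by zero encountered in log". The values below are made up for illustration. `numpy.log1p`, which computes log(1 + x), is shown only as one common way to handle zeros; it is not what the script in this issue uses.

```python
import numpy as np
import pandas as pd

# Made-up values; the first two mimic the all-zero template column.
rent_count = pd.Series([0, 0, 16, 40], name="rent_count")

# log(0) is -inf. errstate only silences the RuntimeWarning here,
# the result is still -inf for every zero entry.
with np.errstate(divide="ignore"):
    logged = np.log(rent_count)

# log1p computes log(1 + x), so zeros map to 0.0 instead of -inf.
logged_safe = np.log1p(rent_count)

print(logged.tolist())
print(logged_safe.tolist())
```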

ValueError: X has 21 features, but DecisionTreeRegressor is expecting 24 features as input.

Raised when running y_pred = rf_2_trained.predict(test_second_cycle)

Pandas – Number of Rows in a Dataframe

pandas: Get the number of rows, columns, all elements (size) of DataFrame

The training set does indeed have about 1.7 times as many rows as the test set.

print(len(test_second_cycle.index))
6493

print(len(train_second_cycle.index))
10886

10886/6493
1.6765747728322808

There's the problem: the training set has 2 more x variables than the test set.

print(len(test_second_cycle.columns))
21
train_second_cycle = add_datepart(train_second_cycle, "datetime", drop=True)
print(len(train_second_cycle.columns))
24

Only in the training set:

Solution: remove those two from the training set. Not pretty, but the simplest option.

Still one too many.

ValueError: X has 21 features, but DecisionTreeRegressor is expecting 22 features as input.
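
A mismatch like this can be tracked down by comparing the column labels of both DataFrames directly. The sketch below uses tiny made-up frames. The column names `extra_a` and `extra_b` are placeholders, not the real dataset columns.

```python
import pandas as pd

# Tiny made-up frames; "extra_a" and "extra_b" are placeholder names.
train = pd.DataFrame({
    "temp": [9.8],
    "humidity": [81],
    "extra_a": [3],
    "extra_b": [13],
    "rent_count": [16],
})
test = pd.DataFrame({"temp": [10.7], "humidity": [56]})

# Column labels that exist in the training set but not in the test set.
only_in_train = train.columns.difference(test.columns)
print(list(only_in_train))

# Selecting the test set's columns gives the model the same features,
# in the same order, for fitting and for predicting.
train_X = train[test.columns]
print(train_X.shape[1] == test.shape[1])
```

If the target column is among the leftover labels, it would explain a model that expects exactly one feature more than the test set provides.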

Henrike-Schwenn commented 2 years ago

Score without y values?

Henrike-Schwenn commented 2 years ago

Save y_pred as CSV

Columns

# Save column "datetime" for submission
datetime = test_second_cycle["datetime"].to_numpy()

pandas.Series.to_numpy

type(datetime)
<class 'numpy.ndarray'>
type(y_pred)
<class 'numpy.ndarray'>

numpy.column_stack

submission = numpy.column_stack((datetime, y_pred))

TypeError: The DTypes <class 'numpy.dtype[float64]'> and <class 'numpy.dtype[datetime64]'> do not have a common DType. For example they cannot be stored in a single array unless the dtype is `object`.

Solution: do without parse["datetime"]

For that, convert the array back into a DataFrame

df_submission = pandas.DataFrame(submission, columns=["datetime", "count"])

# to_csv returns None when it is given a path, so nothing is assigned
df_submission.to_csv("C:/Users/henri/OneDrive/Dokumente/"
                     "Berufseinstieg/Sprachtechnologie/"
                     "Predicting_Bike_Rental_Demand"
                     "/Second Cycle/Submission.csv")
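
As an alternative sketch, the two arrays can be placed into a DataFrame column by column. This sidesteps numpy.column_stack and its common-dtype requirement, because each DataFrame column keeps its own dtype. The values below are made up, and writing to an in-memory buffer with `index=False` is an assumption chosen for this example, not the approach used in this issue.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-ins for the real datetime column and predictions.
datetime_col = pd.to_datetime(
    ["2011-01-20 00:00:00", "2011-01-20 01:00:00"]
).to_numpy()
y_pred = np.array([12.3, 8.1])

# Each column keeps its own dtype (datetime64 and float64).
df_submission = pd.DataFrame({"datetime": datetime_col, "count": y_pred})

# Write to an in-memory buffer instead of a file for this example;
# index=False leaves out the row index column.
buffer = io.StringIO()
df_submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path instead of the buffer writes the same content to disk.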
