Closed: Henrike-Schwenn closed this issue 2 years ago
Eike
How do I set the maximum line length in PyCharm?
No automatic line wrap on paste yet
None while typing either
But it does show an error message: line too long
"Fill the paragraph" has no effect
This checkbox might do it
For functions, at least, PyCharm indents continuation lines to the opening parenthesis
The interpreter cannot handle indented paths
train_path = "C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie"
             "/Predicting_Bike_Rental_Demand/Datasets/train.csv"
IndentationError: unexpected indent
"The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces. Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation." — Maximum Line Length (https://peps.python.org/pep-0008/#maximum-line-length)
train_path = ("C:/Users/henri/OneDrive/Dokumente/Berufseinstieg"
"/Sprachtechnologie/Predicting_Bike_Rental_Demand/Datasets"
"/train.csv")
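A minimal sketch of why the parenthesized form works: adjacent string literals inside parentheses are concatenated at compile time, so no backslash continuation is needed and the continuation lines may be indented freely.

```python
# Adjacent string literals inside parentheses are joined into one string
# at compile time (PEP 8's preferred way to wrap long lines).
train_path = ("C:/Users/henri/OneDrive/Dokumente/Berufseinstieg"
              "/Sprachtechnologie/Predicting_Bike_Rental_Demand/Datasets"
              "/train.csv")

# Without the parentheses, an indented second literal is a syntax-level
# problem (IndentationError), exactly as seen above.
assert train_path.endswith("/Datasets/train.csv")
assert train_path.count('"') == 0  # one clean string, no stray quotes
```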
Eike
# Dataframe Training Set
train_path = "C:/Users/henri/OneDrive/Dokumente/Berufseinstieg/Sprachtechnologie"
"/Predicting_Bike_Rental_Demand/Datasets/train.csv"
Eike
Function names should be lowercase, with words separated by underscores as necessary to improve readability.
Variable names follow the same convention as function names.
mixedCase is allowed only in contexts where that’s already the prevailing style (e.g. threading.py), to retain backwards compatibility.
Saving the trained RF
How to save and load Random Forest from Scikit-Learn in Python?
Save and Load Machine Learning Models in Python with scikit-learn
Possible file formats:
pickle
joblib
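A sketch of the save/load round trip. A plain dict stands in for the fitted model so the example runs without scikit-learn installed; `pickle.dump`/`pickle.load` (or the equivalent `joblib.dump`/`joblib.load`) treat a fitted estimator the same way.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted RandomForestRegressor from these notes;
# a real estimator pickles the same way.
rf_2 = {"n_estimators": 100, "fitted": True}

path = os.path.join(tempfile.gettempdir(), "rf_2_trained.pkl")
with open(path, "wb") as f:
    pickle.dump(rf_2, f)           # joblib.dump(rf_2, path) is equivalent

with open(path, "rb") as f:
    rf_2_trained = pickle.load(f)  # joblib.load(path) is equivalent

assert rf_2_trained == rf_2        # round trip preserves the object
os.remove(path)
```

joblib is often preferred for scikit-learn models with large NumPy arrays inside, but both formats work.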
It runs.
Time elapsed: 13437500000 nanoseconds
RAM memory used: 61.8 %
Performance with all other programs closed
Instance rf_2_trained
In any case, Python recognizes that this is an instance of the class sklearn.ensemble._forest.RandomForestRegressor,
even though it was defined in a different file. Very nice. :)
dir(rf_2_trained)
['__abstractmethods__', '__annotations__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_n_features', '_compute_partial_dependence_recursion', '_estimator_type', '_get_param_names', '_get_tags', '_make_estimator', '_more_tags', '_repr_html_', '_repr_html_inner', '_repr_mimebundle_', '_required_parameters', '_set_oob_score', '_validate_X_predict', '_validate_data', '_validate_estimator', '_validate_y_class_weight', 'apply', 'base_estimator', 'base_estimator_', 'bootstrap', 'ccp_alpha', 'class_weight', 'criterion', 'decision_path', 'estimator_params', 'estimators_', 'feature_importances_', 'fit', 'get_params', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_features_', 'n_features_in_', 'n_jobs', 'n_outputs_', 'oob_score', 'predict', 'random_state', 'score', 'set_params', 'verbose', 'warm_start']
rf_2_trained.__class__
<class 'sklearn.ensemble._forest.RandomForestRegressor'>
Runtime error message
Does numpy.log(test_y_second_cycle.rent_count)
lead to a division by zero?
C:\Users\henri\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\arraylike.py:364: RuntimeWarning: divide by zero encountered in log
result = getattr(ufunc, method)(*inputs, **kwargs)
print(test_y_second_cycle)
rent_count datetimeYear ... datetimeIs_year_start datetimeElapsed
0 -inf 2011 ... False 1.295482e+09
1 -inf 2011 ... False 1.295485e+09
2 -inf 2011 ... False 1.295489e+09
3 -inf 2011 ... False 1.295492e+09
4 -inf 2011 ... False 1.295496e+09
That's the reason: all y values in the file sampleSubmission.csv are 0. That can't be right.
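A short sketch of why the warning appears: in IEEE floating-point arithmetic, `log(0)` is `-inf`, which is exactly what the `-inf` rows above show. The linked Kaggle evaluation page uses the RMSLE metric, whose `log(y + 1)` form (NumPy's `log1p`) is defined at 0.

```python
import numpy as np

# log(0) is -inf, producing the "divide by zero encountered in log"
# RuntimeWarning seen above.
with np.errstate(divide="ignore"):
    logged = np.log(np.array([0.0, 0.0, 5.0]))
assert np.isneginf(logged[0])      # -inf for the zero entries

# log1p (i.e. log(y + 1), as in RMSLE) is well defined at 0:
assert np.log1p(0.0) == 0.0
```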
datetime,rent_count
2011-01-20 00:00:00,0
2011-01-20 01:00:00,0
2011-01-20 02:00:00,0
2011-01-20 03:00:00,0
2011-01-20 04:00:00,0
2011-01-20 05:00:00,0
2011-01-20 06:00:00,0
This is not the y column of the test set either, just a template for how the predicted y values are to be uploaded to Kaggle.
ValueError: X has 21 features, but DecisionTreeRegressor is expecting 24 features as input.
Raised when running y_pred = rf_2_trained.predict(test_second_cycle)
Pandas – Number of Rows in a Dataframe
pandas: Get the number of rows, columns, all elements (size) of DataFrame
The training set actually has about 1.7 times as many rows as the test set.
print(len(test_second_cycle.index))
6493
print(len(train_second_cycle.index))
10886
10886/6493
1.6765747728322808
There's the problem: the training set has 2 more x variables than the test set.
print(len(test_second_cycle.columns))
21
train_second_cycle = add_datepart(train_second_cycle, "datetime", drop=True)
print(len(train_second_cycle.columns))
24
Only in the training set:
Solution: remove the two from the training set. Ugly, but the simplest fix.
Still one too many.
ValueError: X has 21 features, but DecisionTreeRegressor is expecting 22 features as input.
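A generic way to diagnose and fix this kind of feature mismatch is to compare the column sets and then restrict the training features to exactly the columns the test set has. The miniature frames below are hypothetical stand-ins for train_second_cycle and test_second_cycle: the training frame carries the target plus two columns the test frame lacks.

```python
import pandas as pd

# Hypothetical stand-ins: training set has the target plus 2 extra columns.
train = pd.DataFrame({"a": [1], "b": [2], "extra1": [3], "extra2": [4],
                      "rent_count": [5]})
test = pd.DataFrame({"a": [1], "b": [2]})

# Which columns exist only in the training set?
only_in_train = set(train.columns) - set(test.columns)
assert only_in_train == {"extra1", "extra2", "rent_count"}

# Keep exactly the feature columns the test set has, in the same order,
# so the fitted model and predict() see identical feature layouts:
X_train = train[test.columns]
assert list(X_train.columns) == list(test.columns)
```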
add_datepart
How to make predictions on test data set with no dependent variable?
How do I proceed with a dataset in machine learning if no dependent variable is there?
https://www.kaggle.com/competitions/bike-sharing-demand/overview/evaluation
I can't compute the score here myself; only the Kaggle reviewers would have done that. Probably meant to prevent cheating.
Columns
datetime
y_pred
Save "datetime" as an array
# Save column "datetime" for submission
datetime = test_second_cycle["datetime"].to_numpy()
type(datetime)
<class 'numpy.ndarray'>
type(y_pred)
<class 'numpy.ndarray'>
submission = numpy.column_stack((datetime,y_pred))
TypeError: The DTypes <class 'numpy.dtype[float64]'> and <class 'numpy.dtype[datetime64]'> do not have a common DType. For example they cannot be stored in a single array unless the dtype is
object.
Solution: do without parse["datetime"]
Instead, convert the array back into a dataframe
df_submission = pandas.DataFrame(submission, columns=["datetime", "count"])
df_submission.to_csv("C:/Users/henri/OneDrive/Dokumente/"
                     "Berufseinstieg/Sprachtechnologie/"
                     "Predicting_Bike_Rental_Demand"
                     "/Second Cycle/Submission.csv", index=False)
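The dtype clash above can also be sidestepped entirely by staying in pandas: build the submission frame directly from the two columns instead of stacking a datetime64 array next to a float64 array in NumPy. The small inputs below are hypothetical stand-ins for test_second_cycle.datetime and y_pred.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the real columns from the notes.
datetimes = pd.to_datetime(["2011-01-20 00:00:00", "2011-01-20 01:00:00"])
y_pred = [12.3, 45.6]

# pandas keeps each column's dtype separately, so no common-dtype error.
df_submission = pd.DataFrame({"datetime": datetimes, "count": y_pred})

buf = io.StringIO()
df_submission.to_csv(buf, index=False)  # index=False drops the row index
assert buf.getvalue().splitlines()[0] == "datetime,count"
```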
To Do