Leci37 / TensorFlow-stocks-prediction-Machine-learning-RealTime

Predict stock operation points (buy-sell) from past technical patterns, with powerful machine-learning libraries such as Sklearn.RandomForest, Sklearn.GradientBoosting, XGBoost, Google TensorFlow and Google TensorFlow LSTM. Real-time alerts via Twitter.

Respond to Multidimensional arrays feeding in tensorflow #24

Closed: Leci37 closed this issue 10 months ago

Leci37 commented 10 months ago

In response to https://stackoverflow.com/questions/40964817/multidimensional-arrays-feeding-in-tensorflow/75948534#75948534

To feed a network with multidimensional arrays in TensorFlow, here is a good example of how to read a .csv (a 2D array, with one ground-truth (GT) reference per time row), balance the GT data with SMOTE, and have TF process and train on it as a 3D array.

A very good, simple example: https://github.com/Leci37/stocks-prediction-Machine-learning-RealTime-telegram/blob/master/Tutorial/RUN_buy_sell_Tutorial_3W_5min_RT.py
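
The core idea in isolation: Keras fit() accepts a 3D NumPy array of shape (samples, timesteps, features) directly. A minimal sketch with illustrative shapes (the model and sizes here are placeholders, not the repo's actual architecture):

import numpy as np
import tensorflow as tf

# Illustrative shapes: 1000 samples, 10 lookback time steps, 12 features each
X = np.random.rand(1000, 10, 12).astype("float32")
y = np.random.randint(0, 2, size=(1000, 1)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10, 12)),   # (timesteps, features); batch dim omitted
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=2, verbose=0)         # the 3D array is fed to fit() as-is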

3D data training from .csv information. For TF to fit() multidimensional 3D arrays, code like the following is required (train_features is a 3D array). Full code here: https://github.com/Leci37/stocks-prediction-Machine-learning-RealTime-telegram/blob/master/Model_train_TF_multi_onBalance.py

# Imbalanced data: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data
# np, BATCH_SIZE, EPOCHS, MODEL_FOLDER_TF_MULTI, Utils_model_predict and _KEYS_DICT are module-level imports/constants from the full file
def train_TF_Multi_dimension_onBalance(multi_data: Data_multidimension, model_h5_name, model_type : _KEYS_DICT.MODEL_TF_DENSE_TYPE_MULTI_DIMENSI):

    # 1.0  LOAD the data with TF format (split, 3D, SMOTE and balance)
    array_aux_np, train_labels, val_labels, test_labels, train_features, val_features, test_features, bool_train_labels = multi_data.get_all_data()
    # TRAIN
    neg, pos = np.bincount(array_aux_np) #(df[Y_TARGET])
    initial_bias = np.log([pos / neg])
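    # Setting the output layer's initial bias to log(pos/neg) makes the model
    # start from the observed class ratio, which helps convergence on
    # imbalanced data (see the TF imbalanced_data tutorial linked above)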

    # 2.0  STOP EARLY: create a CustomEarlyStopping callback to avoid overfitting
    resampled_steps_per_epoch = np.ceil(2.0 * neg / BATCH_SIZE)
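    # In the TF imbalanced_data tutorial, ceil(2*neg/BATCH_SIZE) steps means one
    # epoch sees about 2*neg samples, i.e. each negative roughly once on a
    # 50/50 resampled set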
    early_stopping = Utils_model_predict.CustomEarlyStopping(patience=8)

    # 3.0 TRAIN get the model from list of models objects and train
    model = multi_data.get_dicts_models_multi_dimension(model_type)
    model_history = model.fit(
          x=train_features, y=train_labels ,
          epochs=EPOCHS,
          steps_per_epoch=resampled_steps_per_epoch,
          callbacks=[early_stopping],  # callbacks=[early_stopping, early_stopping_board],
          validation_data=(val_features, val_labels),  verbose=0)

    # 3.1 Save the model into a .h5 file for reuse
    model.save(MODEL_FOLDER_TF_MULTI + model_h5_name)
    print(" Save model Type MULTI TF: " + model_type.value +"  Path:  ", MODEL_FOLDER_TF_MULTI + model_h5_name)

    # 4.0 EVAL the model with test_features; this data was split off earlier and the .h5 model has never seen it
    predit_test  = model.predict(test_features).reshape(-1,)
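
get_dicts_models_multi_dimension() picks a compiled model out of the repo's model dictionary. As a rough, hypothetical stand-in (not the repo's actual architectures), a model compatible with this training loop could look like:

import tensorflow as tf

def build_multi_dimension_model(input_shape, output_bias=None):
    # Hypothetical sketch; input_shape = (BACHT_SIZE_LOOKBACK, n_features)
    bias_init = None if output_bias is None else tf.keras.initializers.Constant(output_bias)
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(32, activation="relu"),
        # initial_bias (log(pos/neg)) computed above would be injected here
        tf.keras.layers.Dense(1, activation="sigmoid", bias_initializer=bias_init),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model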

Getting the data into the correct 3D array format requires the following. (The example starts from unbalanced GT data and applies the balance and SMOTE corrections.) Full code here: https://github.com/Leci37/stocks-prediction-Machine-learning-RealTime-telegram/blob/master/Data_multidimension.py

def load_split_data_multidimension(self):
    df = Utils_model_predict.load_and_clean_DF_Train_from_csv(self.path_CSV, self.op_buy_sell, self.columns_selection)
    # SMOTE and Tomek links
    # The SMOTE oversampling approach could generate noisy samples since it creates synthetic data. To solve this problem, after SMOTE, we could use undersampling techniques to clean up. We’ll use the Tomek links undersampling technique in this example.
    # Utils_plotter.plot_2d_space(df.drop(columns=[Y_TARGET]).iloc[:,4:5] , df[Y_TARGET], path = "SMOTE_antes.png")
    array_aux_np = df[Y_TARGET]  # TODO before or after the balancing??
    self.array_aux_np = array_aux_np
    print("1.0 ADD MULTIDIMENSION  Get 2D array , with BACHT_SIZE_LOOKBACK from "backward glances".")
    # Values go from (10000 rows, 10 columns) to (10000 rows, (10-1 [groundTrue]) * 10 lookback columns); for the moment it stays 2D, not yet a 3D array
    # df.shape: (1000, 10) to (1000, 90)
    arr_mul_labels, arr_mul_features = Utils_model_predict.df_to_df_multidimension_array_2D(df.reset_index(drop=True), BACHT_SIZE_LOOKBACK = self.BACHT_SIZE_LOOKBACK)
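    # Each row now packs one whole lookback window (BACHT_SIZE_LOOKBACK time
    # steps x 9 features = 90 values), which is what lets 2D-only tools like
    # the scaler and SMOTE run before the final 3D reshape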
    shape_imput_3d = (-1,self.BACHT_SIZE_LOOKBACK, len(df.columns)-1) # (-1, 10, 12)
    print("1.1 validate the structure of the data, this can be improved by")
    arr_vali = arr_mul_features.reshape(shape_imput_3d) # 5077, 10, 12
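    # Spot-check sampled windows: column 0 must still hold an epoch timestamp
    # with a plausible year, proving the reshape did not scramble the columns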
    for i in range(1, arr_vali.shape[0], self.BACHT_SIZE_LOOKBACK * 3):
        list_fails_dates = [x for x in arr_vali[i][:, 0] if not (2018 <= datetime.fromtimestamp(x).year <= 2024)]
        if list_fails_dates:
            Logger.logr.error("The dates of the new 2D array do not appear in the first column. ")
            raise ValueError("The dates of the new 2D array do not appear in the first column. ")
    print("2.0 SCALER  scaling the data before, save a .scal file (it will be used to know how to scale the model for future predictions )")
    # Do I have to scale now or can I wait until after I split
    # You can scale between the following values _KEYS_DICT.MIN_SCALER, _KEYS_DICT.MAX_SCALER
    # " that you learn for your scaling so that doing scaling before or after may give you the same results (but this depends on the actual scaling function)."  https://datascience.stackexchange.com/questions/71515/should-i-scale-data-before-or-after-balancing-dataset
    # TODO verify the correct order of scale, split and SMOTE. What is sure: SMOTE is only applied on train_df
    arr_mul_features =  Utils_model_predict.scaler_min_max_array(arr_mul_features,path_to_save= _KEYS_DICT.PATH_SCALERS_FOLDER+self.name_models_stock+".scal")
    arr_mul_labels = Utils_model_predict.scaler_min_max_array(arr_mul_labels.reshape(-1,1))
    print("2.1 Let's put real groound True Y_TARGET  in a copy of scaled dataset")
    df_with_target = pd.DataFrame(arr_mul_features)
    df_with_target[Y_TARGET] = arr_mul_labels.reshape(-1,)
    print("3.0 SPLIT Ok we should split in 3 train val and test")
    # "you divide your data first and then apply synthetic sampling SMOTE on the training data only" https://datascience.stackexchange.com/questions/15630/train-test-split-after-performing-smote
    # CAUTION SMOTE generates twice as many rows
    train_df, test_df = train_test_split(df_with_target, test_size=0.18, shuffle=self.will_shuffle) # Shuffle in a time series? hmmm
    train_df, val_df = train_test_split(train_df, test_size=0.35, shuffle=self.will_shuffle)  # Shuffle in a time series? hmmm
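    # NOTE: shuffling a time series before splitting leaks future information
    # into training, so will_shuffle should normally be False for forecasting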
    # Be careful not to touch test_df, val_df
    # Apply smote only to train_df but first remove Y_TARGET from train_df
    print("3.1 Create a array 2d form dfs . Remove Y_target from train_df, because that's we want to predict and that would be cheating")
    train_df_x = np.asarray(train_df.drop(columns=[Y_TARGET] ) )
    # In train_df_y We drop everything except Y_TARGET
    train_df_y = np.asarray(train_df[Y_TARGET] )
    print("4.0 SMOTE train_df to balance the data since there are few positive inputs, you have to generate "neighbors" of positive inputs. only in the df_train.") 
    # Now we can smote only train_df . Doing the smote with 2D, with 3D is not possible.
    X_smt, y_smt = Utils_model_predict.prepare_to_split_SMOTETomek_01(train_df_x, train_df_y)
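    # X_smt, y_smt come back roughly class-balanced: SMOTE oversamples the
    # minority class, then Tomek links remove noisy borderline pairs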
    print("4.1 Let's put real groound True Y_TARGET  in a copy of scaled dataset")
    train_cleaned_df_target = pd.DataFrame(X_smt)
    train_cleaned_df_target[Y_TARGET] = y_smt.reshape(-1,)
    # SMOTE leaves the generated positives very close together, so shuffle
    train_cleaned_df_target = shuffle(train_cleaned_df_target)
    print("5 PREPARE the data to be entered in TF with the correct dimensions")
    print("5.1 pass Y_TARGET labels to 2D array required for TF")
    train_labels = np.asarray(train_cleaned_df_target[Y_TARGET]).astype('float32').reshape((-1, 1)) # no need already 2d
    bool_train_labels = (train_labels != 0).reshape((-1))
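    # Boolean mask of the positive training rows (useful for pos/neg
    # inspection or resampling, as in the TF imbalanced_data tutorial)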
    val_labels = np.asarray(val_df[Y_TARGET]).astype('float32').reshape((-1, 1)) # no need already 2d
    test_labels = np.asarray(test_df[Y_TARGET]).astype('float32').reshape((-1, 1)) # no need already 2d
    print("5.2 all array windows that were in 2D format (to overcome the SCALER and SMOTE methods),")
    # must be displayed in 3D for TF by format of varible shape_imput_3d
    train_features = np.array(train_cleaned_df_target.drop(columns=[Y_TARGET]) ).reshape(shape_imput_3d)
    test_features = np.array(test_df.drop(columns=[Y_TARGET]) ).reshape(shape_imput_3d )
    val_features = np.array(val_df.drop(columns=[Y_TARGET]) ).reshape(shape_imput_3d )
    print("6 DISPLAY show the df format before accessing TF")
    Utils_model_predict.log_shapes_trains_val_data(test_features, test_labels, train_features, train_labels, val_features, val_labels)

    self.imput_shape = (train_features.shape[1], train_features.shape[2])
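    # imput_shape is (timesteps, features), the per-sample input shape for the TF model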

    self.train_labels = train_labels
    self.val_labels = val_labels
    self.test_labels = test_labels
    self.train_features = train_features
    self.val_features = val_features
    self.test_features = test_features
    self.bool_train_labels = bool_train_labels

def get_all_data(self):
    return self.array_aux_np, self.train_labels, self.val_labels, self.test_labels, self.train_features, self.val_features, self.test_features, self.bool_train_labels
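
The key trick of the whole pipeline, that SMOTE only works on 2D data so the windows are flattened, resampled and then reshaped back to 3D, can be reproduced in isolation with synthetic data. A minimal sketch assuming imbalanced-learn is installed (shapes are illustrative):

import numpy as np
from imblearn.combine import SMOTETomek

LOOKBACK, N_FEATURES = 10, 12
X_2d = np.random.rand(500, LOOKBACK * N_FEATURES)      # flattened lookback windows
y = np.random.choice([0, 1], size=500, p=[0.9, 0.1])   # imbalanced ground truth

X_res, y_res = SMOTETomek().fit_resample(X_2d, y)      # balance in 2D (SMOTE + Tomek links)
X_3d = X_res.reshape(-1, LOOKBACK, N_FEATURES)         # back to 3D for TF's fit()
print(X_3d.shape, np.bincount(y_res))                  # classes now roughly balanced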