Xtra-Computing / thundergbm

ThunderGBM: Fast GBDTs and Random Forests on GPUs
Apache License 2.0

the Random Forest classifies everything to be 1 #68

Open AlanSpencer2 opened 2 years ago

AlanSpencer2 commented 2 years ago

I am new to thundergbm and am just trying to get a simple Random Forest classifier going. But the classifier classifies every single sample as 1 — not one case out of 188244 samples is classified as 0. No other classifier behaves like this. I also tried different numbers of trees, depths, etc., but it still classifies everything as 1. Is there something wrong with the following code?

```python
from thundergbm import TGBMClassifier

clf = TGBMClassifier(depth=6, n_trees=1, n_parallel_trees=100, bagging=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```

y_pred is 1 for every sample in the test set (X_test).
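A quick way to confirm the collapse is to count the predicted classes directly. A minimal sketch (the array here is a synthetic stand-in for the real `y_pred`, matching the all-ones behaviour described above):

```python
import numpy as np

# Synthetic stand-in for the all-ones predictions described above.
y_pred = np.ones(188244, dtype=int)

# np.unique with return_counts shows exactly which classes were predicted.
classes, counts = np.unique(y_pred, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))  # {1: 188244}
```

If the printed dictionary ever has only one key, the model is predicting a single class for the entire test set.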

Kurt-Liuhf commented 2 years ago

@AlanSpencer2 Hi, I used the classifier with the same parameters to fit the covtype data set from sklearn but I could not reproduce your results. The predictions seem to be correct. So it would be better if you could provide a subset of your data set. Thanks.

AlanSpencer2 commented 2 years ago

Hi, the problem occurs with binary classification — that is, when the target variable is 0 or 1, True or False. Can you please try a binary classification problem? (Not regression, and not multi-class classification.)
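For comparison, a CPU-based random forest on a binary problem normally predicts a mix of both classes. A small sketch using scikit-learn's `RandomForestClassifier` on a synthetic dataset (the data here is generated, not the reporter's, and the parameters are only roughly comparable to the thundergbm ones above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary problem as a stand-in for the reporter's data.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# A CPU random forest with comparable depth/tree settings.
clf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(np.unique(y_pred))  # expect both classes, 0 and 1
```

This is the behaviour the report says every other classifier shows; only the thundergbm classifier collapses to a single class.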

Here is the Iris dataset with 3 different flower types. The target/label variable is 1 if the flower is Setosa, and 0 for the other 2 flower types: iris_data.csv
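The attached CSV is not available here, but an equivalent binary labelling can be reconstructed from scikit-learn's built-in Iris data (the `Label` column name is taken from the description above; the reconstruction is an assumption, not the actual attachment):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.drop(columns='target')

# Label is 1 for Setosa (target class 0), 0 for the other two species.
df['Label'] = (iris.target == 0).astype(int)
print(df['Label'].value_counts())  # 50 Setosa, 100 others
```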

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from thundergbm import TGBMClassifier

df = pd.read_csv(r'C:\Python\iris_data.csv', encoding='ISO-8859-1',
                 low_memory=False, index_col=0)
X = df[['sepal length (cm)', 'sepal width (cm)',
        'petal length (cm)', 'petal width (cm)']]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf = TGBMClassifier(depth=6, n_trees=1, n_parallel_trees=100, bagging=1)
clf.fit(X_train, y_train)
pred_test = clf.predict(X_test)
```

The predictions are never a mixture of 1s and 0s; either all predictions are 0 or all predictions are 1.

P.S. I have tried all kinds of different datasets; they all had the same issue.