gbm-developers / gbm

Gradient boosted models (the old gbm package)

GBM Producing different predictions on 2 different servers #55

Closed meet1704 closed 8 months ago

meet1704 commented 4 years ago

Hello Team,

We are running a GBM model on two different servers with exactly the same R version and gbm version. We are predicting on exactly the same data, but GBM produces different predictions for some of the records. We do not apply any external label encoding before prediction; we rely on the package's internal label encoding.
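Since the report hinges on implicit encoding of categorical columns, a minimal base-R sketch of why that can be fragile may be useful. It makes no claim about what gbm does internally; the vector `country` and both level orders are made up for illustration. The integer code behind a factor value depends on the order of its levels, and when levels are not supplied explicitly, `factor()` sorts them, which can depend on the session locale.

```r
# Hypothetical categorical column (illustration only, not from the issue data).
country <- c("de", "US", "fr", "US")

# Same values, two different level orders (as might arise on two machines).
f1 <- factor(country, levels = c("de", "fr", "US"))
f2 <- factor(country, levels = c("US", "de", "fr"))

# The labels agree, but the underlying integer codes do not.
codes1 <- as.integer(f1)  # 1 3 2 3
codes2 <- as.integer(f2)  # 2 1 3 1

# Pinning the levels explicitly makes the encoding independent of the machine.
pinned <- factor(country, levels = levels(f1))
stopifnot(identical(as.integer(pinned), codes1))
```

If the two servers build factor columns in different ways (for example, different locales or different `stringsAsFactors` handling when reading data), checking `levels()` of each categorical column on both machines is a cheap first test.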

This is the distribution of the predicted variable on server 1. It reads as follows: 1821 records were predicted in bucket 0, 14236 in bucket 1, and so on.

table(temp$Prediction)

     0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18
  1821 14236 21582 12316 18035  6724 12986  4713  4908  8167   893   672    84   216   554   557  1285   205    27

    19    20    21    23    24    25    26    27    28    29    30    31    32    33    34    35    36    37    38
    69    45   207    54     7   176    30   653    22     6    97   123     1    76  3096    54   143    27    38

    39    41    42    43    47    48 99999
    18    56   267   242   182    31  4244

Prediction distribution on server 2:

table(temp$Prediction)

     0     1     2     3     4     5     6     7     8     9    10    11    12
  1826 14208 21743 12175 18037  6744 13913  3742  4890  8178   923   676    82

    13    14    15    16    17    18    19    20    21    23    24    25    26
   216   583   529  1284   205    27    69    67   185    54     7   176    30

    27    28    29    30    31    32    33    34    35    36    37    38    39
   653    22     6    97    16   108    76  3096    54   143    27    38    18

    41    42    43    47    48 99999
    56   267   242   182    31  4244

bgreenwell commented 4 years ago

Hey @meet1704, did you set the seed before fitting the two models? GBM, by default, samples the data in various ways before fitting each tree in the sequence. Also, without a reproducible example (or the code used to fit and deploy the model(s)), it is rather difficult to diagnose the issue.
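The point about seeding can be illustrated with a small sketch. It assumes the gbm package is installed; the data, the helper `fit_once`, and all parameter values are made up for illustration and are not taken from the reporter's setup. With `bag.fraction` below 1, each tree is fit on a random subsample, so two fits agree only if the random number generator starts from the same state.

```r
library(gbm)

# Simulated data (made up for illustration).
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = runif(n))
d$y <- 2 * d$x1 - d$x2 + rnorm(n, sd = 0.1)

# Hypothetical helper: fit a small model after fixing the RNG state.
fit_once <- function(seed) {
  set.seed(seed)  # fixes the per-tree subsampling controlled by bag.fraction
  gbm(y ~ x1 + x2, data = d, distribution = "gaussian",
      n.trees = 50, interaction.depth = 2, shrinkage = 0.1,
      bag.fraction = 0.5)
}

p_a <- predict(fit_once(1234), newdata = d, n.trees = 50)
p_b <- predict(fit_once(1234), newdata = d, n.trees = 50)

# Two fits from the same seed should give matching predictions.
same_seed_agrees <- isTRUE(all.equal(p_a, p_b))
```

Note that the seed matters at fit time. Scoring an already fitted model is expected to be deterministic, so a seed set only before `predict()` should not change anything; that is an assumption worth verifying on the model in question.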

meet1704 commented 4 years ago

Hey @bgreenwell ,

I set the seed in the code before prediction. I was not able to replicate the issue with toy code, so I am sharing the actual case. My model file (model.Rda) and the test data (test_record.Rda) are at the link below. https://github.com/meet1704/GBM_issue

Here is my code snippet -

    load("test_record.Rda")

    if (nrow(Analytical_ML) > 0) {
      # Build the names of the model object and tree-count object for each record
      # from columns 25, 26 and 12.
      Analytical_ML$model_name <- paste("GBM_Model_2_", Analytical_ML[, 25], "_",
                                        Analytical_ML[, 26], "_", Analytical_ML[, 12], sep = "")
      Analytical_ML$model_tree <- paste("gbmtree_2_", Analytical_ML[, 25], "_",
                                        Analytical_ML[, 26], "_", Analytical_ML[, 12], sep = "")
      country_combinations <- data.frame(unique(Analytical_ML$model_name))
      Prediction <- NULL
      temp <- NULL

      for (t in 1:nrow(country_combinations)) {
        set.seed(1234)

        # Check Here
        # Records that belong to the current model.
        Analytical_test_CC <- Analytical_ML[Analytical_ML$model_name %in% country_combinations[t, ], ]
        model <- paste("GBM_Model_2_", Analytical_test_CC[1, 25], "_", Analytical_test_CC[1, 26], "_", Analytical_test_CC[1, 12], sep = "")
        tree <- paste("gbmtree_2_", Analytical_test_CC[1, 25], "_", Analytical_test_CC[1, 26], "_", Analytical_test_CC[1, 12], sep = "")

        # Floor the predictions; any error yields the sentinel value 99999.
        Prediction <- tryCatch({floor(predict.gbm(get(model), Analytical_test_CC, n.trees = get(tree)))},
                               error = function(e) {99999})
        print(Prediction)

        # Clamp negative predictions to 0.
        Prediction[Prediction < 0 & Prediction != ""] <- 0
        Analytical_test_CC <- cbind(Analytical_test_CC, Prediction)
        temp <- rbind(temp, Analytical_test_CC)
      }
    }

    temp$Prediction

The 3 records sent to both servers are exactly the same, and the factor levels for the records are properly synced on both servers. Moreover, out of around 100K predictions, the difference appears in only around 0.5% of records, without any apparent pattern.

############ Results - server1 - Predictions - Server config - R3.4.1 - gbm 2.1.3

       Prediction
729286        147
730285        147
731766        147

############### Results - server2 - Predictions - Server config - R3.4.1 - gbm 2.1.3

       Prediction
729286        144
730285        144
731766        144
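One way to narrow down a discrepancy like this is to compare the raw predictions from both servers before `floor()` is applied, since flooring hides how far apart the underlying values really are. The following base-R sketch uses made-up record IDs and made-up prediction values as stand-ins; in practice each vector could be exported with `saveRDS()` on one machine and loaded with `readRDS()` on the other. The tolerance `tol` is also an arbitrary assumption.

```r
# Stand-ins for raw (unfloored) predictions from each server.
# All IDs and values below are hypothetical, for illustration only.
ids     <- c("r1", "r2", "r3", "r4")
server1 <- c(147.62, 147.10, 147.95, 12.30)
server2 <- c(144.20, 144.75, 144.01, 12.30)

cmp <- data.frame(id = ids, server1 = server1, server2 = server2,
                  stringsAsFactors = FALSE)
cmp$abs_diff      <- abs(cmp$server1 - cmp$server2)
cmp$floor_differs <- floor(cmp$server1) != floor(cmp$server2)

# Keep only rows whose raw values differ beyond floating-point noise.
tol <- 1e-8
mismatch <- cmp[cmp$abs_diff > tol, ]
print(mismatch)
```

Differences on the order of machine precision would point toward numerical libraries or compiler settings, while differences of whole units, as in the results above, would more plausibly point toward differing inputs, such as factor encoding. Comparing the output of `sessionInfo()` on both servers (locale, attached package versions) is a useful companion check.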

Thanks in advance !!!!