H2O multiclass : Error in contribution[nrow(contribution), ] <- cumulative[nrow(contribution), : incorrect number of subscripts on matrix

boral commented 1 year ago

I am trying to use DALEX on my data and get the following error on this line:

pb_h2o_automl <- predict_parts(explainer_h2o_automl,new_observation = new_date_birth,type="break_down")

Error

Error in contribution[nrow(contribution), ] <- cumulative[nrow(contribution),  : 
  incorrect number of subscripts on matrix

Code

rm( list = ls() )

library(DALEX) ; library(h2o) ; library(DALEXtra) ; library(readxl) ; library(dplyr)
set.seed(17)

setwd( 'E:\\projects\\political_analysis' )

df0 = read_excel('training.xlsx')

df0$age = as.numeric( df0$age)

df1 <- df0[c("area", "district", "assembly_constituency", "gender", "age", "party_assembly_election_2018",
             "party_current_year_election", "chief_minister", "leader_vote_for_mla", "benefit_govt_scheme",
             "benefit_current_budget_scheme", "occupation", "education", "social_category", "caste", "caste_other",'party_upcoming_election')]

df1 <- df1 %>% mutate_all(~ifelse(is.na(.), as.character(names(which.max(table(na.omit(.))))), as.character(.))) %>% mutate_at(vars(-age), as.factor)

h2o.init()

target <- "party_upcoming_election"
df <- as.h2o(df1)

model_h2o_automl <- h2o.automl(y = target, training_frame = df, max_models = 5, max_runtime_secs = 600  )

leader_board <- h2o.get_leaderboard(model_h2o_automl)
head(leader_board)

test_df_0 = df1[1,]

explainer_h2o_automl <- DALEXtra::explain_h2o(model = model_h2o_automl, 
                                              data = test_df_0,
                                              y = test_df_0$party_upcoming_election,
                                              label = "h2o automl",
                                              colorize = T)

new_date_birth <- test_df_0 %>% select( - c('party_upcoming_election'))
pb_h2o_automl <- predict_parts(explainer_h2o_automl,new_observation = new_date_birth,type="break_down")

I have pasted the first 50 rows of the data here:

https://pastebin.com/C6ETyJbp

boral commented 1 year ago

@maksymiuks Is there any reason for marking it invalid? I have provided a reproducible example of the error.

https://github.com/ModelOriented/DALEXtra/issues/86 might be a similar issue.

maksymiuks commented 1 year ago

@boral

It is just a note for me that the report does not look accurate from the perspective of this package; there is nothing wrong with your report itself. I should be able to take a look at it tonight.

maksymiuks commented 1 year ago

Hi @boral

It looks like an iBreakDown bug. I've made a PR to the package to fix it: https://github.com/ModelOriented/iBreakDown/pull/101

Until it's merged, you can install iBreakDown from my fork to continue with your work: https://github.com/maksymiuks/iBreakDown/tree/master

Secondly, your target variable has values that are not valid names for data.frame columns. This is not a problem in this case, but it might be in other tasks, because when h2o (and in fact most of the ML tools in R) generates the probability data.frame, it uses the values of y as column names to indicate which column holds the probability of belonging to which class. However, as mentioned before, among the values of y there is Do Not Want to Answer, which h2o transforms to Do.Not.Want.to.Answer. Unfortunately, our tools for multiclass classification have no way of accessing probabilities other than extracting columns from that data.frame by name, and because the names now differ from the values of y, this results in an error.
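To illustrate the naming mismatch described above, here is a minimal sketch in base R. It assumes the renaming behaves like base R's make.names(), which is consistent with the Do.Not.Want.to.Answer example but is not confirmed for h2o in this thread. The labels other than Do Not Want to Answer and the toy_df data frame are made up for illustration.

```r
# Class labels as they might appear in the target column.
# Only "Do Not Want to Answer" comes from the thread; the rest are hypothetical.
labels <- c("Party A", "Do Not Want to Answer", "Undecided")

# make.names() applies base R's rules for syntactically valid names,
# so spaces are replaced with dots.
safe_labels <- make.names(labels)
print(safe_labels)

# The original labels no longer match the generated column names.
mismatched <- labels[labels != safe_labels]
print(mismatched)

# One possible workaround (an assumption, not the maintainer's fix):
# recode the target before training so its levels are already valid names.
toy_df <- data.frame(party_upcoming_election = factor(labels))
levels(toy_df$party_upcoming_election) <-
  make.names(levels(toy_df$party_upcoming_election))
print(levels(toy_df$party_upcoming_election))
```

In the code from the report, the equivalent step would be recoding df1$party_upcoming_election before the call to as.h2o(), so that class labels and probability column names agree.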

As a side note, it makes little sense to pass a single row to the explainer and, in turn, to the predict_parts method. BreakDown estimates the decomposition of the prediction into particular contributions based on the entire dataset; it simply needs data points for that.
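The point about needing data points can be seen in a simplified sketch. This is not the iBreakDown implementation: it uses a fixed variable order, a plain lm() model on the built-in mtcars data, and a hypothetical helper break_down_sketch(), all chosen only for illustration.

```r
# Simplified break-down with a fixed variable order (illustration only).
# Each contribution is the change in the average prediction over the
# background data after fixing one more variable at the observation's value.
break_down_sketch <- function(model, data, new_obs, vars) {
  baseline <- mean(predict(model, data))
  current <- data
  previous <- baseline
  contributions <- numeric(0)
  for (v in vars) {
    current[[v]] <- new_obs[[v]]          # fix variable v for every background row
    avg <- mean(predict(model, current))  # average prediction after fixing v
    contributions[v] <- avg - previous
    previous <- avg
  }
  c(intercept = baseline, contributions)
}

model <- lm(mpg ~ wt + hp, data = mtcars)
new_obs <- mtcars[1, ]

# Background data with many rows: contributions are informative.
full <- break_down_sketch(model, mtcars, new_obs, c("wt", "hp"))
print(full)

# Background data consisting of the explained row only:
# the baseline already equals the prediction, so every contribution is zero.
single <- break_down_sketch(model, mtcars[1, ], new_obs, c("wt", "hp"))
print(single)
```

With a one-row background, nothing is left to attribute to individual variables. In the code from the report this suggests passing df1 (or a larger sample of it) as data to explain_h2o, and keeping the single row only for new_observation.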

ModelOriented / DALEXtra, issue #89