erblast / easyalluvial

create alluvial plots with a single line of code
https://erblast.github.io/easyalluvial/

Does not work on the titanic dataset #13

Closed edvardoss closed 5 years ago

edvardoss commented 5 years ago

I was impressed by your blog post about model interpretation and tried to test this package on the popular “titanic” dataset, but all my attempts failed.

install.packages("titanic") # only data in package
data("titanic_train",package="titanic")
library(tidyverse)
str(titanic_train)

d <- titanic_train %>% as_tibble %>%
  mutate(title=str_replace_all(string = Name, # extract title as general feature
                               pattern = "^[[:alpha:][:space:]'-]+,\\s+(the\\s)?(\\w+)\\..+",
                               replacement = "\\2")) %>%
  mutate(title=str_trim(title),
         title=case_when(title %in% c('Mlle','Ms')~'Miss', # normalize some titles
                         title=='Mme'~ 'Mrs',
                         title %in% c('Capt','Don','Major','Sir','Jonkheer', 'Col')~'Sir',
                         title %in% c('Dona', 'Lady', 'Countess')~'Lady',
                         TRUE~title)) %>%
  mutate(title=as_factor(title),
         Survived=factor(Survived,levels = c(0,1),labels=c("no","yes")),
         Sex=as_factor(Sex),
         Pclass=factor(Pclass,ordered = T)) %>%
  group_by(title) %>% # impute Age by median in current title
  mutate(Age=replace_na(Age,replace = median(Age,na.rm = T))) %>% ungroup
table(d$title,d$Sex) # look at the title distribution
caret::nearZeroVar(x = d,saveMetrics = T) # search for and drop some useless features (PassengerId,Name,Ticket)
d <- d %>% select_at(vars(-c(PassengerId,Name,Ticket)))
d %>% summarise_all(~sum(is.na(.))) # control NAs

library(ranger)
m <- ranger(formula = Survived~.,data = d,mtry = 6,min.node.size = 5, num.trees = 600,
            importance = "permutation")

library(easyalluvial)
imp <- importance(m) %>% as.data.frame %>% tidy_imp(imp = .,df=d)
alluvial_wide(data = select(d,Survived,title,Pclass,Sex,Fare),fill_by = "first_variable") # ok, it works, but I want to describe the model (not the data)

gds <- get_data_space(df = d,imp,degree = 4) # Error in Summary.factor(c(1L, 2L, 3L, 2L, 1L, 1L, 1L, 4L, 2L, 2L, 3L,  : ‘max’ not meaningful for factors

# ok, don't give up and try caret
library(caret)
trc <- trainControl(method = "none")
m <- train(Survived~.,data = d,method="rf",trControl=trc,importance=T)
alluvial_model_response_caret(train = m,degree = 4,bins=5,stratum_label_size = 2.8) # Error in tidy_imp(imp, df) : not all listed important variables found in input data

erblast commented 5 years ago

Thanks for reporting. I did not think to test with an all-factor dataset. I will fix this as soon as possible.
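For readers who want to see the kind of failure reported above, here is a minimal sketch in base R. It only illustrates why calling max on an unordered factor raises the "not meaningful for factors" error; the toy vectors are hypothetical, and it is an assumption that the package hit this error in a comparable way.

```r
# Hypothetical toy factors, not taken from the titanic data
f_unordered <- factor(c("Mr", "Mrs", "Miss", "Mr"))
f_ordered   <- factor(c(1, 3, 2), ordered = TRUE)

# max() is not defined for unordered factors, so capture the error message
res_unordered <- tryCatch(
  max(f_unordered),
  error = function(e) conditionMessage(e)
)
print(res_unordered)

# Ordered factors have a defined ordering, so max() works
res_ordered <- max(f_ordered)
print(res_ordered)
```

Code that computes ranges over all columns therefore needs to treat factor columns separately from numeric ones.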

edvardoss commented 5 years ago

get_data_space now works, thank you! But the next step does not.

library(ranger)
m <- ranger(formula = Survived~.,data = d,mtry = 6,min.node.size = 5, num.trees = 600,
            importance = "permutation")
library(easyalluvial)
imp <- importance(m) %>% as.data.frame %>% tidy_imp(imp = .,df=d)

dspace <- get_data_space(df = d,imp,degree = 4) # Work!
pred = predict(m, data = dspace)
p = alluvial_model_response(pred, dspace, imp, degree = 4) # Error in alluvial_model_response: "pred" needs to be a numeric or a factor vector

erblast commented 5 years ago

Fixing some issues that arise when the training data contains character and factor columns: eb74c372c7218ea393288cfe42e454bf91220fe4

erblast commented 5 years ago

Hi, sorry, I am not checking back on this as frequently as I would like to. The problem is with predict in the ranger package: it does not return the bare predictions but an object that needs to be indexed to get to the predictions.

try: p = alluvial_model_response(pred = pred$predictions, dspace = gds, imp = imp, degree = 4)

This works for me. Could you install the most recent development version and tell me if it works for you now, including the caret bit?

Thanks for reporting this, it uncovered a few issues when using factors that I should have anticipated. I have added your example as a new test case. It will go to CRAN in the next two weeks hopefully.
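The indexing step can be sketched as follows. This is a minimal illustration that uses the built-in iris data as a stand-in (an assumption made only to keep the example self-contained); the names m_demo and pred_demo are hypothetical.

```r
library(ranger)

# Stand-in data: iris replaces the titanic data here for brevity
set.seed(1)
m_demo <- ranger(Species ~ ., data = iris, num.trees = 50)

# predict() on a ranger model returns a "ranger.prediction" object
pred_demo <- predict(m_demo, data = iris)
print(class(pred_demo))

# The object itself is not a factor or numeric vector ...
print(is.factor(pred_demo))

# ... but its $predictions element is, with one entry per row
print(is.factor(pred_demo$predictions))
print(length(pred_demo$predictions) == nrow(iris))
```

Passing pred_demo$predictions rather than pred_demo is what satisfies the "numeric or a factor vector" check.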

edvardoss commented 5 years ago

Hi! Yes, I have installed the latest dev version. Sorry about ranger::predict, I had not properly checked the type of that object. Thank you for your answer, it works well! But caret still generates an error for me:

# ok, don't give up and try caret
devtools::install_local(path = "C:\\Users\\AnanevHA\\Downloads\\easyalluvial-master",force = TRUE)
library(caret)
trc <- trainControl(method = "none")
m <- train(Survived~.,data = d,method="rf",trControl=trc,importance=T)
library(easyalluvial)
alluvial_model_response_caret(train = m,degree = 4,bins=5,stratum_label_size = 2.8) # Error in tidy_imp(imp, df) : not all listed important variables found in input data

erblast commented 5 years ago

Could you make sure that you have the latest dev version installed? devtools::install_github('https://github.com/erblast/easyalluvial.git')

When you execute easyalluvial::tidy_imp you see the function source code. You should find the following lines.

 # correct dummyvariable names back to original name

  df_ori_var = tibble( ori_var = names( select_if(df, ~ is.factor(.) | is.character(.) ) ) ) %>%

Note that the | is.character(.) condition was added. This should resolve the error you were getting. Let me know how it goes.
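A small sketch of what that extra condition changes, using a hypothetical toy tibble rather than the real function internals: selecting only factors misses character columns, while the combined predicate picks up both.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: one factor, one character and one numeric column
df_demo <- tibble(
  Sex      = factor(c("male", "female")),
  Embarked = c("S", "C"),
  Fare     = c(7.25, 71.28)
)

# Factor columns only: the character column Embarked is missed
only_factors <- names(select_if(df_demo, is.factor))
print(only_factors)

# Factor or character columns, as in the lines quoted above
fct_or_chr <- names(select_if(df_demo, ~ is.factor(.) | is.character(.)))
print(fct_or_chr)
```

In the titanic example some columns (for instance Embarked) are still character vectors, which is presumably why the dummy variable names derived from them could not be mapped back before this change.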

edvardoss commented 5 years ago

Hi! Everything is working, thank you!