egenn / rtemis

Advanced Machine Learning and Visualization
https://rtemis.org
GNU General Public License v3.0

dplot3.addtree: Error: syntax error in line 13 near '"' #24

Closed jonas-sk closed 4 years ago

jonas-sk commented 4 years ago

I was playing around with the package a bit, and after creating an AddTree, the visualization won't work:

df.tree <- s.ADDTREE(df, gamma = 5, learning.rate = 0.1, upsample = TRUE)
dplot3.addtree(df.tree)

This results in the following error in the plot window (not the console):

Error: syntax error in line 13 near '"'

As a side note, I don't necessarily need to visualize the model using an interactive HTML graph. Are there any other tree visualization functions that can be used for AddTrees?

egenn commented 4 years ago

Hi, this is likely caused by missing dependencies.

I added a dependency check for data.tree and DiagrammeR: 81406e3b04d0fc80fa4f83f9dc1700d979b4be88

The graph is a pretty lightweight static graph with simple tooltips on mouseover. If you are using RStudio, it's not very different from a base R plot. You can export as image or PDF.

I also added a filename option to save to PDF. If provided, it will also check for packages DiagrammeRsvg and rsvg.
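For reference, all four packages named above are on CRAN, so a minimal way to satisfy the dependency check is a single `install.packages()` call (package names taken from this thread; the last two are only needed for PDF export via the `filename` option):

```r
# Graph rendering dependencies for dplot3.addtree,
# plus the optional PDF-export dependencies.
install.packages(c("data.tree", "DiagrammeR", "DiagrammeRsvg", "rsvg"))
```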

egenn commented 4 years ago

Actually, this is likely due to a problem with DiagrammeR. I will be rewriting the function to use a different graphing library. If the data is not private and you can share it, I can take a look. I have had a small number of trees fail to plot with the current implementation.

jonas-sk commented 4 years ago

Thank you for the quick answer! Dependencies are installed and I didn't get an error after updating rtemis and re-running the code. (I also tried replacing -77, -88, -99 with 77, 88, 99 for the special values but that didn't help either.)

The dataset can be downloaded here: https://send.firefox.com/download/964cbc37abfa3eb8/#idJx7pmRd1x-B9Z6H2kwiA (link expires in 24h). Feel free to play around with it.

(Sidenote: I excluded three variables from the dataset that could be seen as steps towards the outcome variable (if any one of them is 1, the result will automatically be 1), but each of them relates to a different group of variables. Is there any way to include this group/intermediate-variable effect in the tree?)

Are there any other packages that can visualize s.ADDTREE's tree (or tree$mod$addtree.pruned for example) out of the box?

jonas-sk commented 4 years ago

To provide a reproducible example here as well:

library(rtemis)
library(readr)    # read_csv()
library(magrittr) # pipe operator %>%

cases_test.tree <- read_csv("cases_test.csv") %>% 
  preprocess(numeric2factor = TRUE) %>% 
  s.ADDTREE()

dplot3.addtree(cases_test.tree)
plot(cases_test.tree$mod$addtree.pruned)

Results in the following outputs for the last two commands:

Error: syntax error in line 13 near '"'

Error: syntax error in line 7 near '"'

egenn commented 4 years ago

I removed one of the DiagrammeR lines that seemed buggy, and this now works for me. Make sure to get the latest commit.

jonas-sk commented 4 years ago

Thank you! That worked for me. However, I still get the error most of the time when using elevate (sorry that it might take ~1min to run):

cases_test.tree <- read_csv("cases_test.csv") %>% 
  preprocess(numeric2factor = TRUE) %>% 
  elevate(mod = "addtree", resampler = "kfold", n.resamples = 2,
          gamma = c(0.5, 1))

dplot3.addtree(cases_test.tree$mod$elevate.ADDTREE.repeat1$ADDTREE1$mod1)

(I don't think this specific case is necessarily a big issue, because for prediction/display you would probably retrain the model on the full sample. However, it suggests that something is still broken.)

I have a side question I hope you can answer: I have trouble understanding how elevate() performs the nested cross-validation. My understanding is that the outer resamples are themselves resampled for hyperparameter tuning, resulting in e.g. 10 model trainings for two gamma values. The more accurate hyperparameter combination is then picked, the model is retrained on the training sample of the outer resample, and finally tested against its test sample. However, judging by the output, the hyperparameter-optimized model is never retrained on the full training sample (of the outer resample); rather, it is just tested against the test data (of the outer resample) without retraining, correct? I am asking because cases_test.tree$mod$elevate.ADDTREE.repeat1 lists both gamma values.

egenn commented 4 years ago
  1. I ran the code in your example 10 times with no problem. Make sure your code is up to date, start with a clean environment, etc.

  2. elevate always retrains on the full training set after identifying the best combination of tuning parameters.

cases_test.tree$mod$elevate.ADDTREE.repeat1$ADDTREE1$params contains the range of values searched by gridSearchLearn. In your example, this lists two gamma values.

cases_test.tree$mod$elevate.ADDTREE.repeat1$ADDTREE1$mod1$parameters includes the parameters used in this specific model after grid search. This lists a single gamma value.

An upcoming feature will move gridSearch results to a new position in the R6 object and include a verbose summary/print function to clearly display the info currently saved under cases_test.tree$mod$elevate.ADDTREE.repeat1$ADDTREE1$mod1$extra$gridSearch$best.tune. This info is currently printed to the console every time you tune a single model.
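To summarize the three locations in one place, here is a short inspection sketch (the paths are copied from the replies above; it assumes `cases_test.tree` was produced by the elevate() call earlier in this thread):

```r
# Navigate to the first ADDTREE model of the first repeat.
res <- cases_test.tree$mod$elevate.ADDTREE.repeat1$ADDTREE1

res$params           # grid searched by gridSearchLearn (lists both gamma values)
res$mod1$parameters  # parameters of this specific model after grid search (one gamma)
res$mod1$extra$gridSearch$best.tune  # best tuning combination found
```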

jonas-sk commented 4 years ago

Very helpful and fantastic answer! Thank you for your help.

Regarding the bug, I will check again and come back to you if I find a more reproducible example.