ehrlinger / ggRandomForests

Graphical analysis of random forests with the randomForestSRC, randomForest and ggplot2 packages.

make vignette more reproducible #29

Open krz opened 8 years ago

krz commented 8 years ago

Hi, the Survival vignette looks really good, with lots of great plots, but I'm having problems reproducing many of the examples. For example, at the beginning you mention that you prefer "years" to "days" in the pbc data set, yet there is no code showing how you convert it. With a naive pbc$years <- pbc$days/365, I fail in the next part, the gg_survival function example. Next, there is no code for the very nice EDA plots in the vignette. I also could not get the 3D example in Appendix 1 to work: the line partial_time <- do.call(rbind,lapply(partial_pbc_time, gg_partial)) always produces errors. I also always get errors when a plot includes a theme() call. There were some recent updates to randomForestSRC and ggplot2 that may be causing many of these problems.

ehrlinger commented 8 years ago

The package was updated to be compatible with randomForestSRC 2.x and ggplot2 2.x recently. You do have to update all three packages simultaneously though, if you want to use the example data sets included in ggRandomForests (which the vignettes do use).

To keep the vignette length reasonable for the targeted journal, I do not include all the source code in the article. The vignette is compiled from the package source using RMarkdown/knitr, so recreating the examples should be straightforward if we extract the specifics from the knitr source in the vignettes/randomForestSRC-Survival.Rnw file.
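One way to pull the code out of a knitr source file is knitr::purl(), which writes the chunk code to a plain R script. The sketch below runs on a tiny throwaway .Rnw file so that it is self-contained; the file contents and the years variable are made up for illustration. For the real vignette, the input would presumably be vignettes/randomForestSRC-Survival.Rnw.

```r
library(knitr)

# A throwaway .Rnw file standing in for the real vignette source (hypothetical content).
src <- tempfile(fileext = ".Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<convert, echo=FALSE>>=",
  "years <- 730 / 365",
  "@",
  "\\end{document}"
), src)

# Extract only the chunk code; documentation = 0 discards the text chunks.
out <- tempfile(fileext = ".R")
purl(src, output = out, documentation = 0, quiet = TRUE)

code <- readLines(out)
print(code)
```

Chunks with echo=FALSE are still extracted, since echo only controls what is shown in the compiled document.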

1.) The conversions I use start at line 196:

# Convert days to years
pbc$age <- pbc$age/364.24
pbc$years <- pbc$days/364.24

Additionally, you may need to format the treatment variable to create the gg_survival plot.

library(dplyr)  # provides %>% and select()
pbc <- pbc %>% select(-days)
pbc$treatment <- as.numeric(pbc$treatment)
pbc$treatment[which(pbc$treatment == 1)] <- "DPCA"
pbc$treatment[which(pbc$treatment == 2)] <- "placebo"
pbc$treatment <- factor(pbc$treatment)
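The same preparation can be sketched in base R on a toy data frame. This is only an illustration: the toy values are invented, and it assumes treatment arrives coded as 1 and 2, as the recoding above implies.

```r
# Toy stand-in for pbc (hypothetical values; treatment assumed coded 1/2).
toy <- data.frame(
  days      = c(400, 4500, 1012),
  age       = c(21464, 20617, 25594),
  treatment = c(1, 2, 1)
)

days_per_year <- 364.24  # divisor quoted above; 365.25 is the usual calendar figure

toy$age   <- toy$age / days_per_year
toy$years <- toy$days / days_per_year
toy$days  <- NULL  # drop the original column, as select(-days) does

# Map the numeric codes to labels in one step.
toy$treatment <- factor(toy$treatment, levels = c(1, 2),
                        labels = c("DPCA", "placebo"))

str(toy)
```

Using factor() with explicit levels and labels avoids the intermediate character vector and makes the code-to-label mapping visible in one place.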

2.) The EDA plot code starts at line 253 for categorical variables:

## Not displayed ##
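# Use tidyr::gather to transform the data into long format.
# cls is not defined in this excerpt; it is assumed here to hold the column classes.
cls <- sapply(pbc, class)
cnt <- c(which(cls == "numeric" ), which(cls == "integer"))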
fct <- setdiff(1:ncol(pbc), cnt)
fct <- c(fct, which(colnames(pbc) == "years"))
dta <- gather(pbc[,fct], variable, value, -years)

# Plot panels for each categorical covariate, filled by factor level.
# st.labs is not defined in this excerpt; see the vignette source.
ggplot(dta, aes(x = years, fill = value)) +
  geom_histogram(color = "black", binwidth = 1) +
  labs(y = "", x = st.labs["years"]) +
  scale_fill_brewer(palette="RdBu",na.value = "white" ) +
  facet_wrap(~variable, scales = "free_y", nrow = 2) +
  theme(legend.position = "none")

and line 273 for continuous:

## Not displayed ##

# Use tidyr::gather to transform the data into long format.
cnt <- c(cnt, which(colnames(pbc) == "status"))
dta <- gather(pbc[,cnt], variable, value, -years, -status)

# Plot panels for each continuous covariate, colored and shaped by status.
# st.labs, strCol and event.marks are not defined in this excerpt; see the vignette source.
ggplot(dta, aes(x = years, y = value, color = status, shape = status)) +
  geom_point(alpha = 0.4) +
  geom_rug(data = dta[which(is.na(dta$value)),], color = "grey50") +
  labs(y = "", x = st.labs["years"], color = "Death", shape = "Death") +
  scale_color_manual(values = strCol) +
  scale_shape_manual(values = event.marks) +
  facet_wrap(~variable, scales = "free_y", ncol = 4) +
  theme(legend.position = c(0.8, 0.2))
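The gather() calls above do the wide-to-long reshaping that the faceted plots rely on. In current tidyr releases gather() is superseded by pivot_longer(); the sketch below shows the equivalent call on a toy data frame with invented values, not on pbc itself.

```r
library(tidyr)

# Toy stand-in for the categorical columns of pbc (hypothetical values).
toy <- data.frame(
  years     = c(1.1, 12.3, 2.8),
  sex       = c("F", "F", "M"),
  treatment = c("DPCA", "placebo", "DPCA"),
  stringsAsFactors = FALSE
)

# Equivalent of gather(toy, variable, value, -years):
# every column except years is stacked into variable/value pairs.
long <- pivot_longer(toy, cols = -years,
                     names_to = "variable", values_to = "value")
print(long)
```

Each original row yields one long row per stacked column, which is what facet_wrap(~variable) needs to draw one panel per covariate.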

3.) I'll update the appendix code, but it should be a straight copy of the code at line 738.

4.) I'd need to see the errors you get for theme() commands. That may be due to version mismatches with ggplot2.
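To rule out a version mismatch, it may help to report the installed versions of all three packages alongside the error messages. A minimal sketch, which skips any package that is not installed:

```r
pkgs <- c("ggRandomForests", "randomForestSRC", "ggplot2")

# Keep only the packages that are actually installed.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Collect the version string of each installed package.
versions <- vapply(pkgs[installed],
                   function(p) as.character(packageVersion(p)),
                   character(1))
print(versions)
```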

I hope this helps.

krz commented 8 years ago

Thanks for your answer. Now I see it. I was only reading this https://cran.r-project.org/web/packages/ggRandomForests/vignettes/randomForestSRC-Survival.pdf but you obviously have more code in the example sections of the vignette. In my opinion, the tutorial PDF should be usable as-is.

ehrlinger commented 8 years ago

That would require both a journal article and a separate, alternate vignette, which is probably what this will become as I move it forward.

ehrlinger commented 8 years ago

If I wrote the vignettes correctly (I still need to verify that this actually works), it may be very easy to get what you are requesting.

Go to the vignettes directory and edit the vignette you are interested in:

1.) For randomForestSRC-Survival.Rnw, go to the code block at line 83.

2.) Change echo=FALSE to echo=TRUE.

3.) Recompile the vignette using the devtools::build_vignettes() command, or compile it directly using knitr.

This will add at least 4 pages to the vignette. The result will not be completely clear at this point, because I replicate some code within the document (check for eval=FALSE statements). I will look at cleaning some of this up in the near future.
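As a rough sketch of what the echo change amounts to, assuming the block at line 83 sets the global chunk options through knitr::opts_chunk$set() (I have not checked this against the file):

```r
library(knitr)

# Show the code of every chunk that does not override echo itself.
opts_chunk$set(echo = TRUE)
print(opts_chunk$get("echo"))

# Then, from the package root (not run here):
# devtools::build_vignettes()
```

Chunks that set echo=FALSE in their own header would still be hidden, so those would need to be changed individually.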

krz commented 8 years ago

Thanks for your work and your great package!