Closed AndreaPi closed 4 years ago
Unfortunately, this is by design. `profile_missing` does only one thing: it profiles and returns missing values. Further analysis is pushed to `plot_missing`, so that you can use either the default grouping or a customized grouping (see #98 and the code change).
However, if you would like to apply the grouping, feel free to wrap it in another function, which is the recommended approach:
library(DataExplorer)
library(data.table)

profile_missing_with_group <- function(data) {
  missing_value <- profile_missing(data.table(data))
  missing_value[pct_missing < 0.05, group := "Good"]
  missing_value[pct_missing >= 0.05 & pct_missing < 0.4, group := "OK"]
  missing_value[pct_missing >= 0.4 & pct_missing < 0.8, group := "Bad"]
  missing_value[pct_missing >= 0.8, group := "Remove"][]
  missing_value
}

profile_missing_with_group(airquality)
#    feature num_missing pct_missing group
# 1:   Ozone          37  0.24183007    OK
# 2: Solar.R           7  0.04575163  Good
# 3:    Wind           0  0.00000000  Good
# 4:    Temp           0  0.00000000  Good
# 5:   Month           0  0.00000000  Good
# 6:     Day           0  0.00000000  Good
Thank you!
Actually, what I really need is the `ggplot` object returned by `plot_missing`, but not the plot, because I want to combine multiple such objects in a single plot using the `patchwork` package. Thus I just modified `plot_missing` by removing the part that does the actual plotting:
create_plot_missing_object <- function(data,
    group = list("Good" = 0.05, "OK" = 0.4, "Bad" = 0.8, "Remove" = 1),
    title = NULL,
    ggtheme = theme_gray(),
    theme_config = list("legend.position" = c("bottom"))) {
  ## Declare variable first to pass R CMD check
  pct_missing <- Band <- NULL
  ## Profile missing values
  missing_value <- data.table::data.table(profile_missing(data))
  ## Sort group based on value
  group <- group[sort.list(unlist(group))]
  invisible(lapply(seq_along(group), function(i) {
    if (i == 1) {
      missing_value[pct_missing <= group[[i]], Band := names(group)[i]]
    } else {
      missing_value[pct_missing > group[[i-1]] & pct_missing <= group[[i]], Band := names(group)[i]]
    }
  }))
  ## Create ggplot object
  output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "Band")) +
    geom_bar(stat = "identity") +
    geom_label(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
    scale_fill_discrete("Band") +
    coord_flip() +
    xlab("Features") + ylab("Missing Rows")
}
However, I don't like this solution because my code will of course diverge from yours over time, as you keep developing `plot_missing`. Could you instead add a flag to `plot_missing` to choose whether to draw the plot or only return the `ggplot` object? Something like:
plot_missing <- function(data,
    group = list("Good" = 0.05, "OK" = 0.4, "Bad" = 0.8, "Remove" = 1),
    title = NULL,
    ggtheme = theme_gray(),
    theme_config = list("legend.position" = c("bottom")),
    make_plot = TRUE) {
  ## Declare variable first to pass R CMD check
  pct_missing <- Band <- NULL
  ## Profile missing values
  missing_value <- data.table(profile_missing(data))
  ## Sort group based on value
  group <- group[sort.list(unlist(group))]
  invisible(lapply(seq_along(group), function(i) {
    if (i == 1) {
      missing_value[pct_missing <= group[[i]], Band := names(group)[i]]
    } else {
      missing_value[pct_missing > group[[i-1]] & pct_missing <= group[[i]], Band := names(group)[i]]
    }
  }))
  ## Create ggplot object
  output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "Band")) +
    geom_bar(stat = "identity") +
    geom_label(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
    scale_fill_discrete("Band") +
    coord_flip() +
    xlab("Features") + ylab("Missing Rows")
  if (make_plot) {
    ## Plot object
    class(output) <- c("single", class(output))
    plotDataExplorer(
      plot_obj = output,
      title = title,
      ggtheme = ggtheme,
      theme_config = theme_config
    )
  }
}
Maybe I am missing something, but the returned object is already a `ggplot` object:
library(DataExplorer)
library(ggplot2)
## Set return object
out <- plot_missing(airquality)
## Check if return is ggplot object
is.ggplot(out)
# TRUE
## Make additional changes to the returned ggplot object
out + theme_minimal()
As I said before:

> what I really need is the ggplot object returned by plot_missing, but not the plot,

`plot_missing` doesn't simply return a `ggplot` object. It also has the side effect of drawing a plot. In my case, I only need the `ggplot` object, and I don't want any plot to be drawn on screen.
If you can add a flag to `plot_missing` controlling whether (a `ggplot` object is created and an actual plot is drawn) or (a `ggplot` object is created but no plot is drawn), as in my example code above, this won't impact existing code and will also satisfy my use case.
I am still having trouble following your ask. Your code returns nothing if `make_plot` is set to `FALSE`, which will give you `NULL` if you assign the result to something. Alternatively, if you set `make_plot` to `TRUE`, it is identical to the original function.
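The `NULL` result follows from how R chooses a function's return value: it is the value of the last evaluated expression, and an `if` with a `FALSE` condition and no `else` branch evaluates to an invisible `NULL`. A minimal base-R sketch of that behavior, in which `sketch_plot` and the strings standing in for the ggplot object and the plotting call are hypothetical:

```r
# Hypothetical stand-in for the proposed plot_missing(); no packages needed.
sketch_plot <- function(make_plot = TRUE) {
  output <- "pretend ggplot object"   # assigned, but not the last expression
  if (make_plot) {
    "value of the plotting call"      # stands in for plotDataExplorer(...)
  }
}

res_false <- sketch_plot(make_plot = FALSE)
res_true  <- sketch_plot(make_plot = TRUE)

is.null(res_false)  # TRUE: the `if` without `else` was the last expression
res_true            # the value produced inside the `if` branch
```

Ending the function body with `output` (or `return(output)`) would make it return the object in both cases.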
If all you want is to skip the plot and just return the object, why not wrap `invisible` around the function call? Example:
out <- invisible(plot_missing(airquality))
is.ggplot(out)
# TRUE
It still returns the `ggplot` object, but does not draw anything.
> Your code returns nothing (if set make_plot to FALSE), which will give you NULL if you assign it to something.

Shouldn't it return `output`, since it was the last assigned variable?

> If all you want is to not plot but just return the object, why don't wrap invisible around the function?

Because I didn't know about `invisible` 😬 I'll test it, and if it works, I'll close the issue. Thanks!!!
Nope! Not working. I just tried
> out <- invisible(plot_missing(airquality))
in RStudio, and this plot was generated:
`invisible` is not preventing the plot from being drawn. This is especially annoying in my use case, because I'm using `plot_missing` multiple times in an RMarkdown document, so I'm polluting it with a lot of plots that I don't want shown. I only need the corresponding `ggplot` objects, which I then patch together in a single plot using the `patchwork` package.
I confirm that using `invisible` doesn't solve the issue. Instead, this modification to your code works: when `make_plot = FALSE`, it just returns the plot object without drawing anything:
plot_missing <- function(data,
    group = list("Good" = 0.05, "OK" = 0.4, "Bad" = 0.8, "Remove" = 1),
    title = NULL,
    ggtheme = theme_gray(),
    theme_config = list("legend.position" = c("bottom")),
    make_plot = TRUE) {
  ## Declare variable first to pass R CMD check
  pct_missing <- Band <- NULL
  ## Profile missing values
  missing_value <- data.table::data.table(profile_missing(data))
  ## Sort group based on value
  group <- group[sort.list(unlist(group))]
  invisible(lapply(seq_along(group), function(i) {
    if (i == 1) {
      missing_value[pct_missing <= group[[i]], Band := names(group)[i]]
    } else {
      missing_value[pct_missing > group[[i-1]] & pct_missing <= group[[i]], Band := names(group)[i]]
    }
  }))
  ## Create ggplot object
  output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "Band")) +
    geom_bar(stat = "identity") +
    geom_label(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
    scale_fill_discrete("Band") +
    coord_flip() +
    xlab("Features") + ylab("Missing Rows")
  if (make_plot) {
    ## Plot object
    class(output) <- c("single", class(output))
    plotDataExplorer(
      plot_obj = output,
      title = title,
      ggtheme = ggtheme,
      theme_config = theme_config
    )
  }
  return(output)
}
I think I finally understand your request, but there is no easy solution. All plotting functions are passed to individual S3 methods, in this case `plotDataExplorer.single`. Your code works for your case but will break everything else, e.g., themes and titles can no longer be customized, since these are applied inside that S3 method. `create_report` will break on `plot_missing` as well if the theme or title needs to be customized.
I also realized why `invisible` doesn't work: `print` is called explicitly here: https://github.com/boxuancui/DataExplorer/blob/master/R/plot.r#L90.
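To illustrate the point: `invisible` only marks a value so that R's top-level auto-printing is skipped; it cannot undo output that a function has already produced by calling `print` itself. A base-R sketch, where `draw_and_return` is a hypothetical stand-in for a function that prints explicitly and then returns its object:

```r
# Hypothetical stand-in for a plotting function with an explicit print().
draw_and_return <- function() {
  obj <- "pretend ggplot object"
  print(obj)        # explicit side effect, runs regardless of the caller
  invisible(obj)    # only suppresses auto-printing of the return value
}

# Capture whatever reaches the console while calling through invisible().
printed <- capture.output(out <- invisible(draw_and_return()))

length(printed)   # 1: the explicit print() still ran
out               # the object itself is returned as usual
```

The same reasoning would apply to a ggplot object, where the explicit `print` is what triggers drawing.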
Now, I do not know your workflow using `patchwork` or `rmarkdown`, but I am sure there is a much easier way to ignore printed charts and knit the ggplot objects later on. Alternatively, you can manually overwrite this function and use your own version, as people have been doing.
> Now, I do not know your workflow using patchwork or rmarkdown, but I am sure there is a much easier way to ignore printed charts and knit ggplot objects later on.

Good suggestion: this works, and I don't have to overwrite your function, so I'm not going to incur any regression. Thanks!
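The thread does not say which approach was used, so the following is only one possible sketch: send the drawing to a throwaway graphics device while keeping the returned object. The helper name `without_drawing` is hypothetical; `pdf(NULL)` from base R opens a device that creates no file. In an RMarkdown document, the knitr chunk option `fig.show = 'hide'` is probably the simpler route.

```r
# Hypothetical helper: evaluate `expr` while a null PDF device is active,
# so anything drawn during evaluation is discarded.
without_drawing <- function(expr) {
  grDevices::pdf(NULL)                        # file = NULL: no file is created
  on.exit(grDevices::dev.off(), add = TRUE)   # always close the null device
  force(expr)                                 # the lazy argument is evaluated here
}

# Base-graphics demonstration: the plot goes to the null device only.
devices_before <- length(grDevices::dev.list())
value <- without_drawing({
  plot(1:10)
  "returned value"
})
devices_after <- length(grDevices::dev.list())

# Assumed usage with DataExplorer and patchwork (not run here):
# p1 <- without_drawing(plot_missing(airquality))
# p2 <- without_drawing(plot_missing(iris))
# p1 + p2
```

Because the helper relies on lazy evaluation, the plotting call must be passed as the argument rather than evaluated beforehand.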
`profile_missing()` is an excellent method which @boxuancui added to `DataExplorer` to close issue https://github.com/boxuancui/DataExplorer/issues/78. I used it with satisfaction for some time, but recently it broke for me. According to `?profile_missing`, the dataframe returned by `profile_missing()` should include a column (once called `Group`) which contains the suggested action for each feature (`"Good", "OK", "Bad", "Remove"`): https://github.com/boxuancui/DataExplorer/blob/master/R/profile_missing.r
This used to work, but now this column is not returned anymore and this is breaking my code:
Can you fix it? I need the returned dataframe, not a `ggplot` object, thus `plot_missing` is unlikely to be of help.