boxuancui / DataExplorer

Automate Data Exploration and Treatment
http://boxuancui.github.io/DataExplorer/

profile_missing doesn't return a Group column (suggested action) anymore #132

Closed AndreaPi closed 4 years ago

AndreaPi commented 4 years ago

profile_missing() is an excellent method which @boxuancui added to DataExplorer to close issue https://github.com/boxuancui/DataExplorer/issues/78. I used it with satisfaction for some time, but recently it became broken for me.

According to ?profile_missing, the dataframe returned by profile_missing() should include a column (once called Group) which contains the suggested action for each feature ("Good", "OK", "Bad", "Remove"):

https://github.com/boxuancui/DataExplorer/blob/master/R/profile_missing.r

#' Profile missing values
#'
#' Analyze missing value profile
#' @param data input data
#' @keywords profile_missing
#' @return missing value profile, such as frequency, percentage and suggested action.
#' @import data.table
#' @export profile_missing
#' @seealso \link{plot_missing}
#' @examples
#' profile_missing(airquality)

This used to work, but now this column is not returned anymore and this is breaking my code:

library(DataExplorer)
profile_missing(airquality)
#>   feature num_missing pct_missing
#> 1   Ozone          37  0.24183007
#> 2 Solar.R           7  0.04575163
#> 3    Wind           0  0.00000000
#> 4    Temp           0  0.00000000
#> 5   Month           0  0.00000000
#> 6     Day           0  0.00000000

Can you fix it? I need the returned dataframe, not a ggplot object, thus plot_missing is unlikely to be of help.

boxuancui commented 4 years ago

Unfortunately, this is by design. profile_missing does only 1 thing, which is to profile and return missing values. Then further analysis is pushed to plot_missing, so that you can either use the default grouping, or customized grouping (See #98 and code change).

However, if you would like to apply the grouping, feel free to wrap it in another function, which is the recommended way to approach this:

library(DataExplorer)
library(data.table)

profile_missing_with_group <- function(data) {
  missing_value <- profile_missing(data.table(data))
  missing_value[pct_missing < 0.05, group := "Good"]
  missing_value[pct_missing >= 0.05 & pct_missing < 0.4, group := "OK"]
  missing_value[pct_missing >= 0.4 & pct_missing < 0.8, group := "Bad"]
  missing_value[pct_missing >= 0.8, group := "Remove"][]
  missing_value
}

profile_missing_with_group(airquality)
#    feature num_missing pct_missing group
# 1:   Ozone          37  0.24183007    OK
# 2: Solar.R           7  0.04575163  Good
# 3:    Wind           0  0.00000000  Good
# 4:    Temp           0  0.00000000  Good
# 5:   Month           0  0.00000000  Good
# 6:     Day           0  0.00000000  Good

Thank you!

AndreaPi commented 4 years ago

Actually, what I really need is the ggplot object returned by plot_missing, but not the plot, because I want to combine multiple such objects in a single plot using the patchwork package. Thus I just modified plot_missing by removing the part that does the actual plotting:

create_plot_missing_object <- function(data, group = list("Good" = 0.05, "OK" = 0.4, "Bad" = 0.8, "Remove" = 1), title = NULL, ggtheme = theme_gray(), theme_config = list("legend.position" = c("bottom"))) {
  ## Declare variable first to pass R CMD check
  pct_missing <- Band <- NULL
  ## Profile missing values
  missing_value <- data.table::data.table(profile_missing(data))
  ## Sort group based on value
  group <- group[sort.list(unlist(group))]
  invisible(lapply(seq_along(group), function(i) {
    if (i == 1) {
      missing_value[pct_missing <= group[[i]], Band := names(group)[i]]
    } else {
      missing_value[pct_missing > group[[i-1]] & pct_missing <= group[[i]], Band := names(group)[i]]
    }
  }))
  ## Create ggplot object
  output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "Band")) +
    geom_bar(stat = "identity") +
    geom_label(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
    scale_fill_discrete("Band") +
    coord_flip() +
    xlab("Features") + ylab("Missing Rows")
}

However, I don't like this solution because my code will of course diverge from yours over time, as you keep developing plot_missing. Could you instead add a flag to plot_missing, to choose whether to draw the plot or only return the ggplot object? Something like:

plot_missing <- function(data, group = list("Good" = 0.05, "OK" = 0.4, "Bad" = 0.8, "Remove" = 1), title = NULL, ggtheme = theme_gray(), theme_config = list("legend.position" = c("bottom")), make_plot = TRUE) {
  ## Declare variable first to pass R CMD check
  pct_missing <- Band <- NULL
  ## Profile missing values
  missing_value <- data.table(profile_missing(data))
  ## Sort group based on value
  group <- group[sort.list(unlist(group))]
  invisible(lapply(seq_along(group), function(i) {
    if (i == 1) {
      missing_value[pct_missing <= group[[i]], Band := names(group)[i]]
    } else {
      missing_value[pct_missing > group[[i-1]] & pct_missing <= group[[i]], Band := names(group)[i]]
    }
  }))
  ## Create ggplot object
  output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "Band")) +
    geom_bar(stat = "identity") +
    geom_label(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
    scale_fill_discrete("Band") +
    coord_flip() +
    xlab("Features") + ylab("Missing Rows")

  if (make_plot) {
    ## Plot object
    class(output) <- c("single", class(output))
    plotDataExplorer(
      plot_obj = output,
      title = title,
      ggtheme = ggtheme,
      theme_config = theme_config
      )
  }
}

boxuancui commented 4 years ago

Maybe I am missing something, but the returned object is already a ggplot object:

library(DataExplorer)
library(ggplot2)

## Set return object
out <- plot_missing(airquality)

## Check if return is ggplot object
is.ggplot(out)
# TRUE

## Make additional changes to the returned ggplot object
out + theme_minimal()

AndreaPi commented 4 years ago

As I said before:

what I really need is the ggplot object returned by plot_missing, but not the plot,

plot_missing doesn't simply return a ggplot object. It also has the side effect of drawing a plot. In my case, I only need the ggplot object, and I don't want any plot to be drawn to the screen.

If you can add a flag to plot_missing, controlling if (a ggplot object is created and an actual plot is drawn) or (a ggplot object is created but no plot is drawn), as in my example code above, this won't impact existing code, and will also satisfy my use case.

boxuancui commented 4 years ago

I am still having trouble following your ask. Your code returns nothing (if set make_plot to FALSE), which will give you NULL if you assign it to something. Alternatively, if you set make_plot to TRUE, it is identical to the original function.
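The return-value point can be checked in base R without any plotting. The sketch below uses a hypothetical stand-in function `f` (not part of DataExplorer) with the same shape as the proposed code: an assignment followed by a trailing `if` with no `else`. A function returns the value of the last expression evaluated, not the last variable assigned, and an `if` whose condition is FALSE and that has no `else` evaluates to NULL.

```r
## Hypothetical stand-in with the same shape as the proposed plot_missing():
## an assignment, then a trailing `if` without `else`.
f <- function(make_plot = TRUE) {
  output <- "ggplot object placeholder"
  if (make_plot) {
    "value of the if branch"
  }
}

## With make_plot = FALSE the `if` is the last expression and yields NULL,
## so `output` is never returned.
res_false <- f(make_plot = FALSE)
is.null(res_false)

## With make_plot = TRUE the function returns whatever the `if` body returns.
res_true <- f(make_plot = TRUE)
```

Adding an explicit `return(output)` (or a bare `output`) as the last line is what makes the object come back in both cases.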

If all you want is to not plot but just return the object, why not wrap invisible around the function call? Example:

out <- invisible(plot_missing(airquality))
is.ggplot(out)
# TRUE

It still returns the ggplot object, but does not draw anything.

AndreaPi commented 4 years ago

Your code returns nothing (if set make_plot to FALSE), which will give you NULL if you assign it to something.

Shouldn't it return output, since it was the last assigned variable?

If all you want is to not plot but just return the object, why not wrap invisible around the function call?

Because I didn't know about invisible 😬 I'll test it, and if it works, I'll close the issue. Thanks!!!

AndreaPi commented 4 years ago

Nope! Not working. I just tried

> out <- invisible(plot_missing(airquality))

in RStudio, and this plot was generated:

[image: the generated plot]

invisible is not preventing the plot from being drawn. And this is especially annoying in my use case, because I'm using plot_missing multiple times in an R Markdown document, so I'm polluting it with a lot of plots that I don't want shown. I only need the corresponding ggplot objects, which I then patch together in a single plot using the patchwork package.

AndreaPi commented 4 years ago

I confirm that using invisible doesn't solve the issue. Instead, this modification to your code works and (when make_plot=FALSE) it just returns the plot object, without drawing anything:

plot_missing <- function(data, group = list("Good" = 0.05, "OK" = 0.4, "Bad" = 0.8, "Remove" = 1), title = NULL, ggtheme = theme_gray(), theme_config = list("legend.position" = c("bottom")), make_plot = TRUE) {
  ## Declare variable first to pass R CMD check
  pct_missing <- Band <- NULL
  ## Profile missing values
  missing_value <- data.table::data.table(profile_missing(data))
  ## Sort group based on value
  group <- group[sort.list(unlist(group))]
  invisible(lapply(seq_along(group), function(i) {
    if (i == 1) {
      missing_value[pct_missing <= group[[i]], Band := names(group)[i]]
    } else {
      missing_value[pct_missing > group[[i-1]] & pct_missing <= group[[i]], Band := names(group)[i]]
    }
  }))
  ## Create ggplot object
  output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "Band")) +
    geom_bar(stat = "identity") +
    geom_label(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
    scale_fill_discrete("Band") +
    coord_flip() +
    xlab("Features") + ylab("Missing Rows")

  if (make_plot) {
    ## Plot object
    class(output) <- c("single", class(output))
    plotDataExplorer(
      plot_obj = output,
      title = title,
      ggtheme = ggtheme,
      theme_config = theme_config
      )
  }
  return(output)
}

boxuancui commented 4 years ago

I think I understand your request finally, but there is no easy solution. All plotting functions are passed to individual S3 methods. In this case, plotDataExplorer.single. Your code works for your case, but will break everything else, e.g., themes and title can no longer be customized, since these are wrapped inside that S3 method. create_report will break on plot_missing as well if themes/title needs to be customized.

I also realized why invisible doesn't work: a print is called explicitly here: https://github.com/boxuancui/DataExplorer/blob/master/R/plot.r#L90.
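The same behaviour can be reproduced in base R. In this minimal sketch, `noisy` is a hypothetical function (not DataExplorer code) that calls print explicitly and then returns its object. invisible() only suppresses the automatic printing of a returned value at top level; it has no effect on side effects that already happened inside the call.

```r
## Hypothetical function that, like the plotting method discussed above,
## prints explicitly before returning its object.
noisy <- function() {
  x <- "object"
  print(x)  # explicit side effect, runs regardless of invisible()
  x
}

## capture.output() collects whatever is written to the console.
## The assignment itself prints nothing, so anything captured here
## comes from the explicit print() inside noisy().
printed <- capture.output(out <- invisible(noisy()))

printed  # the line emitted by print(x)
out      # the returned object is still available
```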

Now, I do not know your workflow using patchwork or rmarkdown, but I am sure there is a much easier way to ignore printed charts and knit ggplot objects later on. Alternatively, you can manually overwrite this function and use your own version, as other people have been doing.

AndreaPi commented 4 years ago

Now, I do not know your workflow using patchwork or rmarkdown, but I am sure there is a much easier way to ignore printed charts and knit ggplot objects later on.

Good suggestion: this works, and I don't have to overwrite your function, so I'm not going to incur any regressions. Thanks!
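The thread does not say which approach was finally used. One possible way to ignore the drawn charts, offered here only as a sketch, is to evaluate the call while a null graphics device is open, so that any explicit print draws nowhere. `quiet_plot` is a hypothetical helper name, not a DataExplorer function; it relies on `pdf(NULL)`, which opens a device without creating a file. In an R Markdown document, a knitr chunk option such as `fig.show = "hide"` may be a simpler alternative.

```r
## Hypothetical helper: evaluate an expression while a null PDF device is
## active, then close that device again. Assumes the expression draws on
## the current graphics device.
quiet_plot <- function(expr) {
  grDevices::pdf(NULL)                       # null device, no file is written
  on.exit(grDevices::dev.off(), add = TRUE)  # always close it afterwards
  force(expr)                                # lazy argument is evaluated here
}

## Base R check: the plot is drawn on the null device and the value of the
## expression is still returned.
n_before <- length(grDevices::dev.list())
res <- quiet_plot({
  plot(1:3)
  "returned value"
})
n_after <- length(grDevices::dev.list())

## Intended use with the packages from this thread (not run here; assumes
## DataExplorer and patchwork are installed):
## p1 <- quiet_plot(plot_missing(airquality))
## p2 <- quiet_plot(plot_missing(iris))
## p1 + p2
```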