inspect_cat and comparison plots

Hi Alastair. Thanks for the package. I've tried to do categorical-comparison plots between two data-frames (the two being partitions of some training data based on target-values).

Some example data might explain my problem a bit better:

Reprex:

library(tibble)
library(dplyr)
library(inspectdf)

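# Toy data: one categorical column `a` and a binary target
df <- tibble(
  a = c(rep("x", 4), rep("y", 2), rep("x", 1), rep("y", 5)),
  target = c(rep(0, 6), rep(1, 6))
)

# Compare the categorical columns of the two target partitions
inspect_cat(
  df %>% filter(target == 0),
  df %>% filter(target == 1)
) %>%
  show_plot()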

This results in the following image:

For category "a", the level "x" is most common when target == 0, and the level "y" is most common when target == 1.

I was wondering whether the level reordering is supposed to work as it does in the figure (x first for the first data-frame, y first for the second), or whether this might be a bug. Do you think it might make more sense for the levels to be ordered by their frequency across the combined data-frames? There are 7 ys and 5 xs here, so y would come first for both data-frames.
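To make the combined-frequency idea concrete, here is a minimal base R sketch on the same toy data. It is not inspectdf code; `shared_order` and `per_target` are names made up for this illustration. It only shows that a single ordering, computed over all rows, can be reused for both partitions.

```r
# Same toy data as in the reprex, built with base R only.
df <- data.frame(
  a = c(rep("x", 4), rep("y", 2), rep("x", 1), rep("y", 5)),
  target = c(rep(0, 6), rep(1, 6))
)

# Count each level of `a` over all rows, most frequent first.
combined <- sort(table(df$a), decreasing = TRUE)
shared_order <- names(combined)

# Per-partition counts, with the columns laid out in the shared order,
# so both partitions list their levels in the same sequence.
per_target <- table(df$target, factor(df$a, levels = shared_order))

print(shared_order)
print(per_target)
```

With this ordering, "y" is listed first for both partitions, even though "x" is the more common level when target == 0.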

My original aim was to quickly identify categorical vars that distinguish positive from negative samples, but this is a bit obscured when scanning down the figure (for a dozen categories), because the levels are presented in an inconsistent order for the two data-frames that are being compared.

Aside: am I correct in thinking that the planned grouped-df API would allow the above without needing to partition the original data-frame, i.e. something like df %>% group_by(target) %>% inspect_cat() %>% show_plot()?

Hey Russ, thanks a lot for the suggestion. The behaviour you've pointed out is intended, but I don't think it's ideal. This is an important point, and something I've been considering.

What I'm thinking I might do is add a new argument to show_plot() that modifies the ordering of categories. One option would replicate what currently happens (maybe as the default), another would sort by overall frequency across both data frames (I think this is your idea), and a third would fix the ordering according to the frequency in whichever data frame is provided as the first argument.
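As a rough sketch of what those three options could mean for a single categorical column, something like the following base R helper illustrates the idea. It is not how show_plot() works or will work: the function `order_levels()` and the method names "separate", "combined" and "first" are hypothetical and exist only for this example.

```r
# Hypothetical helper: given the same categorical column from two data
# frames (as character vectors), return the level order to use for each
# one under one of three ordering rules.
order_levels <- function(x1, x2, method = c("separate", "combined", "first")) {
  method <- match.arg(method)
  # Levels of a vector, most frequent first.
  by_freq <- function(x) names(sort(table(x), decreasing = TRUE))
  switch(method,
    # Each data frame is ordered by its own frequencies.
    separate = list(first = by_freq(x1), second = by_freq(x2)),
    # Both use the order given by the pooled frequencies.
    combined = {
      ord <- by_freq(c(x1, x2))
      list(first = ord, second = ord)
    },
    # Both use the order from the first data frame; levels that only
    # occur in the second are appended at the end.
    first = {
      ord <- union(by_freq(x1), by_freq(x2))
      list(first = ord, second = ord)
    }
  )
}

# The two partitions of column `a` from the reprex.
a0 <- c(rep("x", 4), rep("y", 2))  # target == 0
a1 <- c(rep("x", 1), rep("y", 5))  # target == 1

sep  <- order_levels(a0, a1, "separate")
comb <- order_levels(a0, a1, "combined")
fst  <- order_levels(a0, a1, "first")
```

On the reprex data, "separate" gives x-first for the first partition and y-first for the second, "combined" gives y-first for both, and "first" gives x-first for both.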

I think the grouped-df API would inherit the same issue, because the problem here is in the plotting rather than in the inspect_cat() function itself.

alastairrushworth / inspectdf

inspect_cat and comparison plots #38