ericka-howard / masters-project-lcz-classification

A project assessing suitability of random forests for Local Climate Zone classification in Hong Kong.

summarise vs. mutate vs. separate calculations in get_f1_score() function #3

Closed ericka-howard closed 3 years ago

ericka-howard commented 3 years ago

https://github.com/erickabsmith/masters-project-lcz-classification/blob/8ca39e4c3bb025b98fe4d6026ce009606014178e/R/get_f1_score.R#L1-L24

@cwickham I think my get_f1_score() function isn't quite right. I keep running into the same problem: I want to use summarize() but also need to do several calculations. I don't think summarize_at() is right, but I also think mutate() would give the wrong numbers, since I want to count rows in steps. The formulas for UA, PA, and F-1 are:

[image: formulas for UA, PA, and F-1]
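The formulas themselves are not preserved in the text. Judging from the code later in this thread, they are presumably the usual per-class definitions below. The symbols are notation chosen for this sketch: for a class c, n_correct(c) counts pixels whose true and predicted class are both c, n_predicted(c) counts pixels predicted as c, and n_true(c) counts pixels whose true class is c.

```latex
\[
\mathrm{UA}_c = \frac{n_{\mathrm{correct}}(c)}{n_{\mathrm{predicted}}(c)},
\qquad
\mathrm{PA}_c = \frac{n_{\mathrm{correct}}(c)}{n_{\mathrm{true}}(c)},
\qquad
F_{1,c} = \frac{2\,\mathrm{UA}_c\,\mathrm{PA}_c}{\mathrm{UA}_c + \mathrm{PA}_c}
\]
```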

Not as important, but in the same vein: is the convention to keep those two functions (do_f1_calculations() and get_f1_score()) in the same file since they depend on each other, or should they still be separate .R files?

cwickham commented 3 years ago

There is no problem returning multiple items with summarise(). As with mutate(), you just define more than one column. summarise() is definitely what you want here, because you are going from many rows (e.g. many pixels) down to a single-row summary (one set of numbers for all the pixels where either the true or predicted LCZ is current_lcz).

Two other things to be careful with:

Implementing those changes gives:

library(tidyverse)  # provides dplyr, forcats, purrr and tibble, which the code below uses

do_f1_calculations <- function(dat, current_lcz){
  # Convert current_lcz to a factor with the same levels as the data,
  # so the comparisons below are factor-to-factor
  current_lcz <- factor(
    current_lcz,
    levels = lvls_union(dat %>% select(lcz, lcz_predicted))
  )
  dat %>%
    filter(lcz == current_lcz | lcz_predicted == current_lcz) %>%
    summarise(
      # UA: correct predictions / all pixels predicted as current_lcz
      ua = sum(lcz == current_lcz & lcz_predicted == current_lcz) /
        sum(lcz_predicted == current_lcz),
      # PA: correct predictions / all pixels that truly are current_lcz
      pa = sum(lcz == current_lcz & lcz_predicted == current_lcz) /
        sum(lcz == current_lcz),
      f1 = (2 * ua * pa) / (ua + pa)
    )
}

dat <- tibble(
  lcz = factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
  lcz_predicted = factor(c(1, 1, 1, 2, 1, 2, 2, 2))
)
do_f1_calculations(dat, 1)
# # A tibble: 1 x 3
#      ua    pa    f1
#   <dbl> <dbl> <dbl>
# 1  0.75  0.75  0.75

Then I'd use map_dfr() in get_f1_score(), so it looks like:

get_f1_score <- function(dat){
  map_dfr(1:17, ~do_f1_calculations(dat, .x))
}
get_f1_score(dat)

With the output:

# A tibble: 17 x 3
       ua     pa     f1
    <dbl>  <dbl>  <dbl>
 1   0.75   0.75   0.75
 2   0.75   0.75   0.75
 3 NaN    NaN    NaN   
 4 NaN    NaN    NaN   
 5 NaN    NaN    NaN   
 6 NaN    NaN    NaN   
 7 NaN    NaN    NaN   
 8 NaN    NaN    NaN   
 9 NaN    NaN    NaN   
10 NaN    NaN    NaN   
11 NaN    NaN    NaN   
12 NaN    NaN    NaN   
13 NaN    NaN    NaN   
14 NaN    NaN    NaN   
15 NaN    NaN    NaN   
16 NaN    NaN    NaN   
17 NaN    NaN    NaN 

(All the NaNs here are just because most of the LCZ classes are not present in the input data.)
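To see where the NaN values come from, here is a minimal base R sketch (the names n_correct, n_predicted and ua_absent are just for illustration). For a class that never appears, both counts are zero, and R evaluates 0 / 0 to NaN.

```r
# Same toy data as above, as plain vectors
lcz           <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))
lcz_predicted <- factor(c(1, 1, 1, 2, 1, 2, 2, 2))

# Class 3 never occurs, so numerator and denominator are both 0
n_correct   <- sum(lcz == "3" & lcz_predicted == "3")
n_predicted <- sum(lcz_predicted == "3")

ua_absent <- n_correct / n_predicted  # 0 / 0
is.nan(ua_absent)
```

If the NaN rows are unwanted, one option is to loop only over the classes that actually occur in the data rather than over 1:17.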

I have a suggestion for an alternative way to implement this; I'll write something up before we meet.

cwickham commented 3 years ago

Oh, and on the side note: there is no need for these to be in the same file, but I think it makes sense to keep them together. They are so interconnected that if you are editing one, you probably want to be able to see the other.

cwickham commented 3 years ago

If you want to avoid the explicit loop entirely, another approach is to get the summary counts by grouping by the true and predicted labels separately, then joining.

E.g.

get_f1_score <- function(dat){
  # values for PA
  pa <- dat %>% 
    group_by(lcz = lcz) %>% 
    summarise(
      n_true = n(),
      n_correct = sum(lcz == lcz_predicted)
    )

  # denominator for UA
  ua <- dat %>% 
    group_by(lcz = lcz_predicted) %>% 
    summarise(
      n_predicted = n()
    )

  # do calculations
  pa %>% 
    left_join(ua, by = "lcz") %>% 
    mutate(
      ua = n_correct/n_predicted,
      pa = n_correct/n_true,
      f1 = (2 * ua * pa) / (ua + pa))
}
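As a sanity check on the grouped counts, the same quantities can be read off a confusion matrix. This is only a base R sketch on the toy data from earlier in the thread. The names cm, n_correct, pa, ua and f1 are illustrative, and it assumes both factors share the same levels so that the table is square.

```r
lcz           <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))
lcz_predicted <- factor(c(1, 1, 1, 2, 1, 2, 2, 2))

# Rows are true classes, columns are predicted classes
cm <- table(true = lcz, predicted = lcz_predicted)

n_correct <- diag(cm)                 # true == predicted, per class
pa <- n_correct / rowSums(cm)         # divide by number of true pixels
ua <- n_correct / colSums(cm)         # divide by number of predicted pixels
f1 <- (2 * ua * pa) / (ua + pa)

data.frame(lcz = names(n_correct), ua = ua, pa = pa, f1 = f1)
```

For this toy data every value comes out to 0.75, matching the first two rows of the map_dfr() output.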

One downside of this (and of your approach) is the assumption that the columns lcz and lcz_predicted exist in the input data. I'd at least add a check for this. Alternatively, this is where you could use data masking and force the user to specify which columns contain the true and predicted values. That looks something like:

get_f1_score <- function(dat, true, predicted){
  # values for PA
  pa <- dat %>% 
    group_by(lcz = {{ true }}) %>% 
    summarise(
      n_true = n(),
      n_correct = sum({{ true }} == {{ predicted }})
    )

  # denominator for UA
  ua <- dat %>% 
    group_by(lcz = {{ predicted }}) %>% 
    summarise(
      n_predicted = n()
    )

  # do calculations
  pa %>% 
    left_join(ua, by = "lcz") %>% 
    mutate(
      ua = n_correct/n_predicted,
      pa = n_correct/n_true,
      f1 = (2 * ua * pa) / (ua + pa))
}

get_f1_score(dat, true = lcz, predicted = lcz_predicted)
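For the version that hard-codes lcz and lcz_predicted, the column check mentioned above could look roughly like this. check_lcz_columns() is a hypothetical helper, not something in the repository; it is a base R sketch that stops with a readable message when a required column is missing.

```r
# Hypothetical helper: fail early if `dat` lacks any required column
check_lcz_columns <- function(dat, required = c("lcz", "lcz_predicted")) {
  missing_cols <- setdiff(required, names(dat))
  if (length(missing_cols) > 0) {
    stop(
      "`dat` is missing required column(s): ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(dat)  # return the input unchanged so the check can sit in a pipe
}

good <- data.frame(lcz = factor(c(1, 2)), lcz_predicted = factor(c(1, 1)))
check_lcz_columns(good)  # passes silently

bad <- data.frame(lcz = factor(c(1, 2)))
result <- tryCatch(
  check_lcz_columns(bad),
  error = function(e) conditionMessage(e)
)
result
```

Calling it as the first line of get_f1_score(dat) would turn a confusing "object not found" error deeper in the pipeline into a direct message about the missing column.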