Closed ericka-howard closed 3 years ago
There is no problem returning multiple items with summarise(). Like mutate(), you just define more than one column. summarise() is definitely what you want here, because you are going from many rows (e.g. many pixels) down to a single-row summary (one set of numbers for all the pixels where either the true or predicted LCZ is current_lcz).
Two other things to be careful with:
If current_lcz is to be a factor, it needs to have the same levels as the relevant columns in dat.
n() only counts rows. If you need to count TRUEs in a logical, use sum().
Implementing those changes gives:
library(tidyverse)
library(magrittr)  # for the %<>% assignment pipe

do_f1_calculations <- function(dat, current_lcz){
  # give current_lcz the same levels as the true and predicted columns
  current_lcz %<>% factor(levels = lvls_union(dat %>% select(lcz, lcz_predicted)))
  ua_pa_f1 <- dat %>%
    filter(lcz == current_lcz | lcz_predicted == current_lcz) %>%
    # sum() counts the TRUEs in each logical condition
    summarise(ua = sum(lcz == current_lcz &
                         lcz_predicted == current_lcz) / sum(lcz_predicted == current_lcz),
              pa = sum(lcz == current_lcz &
                         lcz_predicted == current_lcz) / sum(lcz == current_lcz),
              f1 = (2 * ua * pa) / (ua + pa))
  ua_pa_f1
}
dat <- tibble(
  lcz = factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
  lcz_predicted = factor(c(1, 1, 1, 2, 1, 2, 2, 2))
)
do_f1_calculations(dat, 1)
# # A tibble: 1 x 3
#      ua    pa    f1
#   <dbl> <dbl> <dbl>
# 1  0.75  0.75  0.75
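To illustrate the n() versus sum() point above, here is a minimal sketch. The tibble `toy` and the column names `n_rows` and `n_correct` are made up for this example and are not part of the project code.

```r
library(dplyr)

# Hypothetical toy data: four pixels, one of them misclassified
toy <- tibble(
  lcz           = factor(c(1, 1, 2, 2)),
  lcz_predicted = factor(c(1, 2, 2, 2))
)

counts <- toy %>%
  summarise(
    n_rows    = n(),                       # counts rows, regardless of any condition
    n_correct = sum(lcz == lcz_predicted)  # counts the TRUEs in the logical vector
  )

counts
```

Here n_rows should be 4 and n_correct should be 3, since only the second pixel is misclassified.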
Then I'd use map_dfr() in get_f1_score(), so it looks like:
get_f1_score <- function(dat){
  # one row of UA/PA/F1 per LCZ class, row-bound into a single tibble
  map_dfr(1:17, ~do_f1_calculations(dat, .x))
}
get_f1_score(dat)
With the output:
# A tibble: 17 x 3
ua pa f1
<dbl> <dbl> <dbl>
1 0.75 0.75 0.75
2 0.75 0.75 0.75
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
13 NaN NaN NaN
14 NaN NaN NaN
15 NaN NaN NaN
16 NaN NaN NaN
17 NaN NaN NaN
(All the NaNs here are just because the input data didn't have many of the LCZs present.)
I have a suggestion for an alternative way to implement this; I'll write something before we meet.
Oh, and on the side note: there is no need for these functions to be in the same file, but I think it makes sense to keep them in the same file - they are so interconnected that if you are editing one, you probably want to be able to see the other.
If you want to avoid the explicit loop entirely, another approach would be to get the summary counts by grouping by the true and predicted labels separately, then joining the two results.
E.g.
get_f1_score <- function(dat){
  # values for PA
  pa <- dat %>%
    group_by(lcz = lcz) %>%
    summarise(
      n_true = n(),
      n_correct = sum(lcz == lcz_predicted)
    )
  # denominator for UA
  ua <- dat %>%
    group_by(lcz = lcz_predicted) %>%
    summarise(
      n_predicted = n()
    )
  # do calculations
  pa %>%
    left_join(ua, by = "lcz") %>%
    mutate(
      ua = n_correct/n_predicted,
      pa = n_correct/n_true,
      f1 = (2 * ua * pa) / (ua + pa))
}
One downside with this (and your approach) is the assumption that the columns lcz and lcz_predicted exist in the input data. I'd at least add a check for this. Alternatively, this is where you could use data masking, and force the user to specify which columns contain the true and predicted values. That looks something like:
get_f1_score <- function(dat, true, predicted){
  # values for PA
  pa <- dat %>%
    group_by(lcz = {{ true }}) %>%
    summarise(
      n_true = n(),
      n_correct = sum({{ true }} == {{ predicted }})
    )
  # denominator for UA
  ua <- dat %>%
    group_by(lcz = {{ predicted }}) %>%
    summarise(
      n_predicted = n()
    )
  # do calculations
  pa %>%
    left_join(ua, by = "lcz") %>%
    mutate(
      ua = n_correct/n_predicted,
      pa = n_correct/n_true,
      f1 = (2 * ua * pa) / (ua + pa))
}
get_f1_score(dat, true = lcz, predicted = lcz_predicted)
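For the simpler option mentioned above (keeping the hard-coded column names but checking that they exist), a minimal sketch might look like the following. The helper name `check_lcz_columns()` and its `cols` argument are hypothetical, not something from the project; it only uses base R.

```r
# Hypothetical helper (not part of the original code): stop early with a
# clear message if the expected columns are missing from the input data.
check_lcz_columns <- function(dat, cols = c("lcz", "lcz_predicted")) {
  missing_cols <- setdiff(cols, names(dat))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "),
         call. = FALSE)
  }
  # return the data invisibly so the check can sit at the start of a pipe
  invisible(dat)
}
```

Calling this as the first line of get_f1_score() would turn a confusing error from inside summarise() into an explicit message about which column is missing.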
https://github.com/erickabsmith/masters-project-lcz-classification/blob/8ca39e4c3bb025b98fe4d6026ce009606014178e/R/get_f1_score.R#L1-L24
@cwickham I think my get_f1_score() function isn't quite right. It's a problem I keep having where I want to use summarize() but also want to do multiple calculations. I don't think summarize_at() is correct, but I also think mutate() would give the wrong numbers since I'm wanting to count rows in steps. The formulas for UA, PA, and F-1 are:
Not as important but in the same vein: is the convention that I should have those two functions (do_f1_calculations() and get_f1_score()) in the same file since they require each other, or should they still be separate .R files?
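The formulas referred to in the question do not appear in the text of this thread. Judging from the code in the replies, they are presumably the following for each LCZ class $c$, where the count symbols $n_{\text{correct}}$, $n_{\text{predicted}}$ and $n_{\text{true}}$ are taken from the variable names in that code rather than from the original post:

```latex
\mathrm{UA}_c = \frac{n_{\text{correct},c}}{n_{\text{predicted},c}}, \qquad
\mathrm{PA}_c = \frac{n_{\text{correct},c}}{n_{\text{true},c}}, \qquad
\mathrm{F1}_c = \frac{2\,\mathrm{UA}_c\,\mathrm{PA}_c}{\mathrm{UA}_c + \mathrm{PA}_c}
```

Here $n_{\text{correct},c}$ is the number of pixels whose true and predicted class are both $c$, $n_{\text{predicted},c}$ is the number predicted as $c$, and $n_{\text{true},c}$ is the number whose true class is $c$. For the example data above and $c = 1$, this gives $3/4$ for all three quantities, matching the printed output.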