bluefoxr / COINr

COINr
https://bluefoxr.github.io/COINr/
Other
25 stars 7 forks source link

Custom Aggregation Function Error: base R sum() #54

Closed gdickens closed 5 months ago

gdickens commented 5 months ago

Firstly, thanks for all your great work creating this package.

I design, construct and analyze indices a fair bit for my work, but have been doing it from the ground up as I didn't know this packaged existed. This package is going to be extremely useful, so thanks.

Issue: I was trying to mirror the examples shown in the documentation, but wanted to use base R's sum function to aggregate index scores at each level without applying weights.

I was expecting the result to be equivalent to producing a grouped summaries at each index level via dplyr. However, Aggregate() appeared to be summing the values at each dimension and adding weights based on the number of variables that were present at each level.

I believe this is because the sum() function doesn't know how to handle the weights so it just adds along with the values (all my weights were set to 1).

When I tried to set w= "none" to account for this the result changed, but they were either incorrect or Aggregate() responded with an error. The solution that worked was to create a custom function that was able to ignore the weights and to pass this to the f_ag argument:


fnc_customSum <- function(x, w) {
  sum(x, na.rm = TRUE)
}

dta_example_coin<-build_example_coin()

rlt_example_coin_custom<-Aggregate(x=dta_example_coin,
                          dset="Raw", 
                          f_ag  ="fnc_customSum")
bluefoxr commented 5 months ago

Hi, can you be a bit more specific what the expected result was with the sum() function and how the results were incorrect? E.g. provide a reproducible example, and what your expected outcome was? I can run your code chunk but the description of the errors before isn't clear to me. Thanks.

bluefoxr commented 5 months ago

Actually I might have spotted a bug here due to some recent changes. I will work on that but in the meantime please send me any examples where you spot errors and especially if you think the results are not correct.

gdickens commented 5 months ago

Hey @bluefoxr,

Apologies for not including a reproducible example. I was rushing out the door when adding this issue, but I know that's poor form on my part.

At the outset, I know using sum() as an aggregation method is unusual, but the index I'm working with does this.

The code below demonstrates the issue that I was facing and solved with the custom function.

In the example I basically produce aggregate scores by country using dplyr, which should be the equivalent of creating an index without weights and using addition as the aggregation procedure.

I then attempt to do the same thing with Aggregate() by specifying sum() as the custom function (f_ag ="sum") and adding the additional argument na.rm=TRUE via f_ag_para.

In my actual example all the weights are 1, so I didn't think it would matter, but when I tried adding w="none" to Aggregate() it returned an error "...f_ag_para must be a list". I tried a variety of formats based on the documentation, but couldn't get it to drop the weights.

If you look at the 'difference' column in the sum_totals_comparison object you'll see what I mean about the error being oddly consistent. I'm pretty sure what's happened is that the weights themselves have been added together with the indicator values at each level when sum() is applied by the Aggregate() function as the total is correct at level 1 (for most uCodes that is).

library(COINr)
library(tidyverse)

#build example coin
dta_example_coin<-build_example_coin()

#save the metadata as a separate object
dta_example_coin_meta_data<-dta_example_coin$Meta$Ind

#create long version of raw data and join with  meta data 
dta_example_coin_raw_data<-dta_example_coin$Data$Raw |>  
  pivot_longer(cols=-uCode, names_to="iCode") |> 
  left_join(dta_example_coin_meta_data)

#calculate the total of all indicators by country 
#(essentially using 'sum()' as the aggregation method at all levels)
sum_totals_index_totals_dplyr<-dta_example_coin_raw_data |> 
  group_by(uCode) |> 
  summarize(dplyr_value=sum(value, na.rm = TRUE))

#attempt to apply same approach using COINr's Aggregate() function: 

#Aggregate at each level use base R's sum():
rlt_example_coin<-Aggregate(dta_example_coin,
                            dset="Raw", 
                            f_ag  ="sum",
                            f_ag_para =list(na.rm=TRUE))

#save grand totals as long data set and join with metadata to make it possible
#to calculate totals at each level
sum_totals_index_totals_COINr<-rlt_example_coin$Data$Aggregated |>   
  pivot_longer(cols=-uCode, names_to="iCode") |> 
  left_join(dta_example_coin_meta_data)

#calculate grand totals at each level
sum_totals_index_totals_COINr<-sum_totals_index_totals_COINr |> 
  group_by(uCode,Level) |> 
  summarize(COINr_value=sum(value, na.rm = TRUE)) 

#join totals to allow comparisons 

sum_totals_comparison<-left_join(sum_totals_index_totals_COINr,
                                 sum_totals_index_totals_dplyr, 
                                 by="uCode") |> 
  mutate(difference= COINr_value-dplyr_value)
bluefoxr commented 5 months ago

Hi, thanks for this. A couple of things I discovered:

  1. Due to some recent changes I made, there was a bug in which passing w = "none" was throwing an error. I have now fixed this.
  2. The results you get from your aggregation are due to how sum() works. You are effectively passing sum(x,w) which indeed sums the weights together with x. Try sum(c(1,2,3),c(1,1,1)) for example. Since neither x or w are named arguments of sum() it treats them both as vectors to be summed all together.

So in sum()ary, there was a bug, but the results were correct. This command should now work fine and gives the correct results:

coin <- Aggregate(coin, dset = "Raw", f_ag = "sum",
                    f_ag_para = list(na.rm = TRUE), w = "none"
  )

Let me know if you agree with this. Thanks.

gdickens commented 5 months ago

Hey @bluefoxr,

Yeah that makes perfect sense. You'll see the custom sum function I created basically just throws out 'w' and applies the sum function because I figured that's what was happening.

Again, I doubt many people are going to be aggregating indicators like this, but I thought I'd flag it in case it had wider implications.

Your solution makes perfect sense. Thanks again for all your great work. It's a brilliant package.

bluefoxr commented 5 months ago

Ok glad it's sorted. I'll close the issue now.