dcousin3 / superb

Summary plots with adjusted error bars
https://dcousin3.github.io/superb

factor levels are shown incorrectly in the outputs #4

Closed achetverikov closed 1 year ago

achetverikov commented 1 year ago

Factor levels in the outputs are not sorted properly.

library(superb)
library(data.table)

set.seed(256)
data <- data.table(expand.grid(A = c('E', 'F'), B = c('E', 'F'), subj = 1:10, trial = 1:20))
data[,A:=factor(A)]
data[,B:=factor(B)]

data[,y:=rnorm(.N)]

wide_data <- dcast(data, subj~A+B, value.var = 'y', fun.aggregate = mean)

superbData(wide_data, WSFactors = c('A(2)','B(2)'), variables = colnames(wide_data[,2:5]))$summaryStatistics

colMeans(wide_data[,2:5])

Gives:

> superbData(wide_data, WSFactors = c('A(2)','B(2)'), variables = colnames(wide_data[,2:5]))$summaryStatistics
  A B      center lowerwidth upperwidth
1 1 1 -0.13316853 -0.1826856  0.1826856
2 1 2  0.01258029 -0.1065945  0.1065945
3 2 1  0.06947037 -0.1426781  0.1426781
4 2 2  0.08321091 -0.1844031  0.1844031
> 
> colMeans(wide_data[,2:5])
        E_E         E_F         F_E         F_F 
-0.13316853  0.06947037  0.01258029  0.08321091 

So A == 1 & B == 2 now corresponds to F_E instead of E_F.

A related problem is that the factor labels are not preserved.

dcousin3 commented 1 year ago

Hello,

To superb, the input data frame is wide_data, which has four measurement columns in addition to subj:

    subj         E_E         E_F         F_E         F_F
 1:    1 -0.44449237 -0.03235123  0.33117911  0.24256811
 2:    2 -0.16501937  0.06406321 -0.15657197  0.20675888
 3:    3 -0.23265304 -0.01147290  0.07758697  0.31571311
 4:    4 -0.52565753  0.17656301 -0.06597765  0.57736573
 5:    5  0.19321119  0.10167027 -0.14578568 -0.01145581
 6:    6 -0.14903764  0.40485607  0.08435495 -0.23097020
 7:    7  0.07729921  0.22796934 -0.03298668 -0.09575244
 8:    8 -0.07237976  0.13240136  0.05019913 -0.01337560
 9:    9  0.25639642 -0.01419813 -0.11931754 -0.24855318
10:   10 -0.26935239 -0.35479730  0.10312231  0.08981053 

The columns are labelled so that the first index (factor A) changes less frequently than the second index (factor B): E_E, E_F, then F_E, F_F. Hence, the correct instruction requires placing factor B first:

superbData(wide_data, WSFactors = c('B(2)','A(2)'),   # inverted A and B here
     variables = colnames(wide_data[,2:5]))$summaryStatistics

Because in a within-subject design the column names carry no information about how to interpret the levels, superb prints a message (FYI) so that you can check that they are interpreted as you intended:

superb::FYI: Here is how the within-subject variables are understood:
 B A variable
 1 1      E_E
 2 1      E_F
 1 2      F_E
 2 2      F_F

In your example, the confusion comes from the fact that the dcast function cycles through all the levels of the second factor first. This choice is arbitrary, and other reformatting functions adopt the other ordering (e.g., lsr, Navarro), which is the convention adopted in superb. From the wide-format data structure alone, it is not possible to know how it was generated. Hence, any information held in the long-format data (factor order, level labels) is not available to superb.
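To make the two orderings concrete, here is a minimal base-R sketch. It does not call superb or dcast; it only rebuilds the column names each convention would produce for the two factors in this thread. The variable names (lv, first_fastest, last_fastest) are illustrative and not part of any package.

```r
# Level labels used in the example above
lv <- c("E", "F")

# Convention described for superb: the first-listed factor (A) cycles fastest.
# expand.grid() also varies its first argument fastest.
g1 <- expand.grid(A = lv, B = lv, stringsAsFactors = FALSE)
first_fastest <- paste(g1$A, g1$B, sep = "_")

# Ordering seen in wide_data after dcast(subj ~ A + B):
# the last factor (B) cycles fastest.
g2 <- expand.grid(B = lv, A = lv, stringsAsFactors = FALSE)
last_fastest <- paste(g2$A, g2$B, sep = "_")

print(first_fastest)
print(last_fastest)
```

If the FYI mapping above is read correctly, passing the columns in the first_fastest order as variables should be another way to keep WSFactors = c('A(2)','B(2)'); checking the FYI message remains the safest confirmation.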

Alternatively, you can specify manually how to interpret all the columns with WSDesign:

superbData(wide_data, WSFactors = c('A(2)','B(2)'), 
    variables = colnames(wide_data[,2:5]),
    WSDesign = list( a1b1=c(1,1), a1b2=c(1,2), a2b1=c(2,1), a2b2=c(2,2) )
    )$summaryStatistics

to which the FYI message is

superb::FYI: Here is how the within-subject variables are understood:
 A B variable
 1 1      E_E
 1 2      E_F
 2 1      F_E
 2 2      F_F

The four names in the WSDesign list (e.g., a1b1) are arbitrary, but their order matches the columns of wide_data.
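For designs with more cells, typing the WSDesign list by hand can be error-prone. The following base-R sketch derives the same list from the column names, assuming they follow the "levelA_levelB" pattern produced by dcast and that the levels sort alphabetically. The helper names (cols, parts, lvA, lvB, ws_design) are hypothetical, and the block does not call superb itself.

```r
# Column names of the four measurement columns, in their order in wide_data
cols <- c("E_E", "E_F", "F_E", "F_F")

# Split each name into its A part and its B part
parts <- strsplit(cols, "_", fixed = TRUE)

# Assumed level order for each factor: sorted unique labels
lvA <- sort(unique(vapply(parts, `[`, character(1), 1)))
lvB <- sort(unique(vapply(parts, `[`, character(1), 2)))

# One c(levelA, levelB) index pair per column, in column order
ws_design <- lapply(parts, function(p) {
  as.numeric(c(match(p[1], lvA), match(p[2], lvB)))
})
names(ws_design) <- cols  # names are arbitrary; only the order matters

str(ws_design)
```

The resulting list has the same index pairs as the hand-written one above and could be passed as WSDesign = ws_design.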

"A related problem is that the factor labels are not preserved." So is the case with wide_data and colMeans(wide_data[,2:5]).

dcousin3 commented 1 year ago

Hope it helps!

achetverikov commented 1 year ago

OK, thanks! This is indeed helpful.