Open aghaynes opened 4 years ago
Having said that, I tried to make a reprex and failed... I have no idea what's going on...
library(atable)
#> Warning: package 'atable' was built under R version 3.6.3
library(Hmisc)
#> Loading required package: lattice
#> Loading required package: survival
#> Warning: package 'survival' was built under R version 3.6.2
#> Loading required package: Formula
#> Loading required package: ggplot2
#>
#> Attaching package: 'Hmisc'
#> The following objects are masked from 'package:base':
#>
#> format.pval, units
data(mtcars)
mtcars$gear2 <- mtcars$gear3 <- mtcars$gear4 <- factor(mtcars$gear, 3:5, c("three", "four", "five"))
mtcars$ex1 <- factor(sample(c("No", "No2", "Yes"), 32, TRUE), c("No", "No2", "Yes"), c("No", "No2", "Yes"))
mtcars$ex2 <- factor(sample(c("No", "No2", "Yes"), 32, TRUE), c("No", "No2", "Yes"), c("No", "No2", "Yes"))
mtcars$ex3 <- factor(sample(c("No", "No2", "Yes"), 32, TRUE), c("No", "No2", "Yes"), c("No", "No2", "Yes"))
label(mtcars$ex2) <- "Foo"
label(mtcars$ex3) <- "Bar"
atable(mtcars, target_cols = c("gear", "gear2", "gear3", "gear4",
"ex1", "ex2", "ex3"),
blocks = list("block" = c("gear3", "gear4"),
Ex = c("ex1", "ex2", "ex3")),
group_col = "am", format_to = "console")
#> Warning in stats::ks.test(x, y, alternative = c("two.sided"), ...): cannot
#> compute exact p-value with ties
#> Warning in stats::chisq.test(group, value): Chi-squared approximation may
#> be incorrect
#> Warning in stats::chisq.test(group, value): Chi-squared approximation may
#> be incorrect
#> Warning in stats::chisq.test(group, value): Chi-squared approximation may
#> be incorrect
#> Warning in stats::chisq.test(group, value): Chi-squared approximation may
#> be incorrect
#> Warning in stats::chisq.test(group, value): Chi-squared approximation may
#> be incorrect
#> Warning in stats::chisq.test(group, value): Chi-squared approximation may
#> be incorrect
#> Group 0 1 p stat
#> 1 Observations
#> 2 19 13
#> 3 gear
#> 4 Mean (SD) 3.2 (0.42) 4.4 (0.51) <0.001 0.79
#> 5 valid (missing) 19 (0) 13 (0)
#> 6 gear2
#> 7 three 79% (15) 0% (0) <0.001 21
#> 8 four 21% (4) 62% (8)
#> 9 five 0% (0) 38% (5)
#> 10 missing 0% (0) 0% (0)
#> 11 block
#> 12 gear3
#> 13 three 79% (15) 0% (0) <0.001 21
#> 14 four 21% (4) 62% (8)
#> 15 five 0% (0) 38% (5)
#> 16 missing 0% (0) 0% (0)
#> 17 gear4
#> 18 three 79% (15) 0% (0) <0.001 21
#> 19 four 21% (4) 62% (8)
#> 20 five 0% (0) 38% (5)
#> 21 missing 0% (0) 0% (0)
#> 22 Ex
#> 23 ex1
#> 24 No 21% (4) 23% (3) 0.94 0.13
#> 25 No2 42% (8) 46% (6)
#> 26 Yes 37% (7) 31% (4)
#> 27 missing 0% (0) 0% (0)
#> 28 Foo
#> 29 No 68% (13) 46% (6) 0.31 2.3
#> 30 No2 21% (4) 23% (3)
#> 31 Yes 11% (2) 31% (4)
#> 32 missing 0% (0) 0% (0)
#> 33 Bar
#> 34 No 26% (5) 0% (0) 0.13 4.1
#> 35 No2 37% (7) 46% (6)
#> 36 Yes 37% (7) 54% (7)
#> 37 missing 0% (0) 0% (0)
#> Effect Size (CI)
#> 1
#> 2
#> 3
#> 4 -2.6 (-3.6; -1.6)
#> 5
#> 6
#> 7 0.81 (0.65; 0.91)
#> 8
#> 9
#> 10
#> 11
#> 12
#> 13 0.81 (0.65; 0.91)
#> 14
#> 15
#> 16
#> 17
#> 18 0.81 (0.65; 0.91)
#> 19
#> 20
#> 21
#> 22
#> 23
#> 24 0.063 (0; 0.4)
#> 25
#> 26
#> 27
#> 28
#> 29 0.27 (0; 0.57)
#> 30
#> 31
#> 32
#> 33
#> 34 0.36 (0.016; 0.63)
#> 35
#> 36
#> 37
Created on 2020-08-03 by the reprex package (v0.3.0)
The raw output (format_to = "raw") shows all levels in the appropriate order...
Sorry for bombarding with messages.... Stranger still, it's not really shuffling the options randomly... Using 2 calls to atable, one with, one without a group_col, the Group variable is the same... (left is with the group_col, right is without)
I am not sure what's happening there.
Some thoughts: your variables Dementia, Leukemia and Metastatic Cancer all have the same levels. (No, missing Yes, no previous data). Are the levels in the same order? Perhaps some clash happens internally in atable when a sort or merge is done.
When format_to = "raw", the stats and tests are calculated but not arranged in a table. No blocking is done in this case either. So the error happens afterwards.
Does the shuffle happen, when you remove the blocks?
Do you have another block in the atable call, or just one block called General Comorbidities?
I think I can create a random data.frame with the columns as factors as above. I will look into this.
I tried to reproduce the issue, but failed. Output as expected:
library(atable)
get_data = function(n)factor(sample(c(1,2,3), size=n, replace = TRUE), levels = c(1,2,3), labels = c("No previous Data", "Yes", "No"))
n=750
DD = data.frame(Dementia = get_data(n),
Metastatic_Cancer = get_data(n),
Leukemia = get_data(n))
DD$Metastatic_Cancer = relevel(DD$Metastatic_Cancer, ref = "Yes") # put this level first
atable::atable(DD, colnames(DD),
blocks=list("General Comorbidities" = colnames(DD)),
format_to = "Console")
#> Group value
#> 1 Observations
#> 2 750
#> 3 General Comorbidities
#> 4 Dementia
#> 5 No previous Data 35% (266)
#> 6 Yes 31% (236)
#> 7 No 33% (248)
#> 8 missing 0% (0)
#> 9 Metastatic Cancer
#> 10 Yes 32% (239)
#> 11 No previous Data 34% (254)
#> 12 No 34% (257)
#> 13 missing 0% (0)
#> 14 Leukemia
#> 15 No previous Data 32% (241)
#> 16 Yes 34% (257)
#> 17 No 34% (252)
#> 18 missing 0% (0)
Created on 2020-08-04 by the reprex package (v0.3.0)
Order of the factors is c("No previous Data", "Yes", "No") for Dementia and Leukemia. Metastatic Cancer has label Yes as first label and "No previous Data" as second.
Some thoughts: your variables Dementia, Leukemia and Metastatic Cancer all have the same levels. (No, missing Yes, no previous data). Are the levels in the same order? Perhaps some clash happens internally in atable when a sort or merge is done.
Yes, all the variables are created via an apply. I even tried setting ordered to TRUE in the factor call...
When format_to = "raw", the stats and tests are calculated but not arranged in a table. No blocking is done in this case either. So the error happens afterwards.
Agreed
Does the shuffle happen, when you remove the blocks?
No
Do you have another block in the atable call, or just one block called General Comorbidities?
I have a second block with Charlson comorbidities (yes, they overlap to a large degree, but for the time being I need both). If I include only the general comorbidities block, the dementia variable in the Charlson block is also shuffled (but only that one)...
I tried to reproduce the issue, but failed.
Me too. I have no idea what's going on here... maybe I'll have to step through the code line by line... if I find something useful, I'll let you know...
So as the shuffling only happens with blocks, the new code in function atable:::indent_data_frame_with_blocks is most likely to cause the shuffling.
As you wrote, the variables "Dementia (Charlson)" is also part of your atable call, and this variable has some levels in common with variables in the other blocks (the levels 'Yes' and 'No' are common with variable "Dementia").
Could you please state your full atable call and also all levels of the variables of the data.frame? Perhaps then I can create a test data.frame to reproduce the issue.
OK, so I could reproduce at least one unintended behaviour, see code below. The variables 'Dementia' and 'Dementia_2' both have the two levels "Yes" and "No", but in different order: 'Dementia' has 'No' before 'Yes', 'Dementia_2' has 'Yes' before 'No'. The atable call that analyses only the '_2' variables (the second call in the code below) shows 'Yes' before 'No', as expected. The call that analyses all six variables (the first call in the code below) shows 'No' before 'Yes' for all variables, even for 'Dementia_2'. So the order of the labels is changed. This is unexpected.
This is not an issue of blocking, but of the inverted order of labels of the two factor variables 'Dementia' and 'Dementia_2'. This has existed since the very first version of atable. atable writes all labels of the variables into one column, and then this column is sorted (by a plyr::ddply or a merge somewhere in the code). So there is only one order of the labels. The obvious fix is to get the data right: no duplicate labels or, when they are duplicated, the labels should be in the same order. But this may not always be possible: I am thinking of two Likert scales with the same labels where one scale is the reverse of the other. This can happen in questionnaires. But this does not reproduce the issue with p-values moved one row down (see post above from Aug 5, 2020).
library(atable)
get_data_1 = function(n)factor(sample(c(1,2,3), size=n, replace = TRUE), levels = c(1,2,3), labels = c("No previous Data", "No", "Yes"))
get_data_2 = function(n)factor(sample(c(1,2,3), size=n, replace = TRUE), levels = c(1,2,3), labels = c("asdf", "Yes", "No"))
get_data_3 = function(n)factor(sample(c(1,2), size=n, replace = TRUE), levels = c(1,2), labels = c("a","b"))
n=750
DD = data.frame(Dementia = get_data_1(n),
Metastatic_Cancer = get_data_1(n),
Leukemia = get_data_1(n),
Dementia_2 = get_data_2(n),
Metastatic_Cancer_2 = get_data_2(n),
Leukemia_2 = get_data_2(n),
group = get_data_3(n)
)
atable::atable(DD,
target_cols = c("Dementia", "Metastatic_Cancer", "Leukemia", "Dementia_2", "Metastatic_Cancer_2", "Leukemia_2"),
group_col="group",
blocks=list("General Comorbidities" = c("Dementia", "Metastatic_Cancer", "Leukemia"),
"Special Comorbidities" = c("Dementia_2", "Metastatic_Cancer_2", "Leukemia_2")),
format_to = "Console")
atable::atable(DD,
target_cols = c("Dementia_2", "Metastatic_Cancer_2", "Leukemia_2"),
group_col="group",
blocks=list("Special Comorbidities" = c("Dementia_2", "Metastatic_Cancer_2", "Leukemia_2")),
format_to = "Console")
This is the code
b1 <- atable(d,
target_cols = var_comp_info$variable_name,
group_col = "isdead_6m", indent_character = " ",
blocks = list("General comorbidities" = var_comp_info$variable_name[grepl("^com.+(factor|score)$", var_comp_info$variable_name)],
"Charlson comorbidities" = var_comp_info$variable_name[grepl("^cci.+(factor|score)$", var_comp_info$variable_name)]
)
)
where var_comp_info$variable_name is
[1] "age" "sex.factor"
[3] "bmi" "place_of_living.factor"
[5] "com_dementia.factor" "com_metastatic_cancer.factor"
[7] "com_leukemia_malign_cancer.factor" "com_lymphoma_myeloma.factor"
[9] "com_chron_pulmonary_dis.factor" "com_cor_artery_dis.factor"
[11] "com_cong_heart_failure.factor" "com_chron_liver_dis.factor"
[13] "com_chron_renal_dis.factor" "com_dm_w_endorg.factor"
[15] "com_periph_vasc_dis.factor" "Metastases.factor"
[17] "apache_respiration.factor" "apache_cardiovascular.factor"
[19] "apache_renal.factor" "apache_liver.factor"
[21] "apache_immunosystem.factor" "apache_nb.factor"
[23] "cci_ami.factor" "cci_chf.factor"
[25] "cci_pvd.factor" "cci_cevd.factor"
[27] "cci_dementia.factor" "cci_copd.factor"
[29] "cci_rheumd.factor" "cci_pud.factor"
[31] "cci_mld.factor" "cci_diab.factor"
[33] "cci_diabwc.factor" "cci_hp.factor"
[35] "cci_rend.factor" "cci_canc.factor"
[37] "cci_msld.factor" "cci_metacanc.factor"
[39] "cci_aids.factor" "cci_score"
[41] "cci_index.factor" "cci_wscore"
[43] "cci_windex.factor" "walking.factor"
[45] "indoor.factor" "stairs.factor"
[47] "dressing.factor" "transfer.factor"
[49] "Dailyliving.factor" "adl.factor"
[51] "hosp_adm_pre_6m.factor" "icu_adm_pre_6m.factor"
[53] "adm_type.factor" "icu_diagnosis.factor"
[55] "icu_adm_source.factor" "apache_II_wert_0"
[57] "treat_limit_during_icu_bin.factor" "patverf_at_icuadm.factor"
[59] "saps_II_wert_0" "dobutamin_0.factor"
[61] "noradr_0.factor" "adre_0.factor"
[63] "vasoactive_0.factor" "sofa_0"
[65] "s_crea_wert_0" "agba_lac_wert_0"
[67] "wbc_wert_0" "sofa_1_nfail_gt2"
The issue seems to be with the cci_* and com_* variables. The variables are constructed using sapply calls with the following functions (on different sets):
# com_*
function(x){
factor(x, -1:1, c("No previous data", "No", "Yes"), ordered = TRUE)
}
#cci_*
function(x){
factor(x, 0:1, c("No", "Yes"))
}
As there's a lot of variables, I'll just give the summary of the variables...
summary(d)
pid age sex.factor bmi place_of_living.factor
Min. :10001 Min. :18.00 Male :1189 Min. :11.05 [1] Long-term care facility: 164
1st Qu.:10448 1st Qu.:54.00 Female: 583 1st Qu.:22.45 [3] Home :1509
Median :10898 Median :66.00 Median :25.50 NA's : 99
Mean :10900 Mean :63.02 Mean :26.44
3rd Qu.:11355 3rd Qu.:74.00 3rd Qu.:29.55
Max. :11801 Max. :95.00 Max. :49.38
NA's :638
com_dementia.factor com_metastatic_cancer.factor com_leukemia_malign_cancer.factor com_lymphoma_myeloma.factor
Length:1772 Length:1772 Length:1772 Length:1772
Class1:labelled Class1:labelled Class1:labelled Class1:labelled
Class2:character Class2:character Class2:character Class2:character
Mode :character Mode :character Mode :character Mode :character
com_chron_pulmonary_dis.factor com_cor_artery_dis.factor com_cong_heart_failure.factor
Length:1772 Length:1772 Length:1772
Class1:labelled Class1:labelled Class1:labelled
Class2:character Class2:character Class2:character
Mode :character Mode :character Mode :character
com_chron_liver_dis.factor com_chron_renal_dis.factor com_dm_w_endorg.factor com_periph_vasc_dis.factor
Length:1772 Length:1772 Length:1772 Length:1772
Class1:labelled Class1:labelled Class1:labelled Class1:labelled
Class2:character Class2:character Class2:character Class2:character
Mode :character Mode :character Mode :character Mode :character
Metastases.factor apache_respiration.factor apache_cardiovascular.factor apache_renal.factor
[0] No :1637 [0] No :1499 [0] No :1551 [0] No :1593
[1] Yes: 51 [1] Yes: 96 [1] Yes: 44 [1] Yes: 32
NA's : 84 NA's : 177 NA's : 177 NA's : 147
apache_liver.factor apache_immunosystem.factor apache_nb.factor cci_ami.factor cci_chf.factor
[0] No :1477 [0] No :1452 0 :1114 Length:1772 Length:1772
[1] Yes: 162 [1] Yes: 171 1-2 : 378 Class1:labelled Class1:labelled
NA's : 133 NA's : 149 >2 : 11 Class2:character Class2:character
NA's: 269 Mode :character Mode :character
cci_pvd.factor cci_cevd.factor cci_dementia.factor cci_copd.factor cci_rheumd.factor
Length:1772 Length:1772 Length:1772 Length:1772 Length:1772
Class1:labelled Class1:labelled Class1:labelled Class1:labelled Class1:labelled
Class2:character Class2:character Class2:character Class2:character Class2:character
Mode :character Mode :character Mode :character Mode :character Mode :character
cci_pud.factor cci_mld.factor cci_diab.factor cci_diabwc.factor cci_hp.factor
Length:1772 Length:1772 Length:1772 Length:1772 Length:1772
Class1:labelled Class1:labelled Class1:labelled Class1:labelled Class1:labelled
Class2:character Class2:character Class2:character Class2:character Class2:character
Mode :character Mode :character Mode :character Mode :character Mode :character
cci_rend.factor cci_canc.factor cci_msld.factor cci_metacanc.factor cci_aids.factor
Length:1772 Length:1772 Length:1772 Length:1772 Length:1772
Class1:labelled Class1:labelled Class1:labelled Class1:labelled Class1:labelled
Class2:character Class2:character Class2:character Class2:character Class2:character
Mode :character Mode :character Mode :character Mode :character Mode :character
cci_score cci_index.factor cci_wscore cci_windex.factor walking.factor
Min. :0.000 0 :333 Min. : 0.000 0 :333 [1] Independent :1519
1st Qu.:1.000 1-2:853 1st Qu.: 1.000 1-2:563 [2] Partially dependent: 67
Median :2.000 3-4:474 Median : 2.000 3-4:496 [3] Fully dependent : 58
Mean :1.983 >=5:112 Mean : 2.901 >=5:380 NA's : 128
3rd Qu.:3.000 3rd Qu.: 4.000
Max. :8.000 Max. :14.000
indoor.factor stairs.factor dressing.factor
[1] Independent :1551 [1] Independent :1516 [1] Independent :1563
[2] Partially dependent: 64 [2] Partially dependent: 66 [2] Partially dependent: 66
[3] Fully dependent : 37 [3] Fully dependent : 60 [3] Fully dependent : 19
NA's : 120 NA's : 130 NA's : 124
transfer.factor Dailyliving.factor adl.factor
[1] Independent :1566 [1] Independent :1449 [1] Independent :1403
[2] Partially dependent: 62 [2] Partially dependent: 210 [2] Partially dependent: 210
[3] Fully dependent : 22 [3] Fully dependent : 29 [3] Fully dependent : 75
NA's : 122 NA's : 84 NA's : 84
hosp_adm_pre_6m.factor icu_adm_pre_6m.factor adm_type.factor
[0] No admission :1336 [0] No admission :1535 Medical elective : 205
[1] 1-2 admissions: 350 [1] 1-2 admissions: 220 Medical emergency :1017
[2] >2 admissions : 86 [2] >2 admissions : 17 Surgical elective : 222
Surgical emergency: 328
icu_diagnosis.factor icu_adm_source.factor
[9] Respiratory failure :266 [1] Ward :304
[4] Emergency sugery :233 [2] ICU or IMC :166
[8] Other :209 [3] ED :605
[7] Non-traumatic cerebral pathology:183 [4] OR, recovery room or procedure suite:601
[2] Cardiovascular disease :176 [5] Other (incl. external hospital) : 96
[1] Cardiac arrest :174
(Other) :531
apache_II_wert_0 treat_limit_during_icu_bin.factor patverf_at_icuadm.factor saps_II_wert_0
Min. : 0.00 No :1313 [0] No :1700 Min. : 8.00
1st Qu.:21.00 Yes: 459 [1] Yes: 72 1st Qu.: 48.00
Median :27.00 Median : 62.00
Mean :26.87 Mean : 61.09
3rd Qu.:33.00 3rd Qu.: 75.00
Max. :54.00 Max. :120.00
NA's :313 NA's :213
dobutamin_0.factor noradr_0.factor adre_0.factor vasoactive_0.factor sofa_0 s_crea_wert_0
[0] No :1519 [0] No :773 [0] No :1424 [0] No : 670 Min. : 0.000 Min. : 21.0
[1] Yes: 253 [1] Yes:999 [1] Yes: 348 [1] Yes:1102 1st Qu.: 6.000 1st Qu.: 73.0
Median : 8.000 Median : 104.0
Mean : 8.523 Mean : 137.4
3rd Qu.:11.000 3rd Qu.: 163.0
Max. :20.000 Max. :1247.0
NA's :656 NA's :360
agba_lac_wert_0 wbc_wert_0 sofa_1_nfail_gt2 isdead_6m
Min. : 0.200 Min. : 0.01 Mode :logical No :1174
1st Qu.: 1.100 1st Qu.: 7.86 FALSE:1210 Yes: 598
Median : 1.900 Median : 11.40 TRUE :562
Mean : 3.065 Mean : 12.80
3rd Qu.: 3.900 3rd Qu.: 15.80
Max. :23.000 Max. :109.00
NA's :115 NA's :279
It looks like the com and cci variables are not proper factors though, based on the summary... maybe that's got something to do with it...
If I restrict the atable call to the com* and cci* variables, everything is fine: all options are in the correct order, even with the blocks... so there must be some interaction with another variable... although they all seem to be coded No/Yes....
Thanks for your time! These are A LOT OF variables!
Just a quick guess: I am not sure how to create the class 'labelled' of e.g. variable 'com_metastatic_cancer.factor'. I guess the class is from package 'labelled' created with function 'to_labelled' and you are reading a SPSS or SAS file with package haven or foreign.
See the help ?haven::labelled. Quote: "This class (labelled) provides few methods, as I expect you'll coerce to a standard R class (e.g. a factor()) soon after importing. Unfortunately it's not possible to make as.factor work for labelled objects so instead use as_factor. This works for all types of labelled vectors."
So perhaps use haven::as_factor() instead of factor() of the base package on your data:
d <- haven::as_factor(d)
Then summary(d) should show proper factors. And atable should hopefully act as expected.
Internally atable calls statistics() on every target_col. statistics() has no method for class labelled, but it has one for class character. And statistics.character() also calls factor().
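To illustrate why a character column matters here, a minimal base-R sketch. It assumes the column reaches factor() as plain character without explicit levels; the variable names are made up for illustration and this is not atable's actual code.

```r
# Intended level order, as in the com_* factors
intended <- c("No previous data", "No", "Yes")

# A proper factor keeps the intended order
x_factor <- factor(c("No", "Yes", "No previous data"), levels = intended)
levels(x_factor)

# If the column is character instead, factor() without `levels`
# falls back to the sorted unique values, so the intended order is lost
x_char <- as.character(x_factor)
x_refactored <- factor(x_char)
levels(x_refactored)
```

So a column that is silently character rather than factor can show up with its levels in a different order than the one set originally.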
One more thought:
There could be duplicated aliases.
Run this code on your data.frame d with all variables:
Alias_mapping = atable::create_alias_mapping(d)
b = duplicated(Alias_mapping$new, fromLast = TRUE) | duplicated(Alias_mapping$new, fromLast = FALSE)
b should be all FALSE.
Show duplicated aliases:
Alias_mapping[b, ]
This must be empty! Or else atable will mix up the variables in the output. This is a possible explanation of the shuffling.
Currently atable does not check this kind of name clash. I think, I will add this check in the next version of atable.
I am not sure how to create the class 'labelled' of e.g. variable 'com_metastatic_cancer.factor'. I guess the class is from package 'labelled' created with function 'to_labelled' and you are reading a SPSS or SAS file with package haven or foreign.
Nope, an xlsx.
comorb1 <- readxl::read_xlsx(file.path(or, "comorbidities for Alan.xlsx"), sheet = 2)
# at this point its a set of numeric variables
comorb1a <- as.data.frame(sapply(comorb1[, 4:ncol(comorb1)], function(x){
factor(x, -1:1, c("No previous data", "No", "Yes"), ordered = TRUE)
}))
names(comorb1a) <- paste0(names(comorb1a), ".factor")
comorb1 <- cbind(comorb1, comorb1a)
But, I see now that the sapply doesn't return factors, but characters... (as opposed to a small test that did, where it returned factors)
> str(comorb1a)
'data.frame': 2010 obs. of 11 variables:
$ dementia : chr "No previous data" "No" "No" "No" ...
$ metastatic_cancer : chr "No previous data" "No" "Yes" "No" ...
$ leukemia_malign_cancer: chr "No previous data" "No" "Yes" "No" ...
$ lymphoma_myeloma : chr "No previous data" "No" "No" "No" ...
$ chron_pulmonary_dis : chr "No previous data" "No" "No" "No" ...
$ cor_artery_dis : chr "No previous data" "No" "No" "No" ...
$ cong_heart_failure : chr "No previous data" "No" "No" "No" ...
$ chron_liver_dis : chr "No previous data" "No" "No" "No" ...
$ chron_renal_dis : chr "No previous data" "No" "No" "No" ...
$ dm_w_endorg : chr "No previous data" "No" "No" "No" ...
$ periph_vasc_dis : chr "No previous data" "No" "No" "No" ...
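The sapply behaviour can be reproduced with a small base-R sketch. The input data are made up to stand in for the xlsx columns, and to_com_factor is just a name for the function shown above. When every result has the same length greater than one, sapply() simplifies the list of factors to a character matrix, whereas lapply() keeps each column as a factor.

```r
# Made-up numeric input standing in for the xlsx columns
raw <- data.frame(dementia = c(-1, 0, 1, 0),
                  leukemia = c(0, 0, 1, -1))

to_com_factor <- function(x) {
  factor(x, levels = -1:1,
         labels = c("No previous data", "No", "Yes"), ordered = TRUE)
}

# sapply() simplifies to a matrix, which turns the factors into character
via_sapply <- sapply(raw, to_com_factor)
is.character(via_sapply)

# lapply() returns a list of factors; as.data.frame() keeps them as factors
via_lapply <- as.data.frame(lapply(raw, to_com_factor))
str(via_lapply)
```

With a single-row input sapply() returns a factor, which would explain why a small test behaved differently.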
The labelled part comes from Hmisc::label, which you implemented as an alias, no?
Show duplicated aliases: Alias_mapping[b, ] This must be empty! Or else atable will mix up the variables in the output. This is a possible explanation of the shuffling.
This is indeed empty... 4 variables, no observations.
So this gets even more confusing.
Yes, I added support for class labelled of the Hmisc package. The call of atable::create_alias_mapping(d) should return the aliases that you defined.
Let's do a real Minimal Working Example: as you cannot share your data, we need some other way to reproduce the shuffling.
atable does not need the full data.frame, it just needs the classes of the columns.
So when d is the data.frame, that produces the shuffling in atable, you can create an empty data.frame with the same column classes by
e <- d[FALSE, ]
When you now call atable on e, you should get the same shuffling as for d. This should work when all columns are factors. It does not work with character columns. Also the aliases should be preserved in e.
Can you save() and send me this data.frame e (via mail, or perhaps via GitHub here)?
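A minimal sketch of what the zero-row subset keeps, using a made-up data.frame in place of the real d (column names and values are only illustrative):

```r
# Made-up stand-in for the real data
d <- data.frame(
  com_dementia = factor(c("No", "Yes", "No previous data"),
                        levels = c("No previous data", "No", "Yes"),
                        ordered = TRUE),
  cci_dementia = factor(c("Yes", "No", "No"), levels = c("No", "Yes")),
  age = c(54, 66, 74)
)

# Zero rows, but column classes and factor levels are retained
e <- d[FALSE, ]
nrow(e)
lapply(e, class)
lapply(e, levels)

# save(e, file = "e.RData")  # file name is only an example
```

No patient data leaves the machine this way, only the column structure.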
I came to the same conclusion on Monday and I sent you an R script via email (web.de account). Ca 11:30am. 😃
Got the file and could reproduce the shuffle. Let the search begin!
great! good luck! 😄
I was able to fix something: The row containing p-values and test statistics is now the first of every variable. This fixes the shuffling with 'Dementia (Charlson)' in post: https://github.com/arminstroebel/atable/issues/7#issuecomment-669034120
I will upload the fix to CRAN as atable version 0.1.8. This version contains other fixes as well. The shuffling with 'Dementia (Charlson)' occurs with atable version 0.1.7, but not with 0.1.8.
What I was not able to fix:
The order of the levels of the first target_col overwrites the order of the other target cols in the atable-output, when the target_cols share some labels. This happens with and without blocking. And also with and without group_col.
Below is an example to demonstrate this kind of shuffling. The variable f1 has labels A, B, C, D. The variable f2 has labels D, C, B, A, so in reversed order. f3 has some labels in common with f1, but not all.
Calling atable with f1 as first target_cols will order the labels as A, B, C, D, as f1 is the first target_col. Calling atable with f2 as first target_cols will order the labels as D, C, B, A, as f2 is the first target_col.
Above is another example: https://github.com/arminstroebel/atable/issues/7#issuecomment-670446940
Internally atable stores all labels of all target cols in one column of a data.frame. This column is a factor and the labels are c()-ed together. Then a sort of this column happens implicitly by plyr::ddply or explicitly by doBy::orderBy() or merge(). I presume changing this needs a bigger rewrite of the package.
Example:
# create factors with colliding labels
library(atable)
atable_options(format_to="Console")
# n is now a proper argument instead of being read from the global environment
get_data = function(n, labels)factor(sample(seq_along(labels), size=n, replace = TRUE), levels = seq_along(labels), labels = labels)
n=42
DD = data.frame(f1 = get_data(n, c("A", "B", "C", "D")),
f2 = get_data(n, c("D", "C", "B", "A")),
f3 = get_data(n, c("F", "B", "C", "G")),
group = get_data(n, c("a","b"))
)
# order is A < B < C < D for all variables f1, f2 and f3:
atable::atable(DD,
target_cols = c("f1", "f2", "f3"),
group_col="group")
# order is D < C < B < A:
atable::atable(DD,
target_cols = c("f2", "f1", "f3"),
group_col="group")
# order follows f3 (F < B < C < G):
atable::atable(DD,
target_cols = c("f3", "f1", "f2"),
group_col="group")
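The mechanism described above can be imitated in base R. This is only a sketch under the assumption that all labels end up in one shared factor column; it does not use atable's internal code, and the names long, sorted and f2_order are made up.

```r
levels_f1 <- c("A", "B", "C", "D")
levels_f2 <- c("D", "C", "B", "A")

# One row per (variable, label), with all labels in a single column
long <- data.frame(variable = rep(c("f1", "f2"), each = 4),
                   tag = c(levels_f1, levels_f2),
                   stringsAsFactors = FALSE)

# A single factor can only hold one level order: the first variable wins
long$tag <- factor(long$tag, levels = unique(long$tag))

# Any sort on that column now arranges f2 like f1
sorted <- long[order(long$variable, long$tag), ]
f2_order <- as.character(sorted$tag[sorted$variable == "f2"])
f2_order
```

The labels of f2 come out in the order of f1, which matches the behaviour seen in the atable output.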
At least part of the puzzle is fixed and the root cause seems to have been found... thanks!
Internally atable stores all labels of all target cols in one column of a data.frame. This column is a factor and the labels are c()-ed together. Then a sort of this column happens implicitly by plyr::ddply or explicitly by doBy::orderBy() or merge()
I think I mentioned before at some point, maybe it's safer/easier to keep a 2 variable dataframe... or include the variable in the factor level and parse it away later
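A rough base-R sketch of the second idea, prefixing each label with its variable name and stripping the prefix after sorting. The separator "::" and all object names are assumptions for illustration, and nothing here is taken from atable itself.

```r
vars <- list(
  f1 = factor(character(0), levels = c("A", "B", "C", "D")),
  f2 = factor(character(0), levels = c("D", "C", "B", "A"))
)

# One row per (variable, label), in each variable's own level order
level_table <- do.call(rbind, lapply(names(vars), function(v) {
  data.frame(variable = v, label = levels(vars[[v]]),
             stringsAsFactors = FALSE)
}))

# The key is unique per variable, so one factor can hold every order
level_table$key <- paste(level_table$variable, level_table$label, sep = "::")
level_table$key <- factor(level_table$key, levels = level_table$key)

# Simulate an internal reordering, then sort by the key to restore it
shuffled <- level_table[rev(seq_len(nrow(level_table))), ]
restored <- shuffled[order(shuffled$key), ]

# Parse the variable prefix away again for display
restored$label_out <- sub("^[^:]*::", "", as.character(restored$key))
restored$label_out
```

This would break if a variable name itself contained the separator, so a real implementation would need a safer delimiter or the two-column approach.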
When factor variables are used in a block, it looks like the levels are (randomly) shuffled...
Here I have three factors all coded identically, but the options come out in different orders... and the test statistics are also not in the first row (although they are always with what should be the first row, which might give a hint to when the shuffling occurs...)
I cannot share the data unfortunately.
Any idea?