arminstroebel / atable

R-Package for Creating Tables for Clinical Trial Reports

blocks with factors shuffles the levels #7

Open aghaynes opened 4 years ago

aghaynes commented 4 years ago

When factor variables are used in a block, it looks like the levels are (randomly) shuffled...

Here I have three factors all coded identically, but the options come out in different orders... and the test statistics are also not in the first row (although they are always with what should be the first row, which might give a hint to when the shuffling occurs...)

image

I cannot share the data unfortunately.

Any idea?

aghaynes commented 4 years ago

Having said that, I tried to make a reprex and failed... I have no idea what's going on...


library(atable)
#> Warning: package 'atable' was built under R version 3.6.3
library(Hmisc)
#> Loading required package: lattice
#> Loading required package: survival
#> Warning: package 'survival' was built under R version 3.6.2
#> Loading required package: Formula
#> Loading required package: ggplot2
#> 
#> Attaching package: 'Hmisc'
#> The following objects are masked from 'package:base':
#> 
#>     format.pval, units
data(mtcars)
mtcars$gear2 <- mtcars$gear3 <- mtcars$gear4 <- factor(mtcars$gear, 3:5, c("three", "four", "five"))

mtcars$ex1 <- factor(sample(c("No", "No2", "Yes"), 32, TRUE), c("No", "No2", "Yes"), c("No", "No2", "Yes"))

mtcars$ex2 <- factor(sample(c("No", "No2", "Yes"), 32, TRUE), c("No", "No2", "Yes"), c("No", "No2", "Yes"))

mtcars$ex3 <- factor(sample(c("No", "No2", "Yes"), 32, TRUE), c("No", "No2", "Yes"), c("No", "No2", "Yes"))
label(mtcars$ex2) <- "Foo"
label(mtcars$ex3) <- "Bar"

atable(mtcars, target_cols = c("gear", "gear2", "gear3", "gear4",
                               "ex1", "ex2", "ex3"),
       blocks = list("block" = c("gear3", "gear4"),
                     Ex = c("ex1", "ex2", "ex3")),
       group_col = "am", format_to = "console")
#> Warning in stats::ks.test(x, y, alternative = c("two.sided"), ...): cannot
#> compute exact p-value with ties
#> Warning in stats::chisq.test(group, value): Chi-squared approximation may
#> be incorrect

#> Warning in stats::chisq.test(group, value): Chi-squared approximation may
#> be incorrect

#> Warning in stats::chisq.test(group, value): Chi-squared approximation may
#> be incorrect

#> Warning in stats::chisq.test(group, value): Chi-squared approximation may
#> be incorrect

#> Warning in stats::chisq.test(group, value): Chi-squared approximation may
#> be incorrect

#> Warning in stats::chisq.test(group, value): Chi-squared approximation may
#> be incorrect
#>    Group                 0          1          p      stat
#> 1   Observations                                          
#> 2                        19         13                    
#> 3   gear                                                  
#> 4        Mean (SD)       3.2 (0.42) 4.4 (0.51) <0.001 0.79
#> 5        valid (missing) 19 (0)     13 (0)                
#> 6   gear2                                                 
#> 7        three           79% (15)   0% (0)     <0.001 21  
#> 8        four            21% (4)    62% (8)               
#> 9        five            0% (0)     38% (5)               
#> 10       missing         0% (0)     0% (0)                
#> 11 block                                                  
#> 12      gear3                                             
#> 13           three       79% (15)   0% (0)     <0.001 21  
#> 14           four        21% (4)    62% (8)               
#> 15           five        0% (0)     38% (5)               
#> 16           missing     0% (0)     0% (0)                
#> 17      gear4                                             
#> 18           three       79% (15)   0% (0)     <0.001 21  
#> 19           four        21% (4)    62% (8)               
#> 20           five        0% (0)     38% (5)               
#> 21           missing     0% (0)     0% (0)                
#> 22 Ex                                                     
#> 23      ex1                                               
#> 24           No          21% (4)    23% (3)    0.94   0.13
#> 25           No2         42% (8)    46% (6)               
#> 26           Yes         37% (7)    31% (4)               
#> 27           missing     0% (0)     0% (0)                
#> 28      Foo                                               
#> 29           No          68% (13)   46% (6)    0.31   2.3 
#> 30           No2         21% (4)    23% (3)               
#> 31           Yes         11% (2)    31% (4)               
#> 32           missing     0% (0)     0% (0)                
#> 33      Bar                                               
#> 34           No          26% (5)    0% (0)     0.13   4.1 
#> 35           No2         37% (7)    46% (6)               
#> 36           Yes         37% (7)    54% (7)               
#> 37           missing     0% (0)     0% (0)                
#>    Effect Size (CI)  
#> 1                    
#> 2                    
#> 3                    
#> 4  -2.6 (-3.6; -1.6) 
#> 5                    
#> 6                    
#> 7  0.81 (0.65; 0.91) 
#> 8                    
#> 9                    
#> 10                   
#> 11                   
#> 12                   
#> 13 0.81 (0.65; 0.91) 
#> 14                   
#> 15                   
#> 16                   
#> 17                   
#> 18 0.81 (0.65; 0.91) 
#> 19                   
#> 20                   
#> 21                   
#> 22                   
#> 23                   
#> 24 0.063 (0; 0.4)    
#> 25                   
#> 26                   
#> 27                   
#> 28                   
#> 29 0.27 (0; 0.57)    
#> 30                   
#> 31                   
#> 32                   
#> 33                   
#> 34 0.36 (0.016; 0.63)
#> 35                   
#> 36                   
#> 37

Created on 2020-08-03 by the reprex package (v0.3.0)

aghaynes commented 4 years ago

The raw output (format_to = "raw") shows all levels in the appropriate order...

image

aghaynes commented 4 years ago

Sorry for bombarding with messages.... Stranger still, it's not really shuffling the options randomly... Using 2 calls to atable, one with, one without a group_col, the Group variable is the same... (left is with the group_col, right is without)

image

arminstroebel commented 4 years ago

I am not sure what's happening there.

Some thoughts: your variables Dementia, Leukemia and Metastatic Cancer all have the same levels. (No, missing Yes, no previous data). Are the levels in the same order? Perhaps some clash happens internally in atable when a sort or merge is done.

With format_to = "raw", the statistics and tests are calculated but not arranged in a table. No blocking is done in this case either, so the error must happen afterwards.

Does the shuffle happen, when you remove the blocks?

Do you have another block in the atable call, or just one block called General Comorbidities?

I think I can create a random data.frame with the columns as factors as above. I will look into this.

arminstroebel commented 4 years ago

I tried to reproduce the issue, but failed. Output as expected:

library(atable)
get_data = function(n)factor(sample(c(1,2,3), size=n, replace = TRUE), levels = c(1,2,3), labels = c("No previous Data", "Yes", "No"))

n=750

DD = data.frame(Dementia = get_data(n),
                Metastatic_Cancer = get_data(n),
                Leukemia = get_data(n))

DD$Metastatic_Cancer = relevel(DD$Metastatic_Cancer, ref = "Yes") # put this level first

atable::atable(DD, colnames(DD),
               blocks=list("General Comorbidities" = colnames(DD)),
               format_to = "Console")
#>    Group                       value    
#> 1   Observations                        
#> 2                              750      
#> 3  General Comorbidities                
#> 4       Dementia                        
#> 5            No previous Data  35% (266)
#> 6            Yes               31% (236)
#> 7            No                33% (248)
#> 8            missing           0% (0)   
#> 9       Metastatic Cancer               
#> 10           Yes               32% (239)
#> 11           No previous Data  34% (254)
#> 12           No                34% (257)
#> 13           missing           0% (0)   
#> 14      Leukemia                        
#> 15           No previous Data  32% (241)
#> 16           Yes               34% (257)
#> 17           No                34% (252)
#> 18           missing           0% (0)

Created on 2020-08-04 by the reprex package (v0.3.0)

Order of the factors is c("No previous Data", "Yes", "No") for Dementia and Leukemia. Metastatic Cancer has label Yes as first label and "No previous Data" as second.

aghaynes commented 4 years ago

Some thoughts: your variables Dementia, Leukemia and Metastatic Cancer all have the same levels. (No, missing Yes, no previous data). Are the levels in the same order? Perhaps some clash happens internally in atable when a sort or merge is done.

Yes, all the variables are created via an apply. I even tried setting ordered to TRUE in the factor call...

With format_to = "raw", the statistics and tests are calculated but not arranged in a table. No blocking is done in this case either, so the error must happen afterwards.

Agreed

Does the shuffle happen, when you remove the blocks?

No

Do you have another block in the atable call, or just one block called General Comorbidities?

I have a second block with Charlson comorbidities (yes, they overlap to a large degree, but for the time being I need both). If I include only the general comorbidities block, the dementia variable in the Charlson block is also shuffled (but only that one)... image

I tried to reproduce the issue, but failed.

Me too. I have no idea what's going on here... maybe I'll have to step through the code line by line... if I find something useful, I'll let you know...

arminstroebel commented 4 years ago

So as the shuffling only happens with blocks, the new code in function atable:::indent_data_frame_with_blocks is most likely to cause the shuffling.

As you wrote, the variables "Dementia (Charlson)" is also part of your atable call, and this variable has some levels in common with variables in the other blocks (the levels 'Yes' and 'No' are common with variable "Dementia").

Could you please state your full atable call and also all levels of the variables of the data.frame? Perhaps then I can create a test data.frame to reproduce the issue.

arminstroebel commented 4 years ago

OK, so I could reproduce at least one unexpected/unintended feature/bug; see the code below. The variables 'Dementia' and 'Dementia_2' both have the two levels "Yes" and "No", but in different order: 'Dementia' has 'No' before 'Yes', 'Dementia_2' has 'Yes' before 'No'. The atable call that analyses only the '_2' variables has 'Yes' before 'No', as expected. The atable call that analyses all six variables has 'No' before 'Yes' for all variables, even for 'Dementia_2'. So the order of the labels is changed. This is unexpected.

This is not an issue of blocking, but of the inverted order of labels of the two factor variables 'Dementia' and 'Dementia_2'. This has existed since the very first version of atable. atable writes all labels of all variables in one column, and then this column is sorted (by a plyr::ddply or a merge somewhere in the code). So there is only one order of the labels. The obvious fix is to get the data right: no duplicate labels, or, when they are duplicated, the labels should be in the same order. But this may not always be possible: I am thinking of two Likert scales with the same labels where one scale is the reverse of the other. This can happen in questionnaires. However, this does not reproduce the issue with p-values moved one row down (see post above from Aug 5, 2020).

library(atable)
get_data_1 = function(n)factor(sample(c(1,2,3), size=n, replace = TRUE), levels = c(1,2,3), labels = c("No previous Data", "No", "Yes"))
get_data_2 = function(n)factor(sample(c(1,2,3), size=n, replace = TRUE), levels = c(1,2,3), labels = c("asdf", "Yes",  "No"))

get_data_3 = function(n)factor(sample(c(1,2), size=n, replace = TRUE), levels = c(1,2), labels = c("a","b"))

n=750

DD = data.frame(Dementia = get_data_1(n),
                Metastatic_Cancer = get_data_1(n),
                Leukemia = get_data_1(n),
                Dementia_2 = get_data_2(n),
                Metastatic_Cancer_2 = get_data_2(n),
                Leukemia_2 = get_data_2(n),
                group = get_data_3(n)
                )

atable::atable(DD,
               target_cols = c("Dementia", "Metastatic_Cancer", "Leukemia", "Dementia_2", "Metastatic_Cancer_2", "Leukemia_2"),
               group_col="group",
               blocks=list("General Comorbidities" = c("Dementia", "Metastatic_Cancer", "Leukemia"),
                           "Special Comorbidities" = c("Dementia_2", "Metastatic_Cancer_2", "Leukemia_2")),
               format_to = "Console")

atable::atable(DD,
               target_cols = c("Dementia_2", "Metastatic_Cancer_2", "Leukemia_2"),
               group_col="group",
               blocks=list("Special Comorbidities" = c("Dementia_2", "Metastatic_Cancer_2", "Leukemia_2")),
               format_to = "Console")

aghaynes commented 4 years ago

This is the code

b1 <- atable(d,
             target_cols = var_comp_info$variable_name,
             group_col = "isdead_6m", indent_character = "    ",
             blocks = list("General comorbidities" = var_comp_info$variable_name[grepl("^com.+(factor|score)$", var_comp_info$variable_name)],
                           "Charlson comorbidities" = var_comp_info$variable_name[grepl("^cci.+(factor|score)$", var_comp_info$variable_name)]
                           )
)

where var_comp_info$variable_name is

 [1] "age"                               "sex.factor"                       
 [3] "bmi"                               "place_of_living.factor"           
 [5] "com_dementia.factor"               "com_metastatic_cancer.factor"     
 [7] "com_leukemia_malign_cancer.factor" "com_lymphoma_myeloma.factor"      
 [9] "com_chron_pulmonary_dis.factor"    "com_cor_artery_dis.factor"        
[11] "com_cong_heart_failure.factor"     "com_chron_liver_dis.factor"       
[13] "com_chron_renal_dis.factor"        "com_dm_w_endorg.factor"           
[15] "com_periph_vasc_dis.factor"        "Metastases.factor"                
[17] "apache_respiration.factor"         "apache_cardiovascular.factor"     
[19] "apache_renal.factor"               "apache_liver.factor"              
[21] "apache_immunosystem.factor"        "apache_nb.factor"                 
[23] "cci_ami.factor"                    "cci_chf.factor"                   
[25] "cci_pvd.factor"                    "cci_cevd.factor"                  
[27] "cci_dementia.factor"               "cci_copd.factor"                  
[29] "cci_rheumd.factor"                 "cci_pud.factor"                   
[31] "cci_mld.factor"                    "cci_diab.factor"                  
[33] "cci_diabwc.factor"                 "cci_hp.factor"                    
[35] "cci_rend.factor"                   "cci_canc.factor"                  
[37] "cci_msld.factor"                   "cci_metacanc.factor"              
[39] "cci_aids.factor"                   "cci_score"                        
[41] "cci_index.factor"                  "cci_wscore"                       
[43] "cci_windex.factor"                 "walking.factor"                   
[45] "indoor.factor"                     "stairs.factor"                    
[47] "dressing.factor"                   "transfer.factor"                  
[49] "Dailyliving.factor"                "adl.factor"                       
[51] "hosp_adm_pre_6m.factor"            "icu_adm_pre_6m.factor"            
[53] "adm_type.factor"                   "icu_diagnosis.factor"             
[55] "icu_adm_source.factor"             "apache_II_wert_0"                 
[57] "treat_limit_during_icu_bin.factor" "patverf_at_icuadm.factor"         
[59] "saps_II_wert_0"                    "dobutamin_0.factor"               
[61] "noradr_0.factor"                   "adre_0.factor"                    
[63] "vasoactive_0.factor"               "sofa_0"                           
[65] "s_crea_wert_0"                     "agba_lac_wert_0"                  
[67] "wbc_wert_0"                        "sofa_1_nfail_gt2"  

The issue seems to be with the cci_* and com_* variables. These variables are constructed using sapply calls with the following functions (on different sets of columns)

# com_*
function(x){
  factor(x, -1:1, c("No previous data", "No", "Yes"), ordered = TRUE)
}

#cci_*
function(x){
  factor(x, 0:1, c("No", "Yes"))
}

As there's a lot of variables, I'll just give the summary of the variables...

summary(d)
      pid             age         sex.factor        bmi                        place_of_living.factor
 Min.   :10001   Min.   :18.00   Male  :1189   Min.   :11.05   [1] Long-term care facility: 164      
 1st Qu.:10448   1st Qu.:54.00   Female: 583   1st Qu.:22.45   [3] Home                   :1509      
 Median :10898   Median :66.00                 Median :25.50   NA's                       :  99      
 Mean   :10900   Mean   :63.02                 Mean   :26.44                                         
 3rd Qu.:11355   3rd Qu.:74.00                 3rd Qu.:29.55                                         
 Max.   :11801   Max.   :95.00                 Max.   :49.38                                         
                                               NA's   :638                                           
 com_dementia.factor com_metastatic_cancer.factor com_leukemia_malign_cancer.factor com_lymphoma_myeloma.factor
 Length:1772         Length:1772                  Length:1772                       Length:1772                
 Class1:labelled     Class1:labelled              Class1:labelled                   Class1:labelled            
 Class2:character    Class2:character             Class2:character                  Class2:character           
 Mode  :character    Mode  :character             Mode  :character                  Mode  :character           

 com_chron_pulmonary_dis.factor com_cor_artery_dis.factor com_cong_heart_failure.factor
 Length:1772                    Length:1772               Length:1772                  
 Class1:labelled                Class1:labelled           Class1:labelled              
 Class2:character               Class2:character          Class2:character             
 Mode  :character               Mode  :character          Mode  :character             

 com_chron_liver_dis.factor com_chron_renal_dis.factor com_dm_w_endorg.factor com_periph_vasc_dis.factor
 Length:1772                Length:1772                Length:1772            Length:1772               
 Class1:labelled            Class1:labelled            Class1:labelled        Class1:labelled           
 Class2:character           Class2:character           Class2:character       Class2:character          
 Mode  :character           Mode  :character           Mode  :character       Mode  :character          

 Metastases.factor apache_respiration.factor apache_cardiovascular.factor apache_renal.factor
 [0] No :1637      [0] No :1499              [0] No :1551                 [0] No :1593       
 [1] Yes:  51      [1] Yes:  96              [1] Yes:  44                 [1] Yes:  32       
 NA's   :  84      NA's   : 177              NA's   : 177                 NA's   : 147       

 apache_liver.factor apache_immunosystem.factor apache_nb.factor cci_ami.factor     cci_chf.factor    
 [0] No :1477        [0] No :1452               0   :1114        Length:1772        Length:1772       
 [1] Yes: 162        [1] Yes: 171               1-2 : 378        Class1:labelled    Class1:labelled   
 NA's   : 133        NA's   : 149               >2  :  11        Class2:character   Class2:character  
                                                NA's: 269        Mode  :character   Mode  :character  

 cci_pvd.factor     cci_cevd.factor    cci_dementia.factor cci_copd.factor    cci_rheumd.factor 
 Length:1772        Length:1772        Length:1772         Length:1772        Length:1772       
 Class1:labelled    Class1:labelled    Class1:labelled     Class1:labelled    Class1:labelled   
 Class2:character   Class2:character   Class2:character    Class2:character   Class2:character  
 Mode  :character   Mode  :character   Mode  :character    Mode  :character   Mode  :character  

 cci_pud.factor     cci_mld.factor     cci_diab.factor    cci_diabwc.factor  cci_hp.factor     
 Length:1772        Length:1772        Length:1772        Length:1772        Length:1772       
 Class1:labelled    Class1:labelled    Class1:labelled    Class1:labelled    Class1:labelled   
 Class2:character   Class2:character   Class2:character   Class2:character   Class2:character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  

 cci_rend.factor    cci_canc.factor    cci_msld.factor    cci_metacanc.factor cci_aids.factor   
 Length:1772        Length:1772        Length:1772        Length:1772         Length:1772       
 Class1:labelled    Class1:labelled    Class1:labelled    Class1:labelled     Class1:labelled   
 Class2:character   Class2:character   Class2:character   Class2:character    Class2:character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character    Mode  :character  

   cci_score     cci_index.factor   cci_wscore     cci_windex.factor                 walking.factor
 Min.   :0.000   0  :333          Min.   : 0.000   0  :333           [1] Independent        :1519  
 1st Qu.:1.000   1-2:853          1st Qu.: 1.000   1-2:563           [2] Partially dependent:  67  
 Median :2.000   3-4:474          Median : 2.000   3-4:496           [3] Fully dependent    :  58  
 Mean   :1.983   >=5:112          Mean   : 2.901   >=5:380           NA's                   : 128  
 3rd Qu.:3.000                    3rd Qu.: 4.000                                                   
 Max.   :8.000                    Max.   :14.000                                                   

                 indoor.factor                  stairs.factor                 dressing.factor
 [1] Independent        :1551   [1] Independent        :1516   [1] Independent        :1563  
 [2] Partially dependent:  64   [2] Partially dependent:  66   [2] Partially dependent:  66  
 [3] Fully dependent    :  37   [3] Fully dependent    :  60   [3] Fully dependent    :  19  
 NA's                   : 120   NA's                   : 130   NA's                   : 124  

                transfer.factor               Dailyliving.factor                   adl.factor  
 [1] Independent        :1566   [1] Independent        :1449     [1] Independent        :1403  
 [2] Partially dependent:  62   [2] Partially dependent: 210     [2] Partially dependent: 210  
 [3] Fully dependent    :  22   [3] Fully dependent    :  29     [3] Fully dependent    :  75  
 NA's                   : 122   NA's                   :  84     NA's                   :  84  

        hosp_adm_pre_6m.factor        icu_adm_pre_6m.factor           adm_type.factor
 [0] No admission  :1336       [0] No admission  :1535      Medical elective  : 205  
 [1] 1-2 admissions: 350       [1] 1-2 admissions: 220      Medical emergency :1017  
 [2] >2 admissions :  86       [2] >2 admissions :  17      Surgical elective : 222  
                                                            Surgical emergency: 328  

                           icu_diagnosis.factor                              icu_adm_source.factor
 [9] Respiratory failure             :266       [1] Ward                                :304      
 [4] Emergency sugery                :233       [2] ICU or IMC                          :166      
 [8] Other                           :209       [3] ED                                  :605      
 [7] Non-traumatic cerebral pathology:183       [4] OR, recovery room or procedure suite:601      
 [2] Cardiovascular disease          :176       [5] Other (incl. external hospital)     : 96      
 [1] Cardiac arrest                  :174                                                         
 (Other)                             :531                                                         
 apache_II_wert_0 treat_limit_during_icu_bin.factor patverf_at_icuadm.factor saps_II_wert_0  
 Min.   : 0.00    No :1313                          [0] No :1700             Min.   :  8.00  
 1st Qu.:21.00    Yes: 459                          [1] Yes:  72             1st Qu.: 48.00  
 Median :27.00                                                               Median : 62.00  
 Mean   :26.87                                                               Mean   : 61.09  
 3rd Qu.:33.00                                                               3rd Qu.: 75.00  
 Max.   :54.00                                                               Max.   :120.00  
 NA's   :313                                                                 NA's   :213     
 dobutamin_0.factor noradr_0.factor adre_0.factor  vasoactive_0.factor     sofa_0       s_crea_wert_0   
 [0] No :1519       [0] No :773     [0] No :1424   [0] No : 670        Min.   : 0.000   Min.   :  21.0  
 [1] Yes: 253       [1] Yes:999     [1] Yes: 348   [1] Yes:1102        1st Qu.: 6.000   1st Qu.:  73.0  
                                                                       Median : 8.000   Median : 104.0  
                                                                       Mean   : 8.523   Mean   : 137.4  
                                                                       3rd Qu.:11.000   3rd Qu.: 163.0  
                                                                       Max.   :20.000   Max.   :1247.0  
                                                                       NA's   :656      NA's   :360     
 agba_lac_wert_0    wbc_wert_0     sofa_1_nfail_gt2 isdead_6m 
 Min.   : 0.200   Min.   :  0.01   Mode :logical    No :1174  
 1st Qu.: 1.100   1st Qu.:  7.86   FALSE:1210       Yes: 598  
 Median : 1.900   Median : 11.40   TRUE :562                  
 Mean   : 3.065   Mean   : 12.80                              
 3rd Qu.: 3.900   3rd Qu.: 15.80                              
 Max.   :23.000   Max.   :109.00                              
 NA's   :115      NA's   :279  

It looks like the com and cci variables are not proper factors though, based on the summary... maybe that's got something to do with it...

If I restrict the atable call to the com* and cci* variables, everything is fine - all options are in the correct order... even with the blocks... so there must be some interaction with another variable... although they all seem to be coded No/Yes....

arminstroebel commented 4 years ago

Thanks for your time! These are A LOT OF variables!

Just a quick guess: I am not sure how the class 'labelled' of e.g. variable 'com_metastatic_cancer.factor' was created. I guess the class is from package 'labelled', created with function 'to_labelled', and you are reading an SPSS or SAS file with package haven or foreign.

See the help ?haven::labelled. Quote: "This class (labelled) provides few methods, as I expect you'll coerce to a standard R class (e.g. a factor()) soon after importing. Unfortunately it's not possible to make as.factor work for labelled objects so instead use as_factor. This works for all types of labelled vectors."

So perhaps use haven::as_factor() instead of factor() of the base package on your data: d <- haven::as_factor(d). Then summary(d) should show proper factors, and atable should hopefully act as expected.

Internally atable calls statistics() on every target_col. statistics() has no method for class labelled, but it has one for class character. And statistics.character() also calls factor().

arminstroebel commented 4 years ago

One more thought:

There could be duplicated aliases.

Run this code on your data.frame d with all variables:

Alias_mapping = atable::create_alias_mapping(d)
b = duplicated(Alias_mapping$new, fromLast = TRUE) | duplicated(Alias_mapping$new, fromLast = FALSE)

b should be all FALSE.

Show duplicated aliases: Alias_mapping[b, ] This must be empty! Or else atable will mix up the variables in the output. This is a possible explanation of the shuffling.

Currently atable does not check for this kind of name clash. I think I will add this check in the next version of atable.

aghaynes commented 4 years ago

I am not sure how to create the class 'labelled' of e.g. variable 'com_metastatic_cancer.factor'. I guess the class is from package 'labelled' created with function 'to_labelled' and you are reading a SPSS or SAS file with package haven or foreign.

Nope, an xlsx.

comorb1 <- readxl::read_xlsx(file.path(or, "comorbidities for Alan.xlsx"), sheet = 2)
# at this point its a set of numeric variables
comorb1a <- as.data.frame(sapply(comorb1[, 4:ncol(comorb1)], function(x){
  factor(x, -1:1, c("No previous data", "No", "Yes"), ordered = TRUE)
}))
names(comorb1a) <- paste0(names(comorb1a), ".factor")
comorb1 <- cbind(comorb1, comorb1a)

But, I see now that the sapply doesn't return factors, but characters... (as opposed to a small test that did, where it returned factors)

> str(comorb1a)
'data.frame':   2010 obs. of  11 variables:
 $ dementia              : chr  "No previous data" "No" "No" "No" ...
 $ metastatic_cancer     : chr  "No previous data" "No" "Yes" "No" ...
 $ leukemia_malign_cancer: chr  "No previous data" "No" "Yes" "No" ...
 $ lymphoma_myeloma      : chr  "No previous data" "No" "No" "No" ...
 $ chron_pulmonary_dis   : chr  "No previous data" "No" "No" "No" ...
 $ cor_artery_dis        : chr  "No previous data" "No" "No" "No" ...
 $ cong_heart_failure    : chr  "No previous data" "No" "No" "No" ...
 $ chron_liver_dis       : chr  "No previous data" "No" "No" "No" ...
 $ chron_renal_dis       : chr  "No previous data" "No" "No" "No" ...
 $ dm_w_endorg           : chr  "No previous data" "No" "No" "No" ...
 $ periph_vasc_dis       : chr  "No previous data" "No" "No" "No" ...
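The loss of the factor class can be reproduced on toy data. The following is a minimal sketch; the data frame `raw` and the helper `to_factor` are made up for illustration and only mirror the structure of the code above. It shows that sapply() simplifies a list of equal-length factors to a character matrix, while lapply() keeps the factors and their level order.

```r
# Toy numeric codes standing in for the spreadsheet columns (hypothetical data)
raw <- data.frame(dementia = c(-1, 0, 1, 0),
                  leukemia = c(0, 0, 1, -1))

# Same recoding as in the thread, wrapped in a named helper
to_factor <- function(x) {
  factor(x, levels = -1:1, labels = c("No previous data", "No", "Yes"))
}

# sapply() simplifies the list of factors to a character matrix,
# so the factor class and the intended level order are lost
via_sapply <- sapply(raw, to_factor)
is.character(via_sapply)

# lapply() returns a list of factors; assigning into [] keeps the data.frame
via_lapply <- raw
via_lapply[] <- lapply(raw, to_factor)
levels(via_lapply$dementia)
```

With the lapply() variant, every column is a proper factor with the levels in the order given to factor().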

The labelled part comes from Hmisc::label, which you implemented as an alias, no?

Show duplicated aliases: Alias_mapping[b, ] This must be empty! Or else atable will mix up the variables in the output. This is a possible explanation of the shuffling.

This is indeed empty... 4 variables, no observations.

arminstroebel commented 4 years ago

So this gets even more confusing.

Yes, I added support for class labelled of the Hmisc-Package. The call of atable::create_alias_mapping(d) should return the aliases that you defined.

Let's do a real Minimal Working Example: as you cannot share your data, we need some other way to reproduce the shuffling.

atable does not need the full data.frame, it just needs the classes of the columns.

So when d is the data.frame, that produces the shuffling in atable, you can create an empty data.frame with the same column classes by e <- d[FALSE, ]

When you now call atable on e, you should get the same shuffling as for d. This should work when all columns are factors. It does not work with character columns. Also, the aliases should be preserved in e.
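As a small illustration of the zero-row trick, here is a sketch on a made-up data.frame (the columns `x` and `y` are hypothetical). It only checks that column classes and factor levels survive the subsetting; whether atable then behaves identically on `e` is the assumption stated above, not something this snippet verifies.

```r
# Hypothetical stand-in for the confidential data
d <- data.frame(x = factor(c("No", "Yes", "No"), levels = c("Yes", "No")),
                y = c(1.5, 2, 3.25))

# Drop all rows, keep all columns
e <- d[FALSE, ]

nrow(e)          # no observations left
sapply(e, class) # column classes are unchanged
levels(e$x)      # factor levels and their order are unchanged
```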

Can you save() and send me this data.frame e (via Mail, or perhaps via this GitHub here)?

aghaynes commented 4 years ago

I came to the same conclusion on Monday and I sent you an R script via email (web.de account). Ca 11:30am. 😃

arminstroebel commented 4 years ago

Got the file and could reproduce the shuffle. Let the search begin!

aghaynes commented 4 years ago

great! good luck! 😄

arminstroebel commented 4 years ago

I was able to fix something: The row containing p-values and test statistics is now the first of every variable. This fixes the shuffling with 'Dementia (Charlson)' in post: https://github.com/arminstroebel/atable/issues/7#issuecomment-669034120

I will upload the fix to CRAN as atable version 0.1.8. This version contains other fixes as well. The shuffling with 'Dementia (Charlson)' occurs with atable version 0.1.7, but not with 0.1.8.

What I was not able to fix:

The order of the levels of the first target_col overwrites the order of the other target cols in the atable-output, when the target_cols share some labels. This happens with and without blocking. And also with and without group_col.

Below is an example to demonstrate this kind of shuffling. The variable f1 has labels A, B, C, D. The variable f2 has labels D, C, B, A, i.e. in reversed order. f3 has some labels in common with f1, but not all.

Calling atable with f1 as first target_cols will order the labels as A, B, C, D, as f1 is the first target_col. Calling atable with f2 as first target_cols will order the labels as D, C, B, A, as f2 is the first target_col.

Above is another example: https://github.com/arminstroebel/atable/issues/7#issuecomment-670446940

Internally atable stores all labels of all target cols in one column of a data.frame. This column is a factor and the labels are c()-ed together. Then a sort of this column happens implicitly by plyr::ddply or explicitly by doBy::orderBy() or merge(). I presume changing this needs a bigger rewrite of the package.
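The effect of pooling labels into one factor column can be seen in plain base R. The following is only a sketch of the mechanism described above, not atable's actual code; `f1`, `f2`, `pooled` and `sorted` are illustrative names.

```r
# Two factors with the same labels but opposite level order
f1 <- factor(c("A", "B", "C", "D"), levels = c("A", "B", "C", "D"))
f2 <- factor(c("D", "C", "B", "A"), levels = c("D", "C", "B", "A"))

# Pool the labels of both variables into one factor column;
# the levels are taken in order of first appearance, i.e. from f1
pooled <- data.frame(
  variable = rep(c("f1", "f2"), each = 4),
  label = factor(c(as.character(f1), as.character(f2)),
                 levels = unique(c(levels(f1), levels(f2))))
)

# Sorting by variable and label puts the rows of f2 into f1's order
sorted <- pooled[order(pooled$variable, pooled$label), ]
as.character(sorted$label[sorted$variable == "f2"])
```

Because the pooled column has a single set of levels, any sort on it can only produce one order, which is the order of whichever variable contributed its labels first.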


Example:
# create factors with colliding labels
library(atable)
atable_options(format_to="Console")

# n is the number of observations, labels the factor labels in the desired order
get_data = function(n, labels)factor(sample(seq_along(labels), size=n, replace = TRUE), levels = seq_along(labels), labels = labels)

n=42

DD = data.frame(f1 = get_data(n, c("A", "B", "C", "D")),
                f2 = get_data(n, c("D", "C", "B", "A")),
                f3 = get_data(n, c("F", "B", "C", "G")),
                group = get_data(n, c("a","b"))
                )

# order is A < B < C < D for all variables f1, f2 and f3:
atable::atable(DD,
               target_cols = c("f1", "f2", "f3"),
               group_col="group")

# order is D < C < B < A:
atable::atable(DD,
               target_cols = c("f2", "f1", "f3"),
               group_col="group")

# order follows f3, the first target_col (its labels are F, B, C, G):
atable::atable(DD,
               target_cols = c("f3", "f1", "f2"),
               group_col="group")

aghaynes commented 4 years ago

At least part of the puzzle is fixed and the root cause seems to have been found... thanks!

Internally atable stores all labels of all target cols in one column of a data.frame. This column is a factor and the labels are c()-ed together. Then a sort of this column happens implicitly by plyr::ddply or explicitly by doBy::orderBy() or merge()

I think I mentioned before at some point, maybe it's safer/easier to keep a 2 variable dataframe... or include the variable in the factor level and parse it away later
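A rough sketch of the second idea, in base R only. This is not how atable works; `make_keys` is a hypothetical helper, and the "::" separator assumes that neither variable names nor labels contain "::".

```r
# Two factors with the same labels but opposite level order
f1 <- factor(c("A", "B", "C", "D"), levels = c("A", "B", "C", "D"))
f2 <- factor(c("D", "C", "B", "A"), levels = c("D", "C", "B", "A"))

# Hypothetical helper: build keys of the form "<variable>::<label>"
make_keys <- function(f, name) paste(name, levels(f), sep = "::")

# Every key is unique, so each variable keeps its own level order
keys <- c(make_keys(f1, "f1"), make_keys(f2, "f2"))
pooled <- data.frame(key = factor(keys, levels = keys))

# A sort on the key no longer mixes up the orders of f1 and f2
sorted <- pooled[order(pooled$key), , drop = FALSE]

# Parse the variable name away again for display
sorted$variable <- sub("::.*$", "", as.character(sorted$key))
sorted$label <- sub("^[^:]*::", "", as.character(sorted$key))
sorted$label[sorted$variable == "f2"]
```

Here the labels of f2 come out as D, C, B, A even though f1 was pooled first, because the shared labels no longer collide in one set of levels.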