group_by does not work with dfSummary

DeFilippis commented 5 years ago

I am trying to get a dfSummary() by a specific grouping variable in a dplyr chain. For example:

library(dplyr)
library(summarytools)
data(mtcars)
mtcars %>% group_by(cyl) %>% dfSummary()

However, in the R console, this outputs the summary statistics for the ungrouped data. As far as I can tell from the documentation, this is the correct way of doing this. What am I doing wrong?

dcomtois commented 4 years ago

Hello,

Sorry for the late reply. You have the correct syntax. What OS are you using, and what summarytools version do you have? I've tested it using various configurations and the results display correctly here (using iris for brevity):

> iris %>% group_by(Species) %>% dfSummary(na.col = FALSE)
Data Frame Summary  
iris  
Group: Species = setosa  
Dimensions: 50 x 5  
Duplicates: 0  

------------------------------------------------------------------------------------------------
No   Variable       Stats / Values          Freqs (% of Valid)   Graph                  Valid   
---- -------------- ----------------------- -------------------- ---------------------- --------
1    Sepal.Length   Mean (sd) : 5 (0.4)     15 distinct values         : .              50      
     [numeric]      min < med < max:                                   : :              (100%)  
                    4.3 < 5 < 5.8                                    : : : .                    
                    IQR (CV) : 0.4 (0.1)                         . : : : : :                    
                                                                 : : : : : : : :                

2    Sepal.Width    Mean (sd) : 3.4 (0.4)   16 distinct values       :                  50      
     [numeric]      min < med < max:                                 :                  (100%)  
                    2.3 < 3.4 < 4.4                                  : .                        
                    IQR (CV) : 0.5 (0.1)                           . : :                        
                                                                   : : : .                      

3    Petal.Length   Mean (sd) : 1.5 (0.2)   1.00 :  1 ( 2.0%)                           50      
     [numeric]      min < med < max:        1.10 :  1 ( 2.0%)                           (100%)  
                    1 < 1.5 < 1.9           1.20 :  2 ( 4.0%)                                   
                    IQR (CV) : 0.2 (0.1)    1.30 :  7 (14.0%)    II                             
                                            1.40 : 13 (26.0%)    IIIII                          
                                            1.50 : 13 (26.0%)    IIIII                          
                                            1.60 :  7 (14.0%)    II                             
                                            1.70 :  4 ( 8.0%)    I                              
                                            1.90 :  2 ( 4.0%)                                   

4    Petal.Width    Mean (sd) : 0.2 (0.1)   0.10 :  5 (10.0%)    II                     50      
     [numeric]      min < med < max:        0.20 : 29 (58.0%)    IIIIIIIIIII            (100%)  
                    0.1 < 0.2 < 0.6         0.30 :  7 (14.0%)    II                             
                    IQR (CV) : 0.1 (0.4)    0.40 :  7 (14.0%)    II                             
                                            0.50 :  1 ( 2.0%)                                   
                                            0.60 :  1 ( 2.0%)                                   

5    Species        1. setosa               50 (100.0%)          IIIIIIIIIIIIIIIIIIII   50      
     [factor]       2. versicolor            0 (  0.0%)                                 (100%)  
                    3. virginica             0 (  0.0%)                                         
------------------------------------------------------------------------------------------------

Group: Species = versicolor  
Dimensions: 50 x 5  
Duplicates: 0  

------------------------------------------------------------------------------------------------
No   Variable       Stats / Values          Freqs (% of Valid)   Graph                  Valid   
---- -------------- ----------------------- -------------------- ---------------------- --------
1    Sepal.Length   Mean (sd) : 5.9 (0.5)   21 distinct values       :                  50      
     [numeric]      min < med < max:                                 :                  (100%)  
                    4.9 < 5.9 < 7                                    : :                        
                    IQR (CV) : 0.7 (0.1)                           : : : :                      
                                                                 : : : : :                      

2    Sepal.Width    Mean (sd) : 2.8 (0.3)   14 distinct values           :              50      
     [numeric]      min < med < max:                                   . :              (100%)  
                    2 < 2.8 < 3.4                                    . : :                      
                    IQR (CV) : 0.5 (0.1)                           : : : : :                    
                                                                 : : : : : : .                  

3    Petal.Length   Mean (sd) : 4.3 (0.5)   19 distinct values       :                  50      
     [numeric]      min < med < max:                                 :                  (100%)  
                    3 < 4.3 < 5.1                                  : : :                        
                    IQR (CV) : 0.6 (0.1)                           : : :                        
                                                                 : : : :                        

4    Petal.Width    Mean (sd) : 1.3 (0.2)   1.00 :  7 (14.0%)    II                     50      
     [numeric]      min < med < max:        1.10 :  3 ( 6.0%)    I                      (100%)  
                    1 < 1.3 < 1.8           1.20 :  5 (10.0%)    II                             
                    IQR (CV) : 0.3 (0.1)    1.30 : 13 (26.0%)    IIIII                          
                                            1.40 :  7 (14.0%)    II                             
                                            1.50 : 10 (20.0%)    IIII                           
                                            1.60 :  3 ( 6.0%)    I                              
                                            1.70 :  1 ( 2.0%)                                   
                                            1.80 :  1 ( 2.0%)                                   

5    Species        1. setosa                0 (  0.0%)                                 50      
     [factor]       2. versicolor           50 (100.0%)          IIIIIIIIIIIIIIIIIIII   (100%)  
                    3. virginica             0 (  0.0%)                                         
------------------------------------------------------------------------------------------------

Group: Species = virginica  
Dimensions: 50 x 5  
Duplicates: 1  

------------------------------------------------------------------------------------------------
No   Variable       Stats / Values          Freqs (% of Valid)   Graph                  Valid   
---- -------------- ----------------------- -------------------- ---------------------- --------
1    Sepal.Length   Mean (sd) : 6.6 (0.6)   21 distinct values         :                50      
     [numeric]      min < med < max:                                   :                (100%)  
                    4.9 < 6.5 < 7.9                                    : .                      
                    IQR (CV) : 0.7 (0.1)                             : : : . .                  
                                                                 .   : : : : :                  

2    Sepal.Width    Mean (sd) : 3 (0.3)     13 distinct values       . :                50      
     [numeric]      min < med < max:                                 : :                (100%)  
                    2.2 < 3 < 3.8                                    : : :                      
                    IQR (CV) : 0.4 (0.1)                           : : : : :                    
                                                                 . : : : : : . .                

3    Petal.Length   Mean (sd) : 5.6 (0.6)   20 distinct values     : :                  50      
     [numeric]      min < med < max:                               : :                  (100%)  
                    4.5 < 5.5 < 6.9                              : : :                          
                    IQR (CV) : 0.8 (0.1)                         : : : .                        
                                                                 : : : : :                      

4    Petal.Width    Mean (sd) : 2 (0.3)     12 distinct values     : .   .              50      
     [numeric]      min < med < max:                               : : : :              (100%)  
                    1.4 < 2 < 2.5                                  : : : :                      
                    IQR (CV) : 0.5 (0.1)                         . : : : :                      
                                                                 : : : : : :                    

5    Species        1. setosa                0 (  0.0%)                                 50      
     [factor]       2. versicolor            0 (  0.0%)                                 (100%)  
                    3. virginica            50 (100.0%)          IIIIIIIIIIIIIIIIIIII           
------------------------------------------------------------------------------------------------
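
For reference, grouped summaries can also be requested through summarytools' own stby() function rather than dplyr's group_by(). The following is a minimal sketch, assuming summarytools is installed; the object name by_species is only illustrative, and this is an alternative route, not a fix for the behaviour reported above.

```r
library(summarytools)

# stby() splits the data by INDICES and applies FUN to each subset,
# much like base::by(). Extra arguments (here na.col) are passed to FUN.
by_species <- stby(data = iris, INDICES = iris$Species,
                   FUN = dfSummary, na.col = FALSE)

# One summary per level of Species is expected.
length(by_species)
```

If this produces one block per group while the group_by() chain does not, the installed summarytools version is a reasonable first thing to check.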

gustavobrp commented 4 years ago

Hello Dominic, I'm facing a problem with the same function. I tried the following command:

summary_data <- data %>% group_by(variable) %>% dfSummary(na.col = FALSE, max.distinct.values = 8, max.string.width = 100, split.tables = 50, round.digits = 2, varnumbers = FALSE, style = 'multiline')

I also tried selecting a subset of the variables first, which works when I use dfSummary without grouping:

summary_data <- subset(data, select = c(91,5:7,9:46)) %>% group_by(variable) %>% dfSummary(na.col = FALSE, max.distinct.values = 8, max.string.width = 100, split.tables = 50, round.digits = 2, varnumbers = FALSE, style = 'multiline')

This is what I get when I use view() to see the results: one table is nested inside the other.

Thanks a lot.

dcomtois commented 4 years ago

@gustavobrp I've seen this occur when data contain non-ascii characters, but it's not systematic... Is this sensitive data, or could you send it so I can dig further?

Also, you could try installing from GitHub using remotes::install_github or devtools::install_github and see if you get better results.

Thx

gustavobrp commented 4 years ago

Hi Dominic! Thanks for replying.

Unfortunately it is sensitive data, so I couldn't send it. But I will try to replicate this with another dataset.

I will follow your suggestion and see what I can get.

Thanks a lot!

dcomtois commented 4 years ago

Hi, I think it should work fine now. If you get a chance, pls let me know if you can confirm. Thx!

(as usual, test using devtools or remotes to install "dev-current" branch.)

gustavobrp commented 4 years ago

Hey Dominic,

I tried it here and couldn't get it to work on my end. I updated the package using remotes.

Here is the head of the data set that I used for testing, and the way that I tested the code. Also, there is some problem with the encoding of some characters. Do you have any suggestions for what I can change in the code?

sumario <- bd.subset_amostra %>% 
  group_by(dep_ORIGEM_ESCOLAR) %>% 
  dfSummary()

view(sumario)

> head(bd.subset_amostra)
  COD_IES_2016 COD_CURSO_2016  dep_SEXO dep_IDADE    dep_IDADE_FAIXA dep_GRAU_CURSO
1          585          14220 Masculino        18        Até 18 anos    Bacharelado
2           17         115800  Feminino        18        Até 18 anos    Bacharelado
3          584          90202 Masculino        26 Entre 25 e 29 anos    Bacharelado
4         1082         103368  Feminino        23 Entre 19 e 24 anos    Tecnológico
5          789          97073  Feminino        19 Entre 19 e 24 anos    Bacharelado
6          548         105440 Masculino        18        Até 18 anos    Bacharelado
  dep_TURNO_CURSO                         dep_EXTRACURRICULAR     dep_APOIO_SOCIAL
1          Diurno Não participou de atividade extracurricular          Não recebeu
2        Integral Não participou de atividade extracurricular          Não recebeu
3         Noturno Não participou de atividade extracurricular          Não recebeu
4          Diurno Não participou de atividade extracurricular Recebeu apoio social
5        Integral Não participou de atividade extracurricular          Não recebeu
6          Diurno Não participou de atividade extracurricular          Não recebeu
  dep_ORGANIZACAO_ACADEMICA dep_COR_RACA          dep_COTAS dep_ORIGEM_ESCOLAR
1              Universidade       Branca Ampla concorrência     Escola privada
2              Universidade       Branca Ampla concorrência     Escola pública
3              Universidade       Branca            Cotista     Escola pública
4          Insituto Federal        Preta Ampla concorrência     Escola privada
5              Universidade        Parda Ampla concorrência     Escola pública
6              Universidade        Preta            Cotista     Escola pública
       ind_EVASAO_BIN ind_EVASAO_TER                      MUDANCA_completa SITUACAO_2016
1 Permaneceu no curso       Cursando Mudança de curso, instituição e setor      Cursando
2 Permaneceu no curso       Cursando                              Cursando      Cursando
3 Permaneceu no curso       Cursando                              Cursando      Cursando
4 Permaneceu no curso       Cursando                              Cursando      Cursando
5 Permaneceu no curso       Cursando                              Cursando      Cursando
6 Permaneceu no curso       Cursando                              Cursando      Cursando
                SITUACAO_2017 COD_IES_CURSO_2016      CO_ALUNO ind_EVASAO_BIN_ES
1 Evasão do curso/instituição          585314E02 1.976371e-312        Não evadiu
2                    Cursando           17726F01 1.976374e-312        Não evadiu
3                    Cursando          584344C02 1.976376e-312        Não evadiu
4                    Cursando         1082582C05 1.976325e-312        Não evadiu
5                    Cursando          789581A05 1.976375e-312        Não evadiu
6                    Cursando          548345A01 1.976374e-312        Não evadiu
  ind_EVASAO_BIN_CI
1        Não evadiu
2        Não evadiu
3        Não evadiu
4        Não evadiu
5        Não evadiu
6        Não evadiu

The information in this data set is public, so I can send it to you if you want to take a look. I made a sample from it. I left it in Portuguese, because that could help to understand the problem.

Thanks a lot.

dcomtois commented 4 years ago

Thanks for testing it @gustavobrp. It seems to be happening when accented characters are present. I'm looking for a way to re-encode them without causing other problems :)

I'll keep you posted.

gustavobrp commented 4 years ago

Ok, if you need more testing, let me know.

dcomtois commented 4 years ago

Obrigado :) I think I got it working now, can you try again after reinstalling pls?

gustavobrp commented 4 years ago

Hi Dominic, so I updated the package and also pander before trying, but it didn't work.

devtools::install_github("dcomtois/summarytools", ref = "dev-current")

devtools::install_github('rapporter/pander')

I tried again with the same data set and got the same results as before... When the data is not grouped, it's OK, apart from the problems with the accented characters that you mentioned.

sumario <- bd.subset %>% 
  group_by(dep_GRAU_CURSO) %>% 
  dfSummary()

summarytools::view(sumario)

dcomtois commented 4 years ago

@gustavobrp Thx for the follow-up. Using the partial data you posted earlier, results are good on my end, so I'm not sure what to look for next. Could you post the source of the html file generated? Thx again

gustavobrp commented 4 years ago

Weird. Maybe I'm doing something wrong, or is there some problem with my R installation?

The source code is big, so I uploaded a txt file for you. But if you want a better way to take a look, let me know.

html source.txt

dcomtois commented 4 years ago

Thanks @gustavobrp. It's hard to say for sure, but I suspect your R session still had the former version of summarytools loaded, since "Até" was not converted to its HTML-encoded form in the file, which the last update should be doing. If you don't mind, I'd suggest removing summarytools altogether, closing and restarting RStudio or RGui (or whichever interface you're using), reinstalling (with ref = "dev-current"), loading the package and checking again. If it still produces the same results, then we'll see what can be done next to investigate further. Thanks again for your help and patience :)
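
One way to check for a stale session before and after reinstalling is to compare the version recorded in the loaded namespace with the version installed on disk. The sketch below uses base R only; the helper name pkg_status is hypothetical and not part of summarytools.

```r
# Report whether a package is installed, whether it is loaded in this
# session, and which versions are on disk and in memory.
pkg_status <- function(pkg) {
  installed <- nzchar(system.file(package = pkg))
  loaded <- pkg %in% loadedNamespaces()
  list(
    installed = installed,
    loaded = loaded,
    # Version read from the DESCRIPTION file of the installed package.
    installed_version = if (installed) as.character(utils::packageVersion(pkg)) else NA_character_,
    # Version stored when the namespace was loaded into this session.
    loaded_version = if (loaded) unname(getNamespaceVersion(pkg)) else NA_character_
  )
}

str(pkg_status("summarytools"))

# Clean reinstall, to be run in a fresh session (assumes 'remotes' is installed):
# remove.packages("summarytools")
# remotes::install_github("dcomtois/summarytools", ref = "dev-current")
```

If loaded_version differs from installed_version, the session is still running the old code and needs a restart.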

gustavobrp commented 4 years ago

Hey, no problem at all.

So, I deleted the package and reinstalled it, and I still have this problem. But now these messages show up after I run the code. Maybe they are related to why I can't get it working?

Also, it seems that with a reduced number of variables the html file works fine, but when there are more than 3, the html output breaks again.

About the characters, I tried to change the language in st_options() to pt, but the results were the same.

> sumario <- bd.subset %>% 
+   group_by(dep_GRAU_CURSO) %>% 
+   dfSummary()
Warning messages:
1: In pretty.default(range(data), n = min(nclass.Sturges(data), 250),  :
  Internal(pretty()): very small range.. corrected
2: In pretty.default(range(data), n = nclass.Sturges(data), min.n = 1) :
  Internal(pretty()): very small range.. corrected
3: In pretty.default(range(data), n = min(nclass.Sturges(data), 250),  :
  Internal(pretty()): very small range.. corrected
4: In pretty.default(range(data), n = nclass.Sturges(data), min.n = 1) :
  Internal(pretty()): very small range.. corrected
5: In pretty.default(range(data), n = min(nclass.Sturges(data), 250),  :
  Internal(pretty()): very small range.. corrected
6: In pretty.default(range(data), n = nclass.Sturges(data), min.n = 1) :
  Internal(pretty()): very small range.. corrected
> summarytools::view(sumario)
Output file written: C:\Users\GUSTAV~1\AppData\Local\Temp\RtmpIF1RLf\file1e1030423fde.html
Output file appended: C:\Users\Gustavo Bruno\AppData\Local\Temp\RtmpIF1RLf\file1e1030423fde.html
Warning message:
In readLines(f, warn = FALSE, encoding = "utf-8") :
  invalid input found on input connection 'C:\Users\GUSTAV~1\AppData\Local\Temp\RtmpIF1RLf\file1e1030423fde.html'
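
The readLines() warning above points to bytes that are not valid UTF-8 in the generated file. One way to check whether the input data already carry such strings is sketched below, using base R only; the helper name non_utf8_columns and the toy data are hypothetical and not taken from the thread.

```r
# Return the names of character or factor columns containing strings
# that are not valid UTF-8 (often Latin-1 data read without an encoding).
non_utf8_columns <- function(df) {
  bad <- vapply(df, function(col) {
    if (is.factor(col)) col <- levels(col)
    is.character(col) && !all(validUTF8(col))
  }, logical(1))
  names(df)[bad]
}

# Toy data: the bytes 41 74 E9 spell "Até" in Latin-1, which is invalid UTF-8.
latin1_ate <- rawToChar(as.raw(c(0x41, 0x74, 0xe9)))
toy <- data.frame(
  faixa = c(latin1_ate, "Entre 19 e 24 anos"),
  idade = c(18, 23),
  stringsAsFactors = FALSE
)

non_utf8_columns(toy)

# Converting from Latin-1 to UTF-8 makes the column valid.
fixed <- toy
fixed$faixa <- iconv(fixed$faixa, from = "latin1", to = "UTF-8")
non_utf8_columns(fixed)
```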

dcomtois commented 4 years ago

@gustavobrp Thanks again for the follow-up. This is quite puzzling. Would there be a way for you to share the data (privately if necessary) after making sure everything is anonymized? Also, and sorry if I did ask you that before on another thread (I'm not sure), could you tell me what system you're running?
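
For encoding problems, the locale is usually as relevant as the OS and R version. A short base R sketch that collects these details in one place follows; the list name env_report is only illustrative.

```r
# Collect the operating system, R version, and character-set settings.
env_report <- list(
  os        = Sys.info()[["sysname"]],
  r_version = R.version.string,
  locale    = Sys.getlocale("LC_CTYPE"),
  utf8      = l10n_info()[["UTF-8"]]   # TRUE when the session runs in a UTF-8 locale
)
str(env_report)
```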

gustavobrp commented 4 years ago

Sure, I can send a reduced sample from it. Can I send it to your email?

I'm running Windows 10, RStudio 1.2.5033 and R 3.6.2.

dcomtois commented 4 years ago

Great! Just make sure the problem is still there using that sample. Email is fine (dominic.comtois, gmail).

dcomtois commented 4 years ago

@gustavobrp Thanks for sending the data. Clearly we have yet another case of encoding headaches. I don't have a fix yet, but here are two alternatives to make it work in the meantime:

1. Import the csv using read.csv with the encoding = 'latin1' parameter, or

2. Use this loop to set the encoding value for all factors:

for (i in seq_along(datos)) {
  if (is.factor(datos[[i]])) {
    Encoding(levels(datos[[i]])) <- "latin1"
  }
}

Let me know if that works for you!
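
To illustrate the first alternative, here is a small self-contained sketch using base R only. The temporary file and its single column faixa are made up for the example; the point is that declaring the encoding at import marks the strings as Latin-1, so they can be converted to UTF-8 reliably afterwards.

```r
# Write a tiny CSV whose only data value is "Até" stored as Latin-1
# (0xE9 is the Latin-1 byte for "é").
csv_path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("faixa\nAt"), as.raw(0xe9), charToRaw("\n")), csv_path)

# Declaring the encoding marks the imported strings as Latin-1.
dat <- read.csv(csv_path, encoding = "latin1", stringsAsFactors = FALSE)

Encoding(dat$faixa)             # declared encoding of the imported value
validUTF8(enc2utf8(dat$faixa))  # conversion to UTF-8 yields a valid string
```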

gustavobrp commented 4 years ago

Oh I see!

Ok, so the first alternative worked! Both the group_by and the character strings are fine now.

Thanks a lot for the help!

dcomtois / summarytools

group_by does not work with dfSummary #100