Closed DeFilippis closed 3 years ago
Hello,
Sorry for the late reply. You have the correct syntax. What OS are you using, and which summarytools version do you have? I've tested it under various configurations and the results display correctly here (using iris for brevity):
> iris %>% group_by(Species) %>% dfSummary(na.col = FALSE)
Data Frame Summary
iris
Group: Species = setosa
Dimensions: 50 x 5
Duplicates: 0
------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid
---- -------------- ----------------------- -------------------- ---------------------- --------
1 Sepal.Length Mean (sd) : 5 (0.4) 15 distinct values : . 50
[numeric] min < med < max: : : (100%)
4.3 < 5 < 5.8 : : : .
IQR (CV) : 0.4 (0.1) . : : : : :
: : : : : : : :
2 Sepal.Width Mean (sd) : 3.4 (0.4) 16 distinct values : 50
[numeric] min < med < max: : (100%)
2.3 < 3.4 < 4.4 : .
IQR (CV) : 0.5 (0.1) . : :
: : : .
3 Petal.Length Mean (sd) : 1.5 (0.2) 1.00 : 1 ( 2.0%) 50
[numeric] min < med < max: 1.10 : 1 ( 2.0%) (100%)
1 < 1.5 < 1.9 1.20 : 2 ( 4.0%)
IQR (CV) : 0.2 (0.1) 1.30 : 7 (14.0%) II
1.40 : 13 (26.0%) IIIII
1.50 : 13 (26.0%) IIIII
1.60 : 7 (14.0%) II
1.70 : 4 ( 8.0%) I
1.90 : 2 ( 4.0%)
4 Petal.Width Mean (sd) : 0.2 (0.1) 0.10 : 5 (10.0%) II 50
[numeric] min < med < max: 0.20 : 29 (58.0%) IIIIIIIIIII (100%)
0.1 < 0.2 < 0.6 0.30 : 7 (14.0%) II
IQR (CV) : 0.1 (0.4) 0.40 : 7 (14.0%) II
0.50 : 1 ( 2.0%)
0.60 : 1 ( 2.0%)
5 Species 1. setosa 50 (100.0%) IIIIIIIIIIIIIIIIIIII 50
[factor] 2. versicolor 0 ( 0.0%) (100%)
3. virginica 0 ( 0.0%)
------------------------------------------------------------------------------------------------
Group: Species = versicolor
Dimensions: 50 x 5
Duplicates: 0
------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid
---- -------------- ----------------------- -------------------- ---------------------- --------
1 Sepal.Length Mean (sd) : 5.9 (0.5) 21 distinct values : 50
[numeric] min < med < max: : (100%)
4.9 < 5.9 < 7 : :
IQR (CV) : 0.7 (0.1) : : : :
: : : : :
2 Sepal.Width Mean (sd) : 2.8 (0.3) 14 distinct values : 50
[numeric] min < med < max: . : (100%)
2 < 2.8 < 3.4 . : :
IQR (CV) : 0.5 (0.1) : : : : :
: : : : : : .
3 Petal.Length Mean (sd) : 4.3 (0.5) 19 distinct values : 50
[numeric] min < med < max: : (100%)
3 < 4.3 < 5.1 : : :
IQR (CV) : 0.6 (0.1) : : :
: : : :
4 Petal.Width Mean (sd) : 1.3 (0.2) 1.00 : 7 (14.0%) II 50
[numeric] min < med < max: 1.10 : 3 ( 6.0%) I (100%)
1 < 1.3 < 1.8 1.20 : 5 (10.0%) II
IQR (CV) : 0.3 (0.1) 1.30 : 13 (26.0%) IIIII
1.40 : 7 (14.0%) II
1.50 : 10 (20.0%) IIII
1.60 : 3 ( 6.0%) I
1.70 : 1 ( 2.0%)
1.80 : 1 ( 2.0%)
5 Species 1. setosa 0 ( 0.0%) 50
[factor] 2. versicolor 50 (100.0%) IIIIIIIIIIIIIIIIIIII (100%)
3. virginica 0 ( 0.0%)
------------------------------------------------------------------------------------------------
Group: Species = virginica
Dimensions: 50 x 5
Duplicates: 1
------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid
---- -------------- ----------------------- -------------------- ---------------------- --------
1 Sepal.Length Mean (sd) : 6.6 (0.6) 21 distinct values : 50
[numeric] min < med < max: : (100%)
4.9 < 6.5 < 7.9 : .
IQR (CV) : 0.7 (0.1) : : : . .
. : : : : :
2 Sepal.Width Mean (sd) : 3 (0.3) 13 distinct values . : 50
[numeric] min < med < max: : : (100%)
2.2 < 3 < 3.8 : : :
IQR (CV) : 0.4 (0.1) : : : : :
. : : : : : . .
3 Petal.Length Mean (sd) : 5.6 (0.6) 20 distinct values : : 50
[numeric] min < med < max: : : (100%)
4.5 < 5.5 < 6.9 : : :
IQR (CV) : 0.8 (0.1) : : : .
: : : : :
4 Petal.Width Mean (sd) : 2 (0.3) 12 distinct values : . . 50
[numeric] min < med < max: : : : : (100%)
1.4 < 2 < 2.5 : : : :
IQR (CV) : 0.5 (0.1) . : : : :
: : : : : :
5 Species 1. setosa 0 ( 0.0%) 50
[factor] 2. versicolor 0 ( 0.0%) (100%)
3. virginica 50 (100.0%) IIIIIIIIIIIIIIIIIIII
------------------------------------------------------------------------------------------------
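For anyone answering the OS and version question above, here is a minimal base-R sketch that gathers those details in one place. The helper name `report_env` is hypothetical (it is not part of summarytools), and the package version is reported as NA when the package is not installed.

```r
# Collect the details asked for above: OS, R version, locale, package version.
# `report_env` is an illustrative helper, not a summarytools function.
report_env <- function(pkg = "summarytools") {
  list(
    os      = Sys.info()[["sysname"]],
    r       = R.version.string,
    locale  = Sys.getlocale("LC_CTYPE"),
    utf8    = isTRUE(l10n_info()[["UTF-8"]]),
    version = if (requireNamespace(pkg, quietly = TRUE)) {
      as.character(utils::packageVersion(pkg))
    } else {
      NA_character_  # package not installed in this library
    }
  )
}

str(report_env())
```

Pasting the output of `str(report_env())` into a reply gives the maintainer the locale as well, which matters for the encoding problems discussed later in the thread.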
Hello Dominic, I'm facing a problem with the same function. I tried the following command:
summary_data<- data %>% group_by(variable) %>% dfSummary(na.col = FALSE, max.distinct.values = 8, max.string.width = 100, split.tables = 50, round.digits = 2, varnumbers = F, style = 'multiline')
I also tried selecting a subset of the variables, which works when I use dfSummary without grouping:
summary_data <- subset(data, select = c(91,5:7,9:46)) %>% group_by(variable) %>% dfSummary(na.col = FALSE, max.distinct.values = 8, max.string.width = 100, split.tables = 50, round.digits = 2, varnumbers = F, style = 'multiline')
This is what I get when I use view() to see the results: one table is nested inside the other.
Thanks a lot.
@gustavobrp I've seen this occur when data contain non-ascii characters, but it's not systematic... Is this sensitive data, or could you send it so I can dig further?
Also, you could try installing from GitHub using remotes or devtools::install_github and see if you get better results.
Thx
Hi Dominic! Thanks for replying.
Unfortunately it is sensitive data, so I couldn't send it. But I will try to replicate this with another dataset.
I will follow your suggestion and see what I can get.
Thanks a lot!
Hi, I think it should work fine now. If you get a chance, pls let me know if you can confirm. Thx!
(as usual, test using devtools or remotes to install "dev-current" branch.)
Hey Dominic,
I tried it here and couldn't get it to work on my end. I updated the package using remotes.
Here is the head of the data set that I used for testing, and the way I tested the code. Also, there is some problem with the encoding of some characters. Do you have any suggestions for what I could change in the code?
sumario <- bd.subset_amostra %>%
  group_by(dep_ORIGEM_ESCOLAR) %>%
  dfSummary()
view(sumario)
> head(bd.subset_amostra)
COD_IES_2016 COD_CURSO_2016 dep_SEXO dep_IDADE dep_IDADE_FAIXA dep_GRAU_CURSO
1 585 14220 Masculino 18 Até 18 anos Bacharelado
2 17 115800 Feminino 18 Até 18 anos Bacharelado
3 584 90202 Masculino 26 Entre 25 e 29 anos Bacharelado
4 1082 103368 Feminino 23 Entre 19 e 24 anos Tecnológico
5 789 97073 Feminino 19 Entre 19 e 24 anos Bacharelado
6 548 105440 Masculino 18 Até 18 anos Bacharelado
dep_TURNO_CURSO dep_EXTRACURRICULAR dep_APOIO_SOCIAL
1 Diurno Não participou de atividade extracurricular Não recebeu
2 Integral Não participou de atividade extracurricular Não recebeu
3 Noturno Não participou de atividade extracurricular Não recebeu
4 Diurno Não participou de atividade extracurricular Recebeu apoio social
5 Integral Não participou de atividade extracurricular Não recebeu
6 Diurno Não participou de atividade extracurricular Não recebeu
dep_ORGANIZACAO_ACADEMICA dep_COR_RACA dep_COTAS dep_ORIGEM_ESCOLAR
1 Universidade Branca Ampla concorrência Escola privada
2 Universidade Branca Ampla concorrência Escola pública
3 Universidade Branca Cotista Escola pública
4 Insituto Federal Preta Ampla concorrência Escola privada
5 Universidade Parda Ampla concorrência Escola pública
6 Universidade Preta Cotista Escola pública
ind_EVASAO_BIN ind_EVASAO_TER MUDANCA_completa SITUACAO_2016
1 Permaneceu no curso Cursando Mudança de curso, instituição e setor Cursando
2 Permaneceu no curso Cursando Cursando Cursando
3 Permaneceu no curso Cursando Cursando Cursando
4 Permaneceu no curso Cursando Cursando Cursando
5 Permaneceu no curso Cursando Cursando Cursando
6 Permaneceu no curso Cursando Cursando Cursando
SITUACAO_2017 COD_IES_CURSO_2016 CO_ALUNO ind_EVASAO_BIN_ES
1 Evasão do curso/instituição 585314E02 1.976371e-312 Não evadiu
2 Cursando 17726F01 1.976374e-312 Não evadiu
3 Cursando 584344C02 1.976376e-312 Não evadiu
4 Cursando 1082582C05 1.976325e-312 Não evadiu
5 Cursando 789581A05 1.976375e-312 Não evadiu
6 Cursando 548345A01 1.976374e-312 Não evadiu
ind_EVASAO_BIN_CI
1 Não evadiu
2 Não evadiu
3 Não evadiu
4 Não evadiu
5 Não evadiu
6 Não evadiu
The information in this data set is public, so I can send it to you if you want to take a look. I took a sample from it. I left it in Portuguese, because that could help in understanding the problem.
Thanks a lot.
Thanks for testing it @gustavobrp. It seems to be happening when accented characters are present. I'm looking for a way to re-encode them without causing other problems :)
I'll keep you posted.
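To check whether a data frame contains the accented characters suspected above, a byte-level scan is one option. This is a minimal base-R sketch; `has_non_ascii` and `non_ascii_columns` are hypothetical helper names, and the toy data frame only borrows a few column names and values from the head() output posted earlier.

```r
# TRUE for each element containing at least one byte above 0x7F.
has_non_ascii <- function(x) {
  vapply(
    as.character(x),
    function(s) !is.na(s) && any(as.integer(charToRaw(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Names of character/factor columns holding any non-ASCII value.
non_ascii_columns <- function(df) {
  flagged <- vapply(
    df,
    function(col) (is.character(col) || is.factor(col)) && any(has_non_ascii(col)),
    logical(1)
  )
  names(df)[flagged]
}

# Toy stand-in for the real data (only a few columns, two rows).
toy <- data.frame(
  dep_IDADE_FAIXA = c("At\u00e9 18 anos", "Entre 19 e 24 anos"),
  dep_SEXO        = c("Masculino", "Feminino"),
  dep_IDADE       = c(18L, 23L)
)
non_ascii_columns(toy)
```

Working on raw bytes avoids depending on how the strings are marked, so the check behaves the same whether the accented text arrived as Latin-1 or UTF-8.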
Ok, if you need more testing, let me know.
Obrigado :) I think I got it working now, can you try again after reinstalling pls?
Hi Dominic, I updated the package, and also pander, before trying, but it didn't work.
devtools::install_github("dcomtois/summarytools", ref = "dev-current")
devtools::install_github('rapporter/pander')
I tried again with the same data set and got the same results as before... When the data is not grouped, it's OK, apart from the problems with accented characters that you mentioned.
sumario <- bd.subset %>%
  group_by(dep_GRAU_CURSO) %>%
  dfSummary()
summarytools::view(sumario)
@gustavobrp Thx for the follow-up. Using the partial data you posted earlier, results are good on my end, so I'm not sure what to look for next. Could you post the source of the html file generated? Thx again
Weird. Maybe I'm doing something wrong, or is there some problem with my R setup?
The source code is big, so I uploaded a txt file for you. But if you'd prefer another way to look at it, let me know.
Thanks @gustavobrp. It's hard to say for sure, but I suspect your R session still had the former version of summarytools loaded, as the "Até" was not encoded as "Até", as the last update should be doing. If you don't mind, I'd suggest removing summarytools altogether, closing and restarting RStudio or RGui (or whichever interface you're using), reinstalling (with ref="dev-current"), loading the package and checking again. If it still produces the same results, then we'll see what can be done next to investigate further. Thanks again for your help and patience :)
Hey, no problem at all.
So, I deleted the package and still have this problem. But now these messages show up after I run the code. Maybe they're related to why I can't get it working?
Also, it seems that with a reduced number of variables the html file works fine, but with more than 3 the html output breaks again.
About the characters, I tried changing the language in st_options() to pt, but the results were the same.
> sumario <- bd.subset %>%
+ group_by(dep_GRAU_CURSO) %>%
+ dfSummary()
Warning messages:
1: In pretty.default(range(data), n = min(nclass.Sturges(data), 250), :
Internal(pretty()): very small range.. corrected
2: In pretty.default(range(data), n = nclass.Sturges(data), min.n = 1) :
Internal(pretty()): very small range.. corrected
3: In pretty.default(range(data), n = min(nclass.Sturges(data), 250), :
Internal(pretty()): very small range.. corrected
4: In pretty.default(range(data), n = nclass.Sturges(data), min.n = 1) :
Internal(pretty()): very small range.. corrected
5: In pretty.default(range(data), n = min(nclass.Sturges(data), 250), :
Internal(pretty()): very small range.. corrected
6: In pretty.default(range(data), n = nclass.Sturges(data), min.n = 1) :
Internal(pretty()): very small range.. corrected
> summarytools::view(sumario)
Output file written: C:\Users\GUSTAV~1\AppData\Local\Temp\RtmpIF1RLf\file1e1030423fde.html
Output file appended: C:\Users\Gustavo Bruno\AppData\Local\Temp\RtmpIF1RLf\file1e1030423fde.html
Warning message:
In readLines(f, warn = FALSE, encoding = "utf-8") :
invalid input found on input connection 'C:\Users\GUSTAV~1\AppData\Local\Temp\RtmpIF1RLf\file1e1030423fde.html'
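The readLines() warning above suggests the generated HTML file contains bytes that are not valid UTF-8. One way to confirm that for a given file is sketched below in base R; `file_is_valid_utf8` is a hypothetical helper, and the two temp files are toy stand-ins for the real output file.

```r
# TRUE if the file's entire content is a valid UTF-8 byte sequence.
# `file_is_valid_utf8` is an illustrative helper, not a summarytools function.
file_is_valid_utf8 <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  validUTF8(rawToChar(bytes))
}

# Toy files: the same word written as Latin-1 bytes and as UTF-8 bytes.
latin1_file <- tempfile(fileext = ".html")
utf8_file   <- tempfile(fileext = ".html")
writeBin(as.raw(c(0x41, 0x74, 0xE9)), latin1_file)       # "Até" in Latin-1
writeBin(as.raw(c(0x41, 0x74, 0xC3, 0xA9)), utf8_file)   # "Até" in UTF-8

file_is_valid_utf8(latin1_file)
file_is_valid_utf8(utf8_file)
```

Pointing the helper at the path printed after "Output file written:" would show whether Latin-1 bytes ended up in a file that is later read back as UTF-8.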
@gustavobrp Thanks again for the follow-up. This is quite puzzling. Would there be a way for you to share the data (privately if necessary) after making sure everything is anonymized? Also, and sorry if I did ask you that before on another thread (I'm not sure), could you tell me what system you're running?
Sure, I can send a reduced sample of it. Can I send it to your email?
I'm running Windows 10, RStudio 1.2.5033 and R 3.6.2.
Great! Just make sure the problem is still there using that sample. Email is fine (dominic.comtois, gmail).
@gustavobrp Thanks for sending the data. Clearly we have yet another case of encoding headaches. I don't have a fix yet, but here are two alternatives to make it work in the meantime:
use the encoding = 'latin1' parameter
or
for (i in seq_along(datos))
  if (is.factor(datos[[i]]))
    Encoding(levels(datos[[i]])) <- "latin1"
Let me know if that works for you!
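The second workaround (marking factor levels as Latin-1) can be tried on a toy data frame to see what it changes. This is only a sketch: `datos` mirrors the name used in the post, while the column names and values are made up, and the Latin-1 bytes are built by hand so the example does not depend on the session's locale.

```r
# "Até" as Latin-1 bytes; rawToChar() leaves the encoding unmarked.
ate_latin1 <- rawToChar(as.raw(c(0x41, 0x74, 0xE9)))

# Toy stand-in for the real data (column names are made up).
faixa <- factor(c("a", "b"))
levels(faixa) <- c(ate_latin1, "Bacharelado")
datos <- data.frame(faixa = faixa, idade = c(18L, 26L))

Encoding(levels(datos$faixa))   # both levels start out "unknown"

# The loop from the post: mark factor levels as Latin-1.
for (i in seq_along(datos)) {
  if (is.factor(datos[[i]])) {
    Encoding(levels(datos[[i]])) <- "latin1"
  }
}

Encoding(levels(datos$faixa))   # the accented level is now marked "latin1"

# Once marked, the levels can be converted to UTF-8 safely.
utf8_levels <- enc2utf8(levels(datos$faixa))
validUTF8(utf8_levels)
```

Pure-ASCII levels stay "unknown" after the loop, which is expected: R only records an encoding mark on strings that contain non-ASCII bytes.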
Oh I see!
OK, so the first alternative worked! Both the group_by and the character strings.
Thanks a lot for the help!
I am trying to get a dfSummary() by a specific grouping variable in a dplyr chain. For example:
However, in the R console, this outputs the summary statistics for the ungrouped data. As far as I can tell from the documentation, this is the correct way of doing this. What am I doing wrong?