dcomtois / summarytools

R Package to Quickly and Neatly Summarize Data

Changing style of dfsummary/Export dfsummary object in another package #104

Closed gustavobrp closed 4 years ago

gustavobrp commented 4 years ago

Is it possible to export the results of dfSummary() to another package in order to get a different table style?

Specifically, I'm trying to use sjPlot::tab_df() to get a table that could be publication-ready for future articles. But when I create an object from dfSummary(), the resulting data.frame shows characters like "\." (I'm attaching a screenshot to show the result). So when I use tab_df() to export the results, the table is ruined.

summary_dataset <- dfSummary(dataset[,c(91,5:7,9:46)], plain.ascii = TRUE, na.col = F, max.distinct.values = 8, max.string.width = 100, split.tables = 50, round.digits = 2, varnumbers = F)

view(summary_dataset)

sjPlot::tab_df(summary_dataset)

Am I doing something wrong, or is there another way to do this? Or could this be implemented in the future?

Not sure if this would be the right place to ask that, but, anyway, really great package!

[Attached screenshots: example 01, example 02]

dcomtois commented 4 years ago

Hi gustavo,

Although freq() and descr() objects can easily be manipulated by other packages (especially when transformed using tb()), dfSummaries are trickier since they are multiline tables.
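As a rough illustration of the tb() route mentioned above, the sketch below converts a freq() result into a plain tibble and hands it to sjPlot::tab_df(). It assumes both packages are installed and uses the tobacco example data shipped with summarytools; the object names and the table title are made up for illustration.

```r
library(summarytools)

# freq() on a single variable returns a summarytools object
gender_freq <- freq(tobacco$gender)

# tb() converts it into a regular tibble, one row per category
gender_tbl <- tb(gender_freq)

# A plain data frame like this can be passed on to other table packages,
# e.g. sjPlot (only attempted here if sjPlot is installed)
if (requireNamespace("sjPlot", quietly = TRUE)) {
  gender_table <- sjPlot::tab_df(gender_tbl, title = "Gender (hypothetical title)")
  # print(gender_table) would display the styled table in the viewer
}
```

This only covers freq() and descr() results; it does not solve the multiline layout of dfSummary() objects discussed in this thread.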

One question.. When using view(), do you see the html rendered results in RStudio's Viewer (or Web Browser), or do you get the same as when using View(), i.e. a view of the raw table content? Some packages redefine view() to act like View(), so if you load such a package after loading summarytools, its view() is no longer active. A way around this is to load summarytools after the other package(s). Another way is to use print(dfSummary(...), method = "viewer"). Using method = "render" is also useful for generating the appropriate html code.
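A small sketch of how one might check for the masking described above and sidestep it. It assumes summarytools is installed and uses its tobacco example data; view_owner is just an illustrative variable name.

```r
library(summarytools)

# Which package does a bare view() currently resolve to?
# "summarytools" means the HTML-rendering version is the active one.
view_owner <- environmentName(environment(view))
print(view_owner)

dfs <- dfSummary(tobacco)

if (interactive()) {
  # A namespace-qualified call is unaffected by masking
  summarytools::view(dfs)
  # The same output can be requested through print()
  print(dfs, method = "viewer")
}
```

If view_owner names another package, either the qualified call or loading summarytools last should bring back the rendered HTML output.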

If you're already aware of this and really need to use sjPlot, one thing that you could try is using the CSS= argument and specifying summarytools' CSS file (your R library/summarytools/includes/stylesheets/summarytools.css). I can't say for sure what the results will look like, but it's probably worth a try.
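One way to find that stylesheet without typing the library path by hand is system.file(). This is only a sketch: the relative path is taken from the message above and may differ between package versions, and how the contents can be fed to sjPlot is left open (sjPlot's documentation appears to describe its CSS argument as a list of style definitions rather than a file path, so some adaptation is likely needed).

```r
# Look up the stylesheet inside the installed package; system.file()
# returns "" if nothing exists at this relative path
css_path <- system.file("includes", "stylesheets", "summarytools.css",
                        package = "summarytools")

if (nzchar(css_path)) {
  css_lines <- readLines(css_path, warn = FALSE)
  print(head(css_lines))  # peek at the first few rules
} else {
  message("Stylesheet not found; check the path for your package version")
}
```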

gustavobrp commented 4 years ago

Hello Dominic, thanks a lot for the reply.

Following your question, I used both view() from summarytools and View() from utils, and the results were the same.

I will try your suggestion about the CSS argument. I'm still learning to use these tools, so it may take a while. But my objective was mainly to get the style of table produced by sjPlot with the dfSummary - I saw another user here who posted results similar to what I wanted.

Again, thanks a lot for the help.

dcomtois commented 4 years ago

Hi @gustavobrp,

Indeed if view() behaves like View(), then definitely it is not summarytools' view() (see my previous message for getting around this).

I'm not aware of successful attempts at generating proper dfSummaries in pdf format, at least not when png graphs are included. The graphs cause all other content on their line to shift below them. If you succeed or see someone succeeding, please let me know! The best solution I found was to generate an html file and convert it to pdf; the best results are produced with wkhtmltopdf.
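The HTML-then-PDF route could look roughly like this. It is a sketch under a few assumptions: summarytools is installed, print() writes an HTML report when given a file name ending in .html, and the wkhtmltopdf executable is on the PATH. The file names are placeholders.

```r
library(summarytools)

dfs <- dfSummary(tobacco)

# Placeholder output locations
html_file <- file.path(tempdir(), "dfsummary.html")
pdf_file  <- file.path(tempdir(), "dfsummary.pdf")

# Write the HTML report to disk
print(dfs, file = html_file)

# Convert with wkhtmltopdf (basic usage: wkhtmltopdf input.html output.pdf)
wk <- Sys.which("wkhtmltopdf")
if (nzchar(wk)) {
  system2(wk, args = c(shQuote(html_file), shQuote(pdf_file)))
} else {
  message("wkhtmltopdf not found on PATH; skipping PDF conversion")
}
```

Whether the row alignment survives the conversion still has to be checked by eye in the resulting PDF.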

Another possible avenue would be to create a custom css file with class definitions that emulate sjPlot's css and use this css file and its classes when generating the dfSummary. But this would not be so easy.

gustavobrp commented 4 years ago

I see. I generated the dfSummary table in PDF using wkhtmltopdf. The results were satisfactory, but I had to exclude several columns. And like you said, with the graphs, the row lines were all wrongly placed. So I just converted the HTML into an image and placed it in the Word document that I'm writing.

By the way, I didn't see a way to exclude the column with the variables' raw names from the table and keep only their labels. Did I miss something, or would it be possible to implement this? Sometimes, when variables have labels, it is better to use those and exclude the raw column names.

Thanks.

dcomtois commented 4 years ago

Hmmm I'm surprised you still get misaligned content using wkhtmltopdf. I get clean results on my side. What if you use "print to pdf" from your browser? The only issue I get when doing this is some unwanted table borders inside the Freqs (% of Valid) cells (same problem when using Pandoc, that's why I recommend wkhtmltopdf).

For your other question, you can remove the variable name column this way:

dfs <- dfSummary(tobacco)
dfs$Variable <- NULL

gustavobrp commented 4 years ago

Hi Dominic, sorry for taking long to answer back!

Thanks for the last tip (it was in front of me!).

But anyway, something weird happened. I was going to try your suggestion to print to PDF, so I ran my code again to get a new dfSummary() table. Now the results take too long to print and RStudio crashes. What am I doing wrong? Over the last week I worked on my original dataset, basically adding new variables. These variables are numeric and result from scale transformations using packages like standardize.

I'm running the following code:

resumo_banco3 <- dfSummary(banco3[,c(92,5:7,10,12,22:25,27,38,43:47,90)], plain.ascii = TRUE, na.col = F, max.string.width = 100, split.tables = 50, round.digits = 2, varnumbers = F)

view(resumo_banco3)

When I press stop, before RStudio crashes, I get the following message:

Warning message: In readLines(f, warn = FALSE, encoding = "utf-8") : invalid input found on input connection 'C:\Users\GUSTAV~1\AppData\Local\Temp\RtmpAzOL6P\file21605ff41685.html'

It seems to be related to the new variables that I created. I don't know if this will help, since the code depends on my local data, but this is the whole code chunk that is causing the problem (once I cut it out, everything works fine).

banco3$MATEMATICA_dp <- (banco3$MATEMATICA - mean(banco3$MATEMATICA)) / sd(banco3$MATEMATICA)
banco3$LINGUAGENS_dp <- (banco3$LINGUAGENS - mean(banco3$LINGUAGENS)) / sd(banco3$LINGUAGENS)
banco3$CIENCIAS_DA_NATUREZA_dp <- (banco3$CIENCIAS_DA_NATUREZA - mean(banco3$CIENCIAS_DA_NATUREZA)) / sd(banco3$CIENCIAS_DA_NATUREZA)
banco3$CIENCIAS_HUMANAS_dp <- (banco3$CIENCIAS_HUMANAS - mean(banco3$CIENCIAS_HUMANAS)) / sd(banco3$CIENCIAS_HUMANAS)

banco3$MATEMATICA_dp <- formattable(banco3$MATEMATICA_dp, digits = 5)

banco3$LINGUAGENS_dp <- formattable(banco3$LINGUAGENS_dp, digits = 5)

banco3$CIENCIAS_DA_NATUREZA_dp <- formattable(banco3$CIENCIAS_DA_NATUREZA_dp, digits = 5)

banco3$CIENCIAS_HUMANAS_dp <- formattable(banco3$CIENCIAS_HUMANAS_dp, digits = 5)

banco3$DESEMPENHO <- banco3$MATEMATICA + banco3$LINGUAGENS + banco3$CIENCIAS_DA_NATUREZA + banco3$CIENCIAS_HUMANAS
banco3$DESEMPENHO_dp <- (banco3$DESEMPENHO - mean(banco3$DESEMPENHO)) / sd(banco3$DESEMPENHO)

banco3$`RSG.2012/1` <- as.numeric(gsub(",", ".", banco3$`RSG.2012/1`, fixed = TRUE))
banco3$`RSG.2012/2` <- as.numeric(gsub(",", ".", banco3$`RSG.2012/2`, fixed = TRUE))
banco3$`RSG.2013/1` <- as.numeric(gsub(",", ".", banco3$`RSG.2013/1`, fixed = TRUE))
banco3$`RSG.2013/2` <- as.numeric(gsub(",", ".", banco3$`RSG.2013/2`, fixed = TRUE))
banco3$`RSG.2014/1` <- as.numeric(gsub(",", ".", banco3$`RSG.2014/1`, fixed = TRUE))
banco3$`RSG.2014/2` <- as.numeric(gsub(",", ".", banco3$`RSG.2014/2`, fixed = TRUE))
banco3$`RSG.2015/1` <- as.numeric(gsub(",", ".", banco3$`RSG.2015/1`, fixed = TRUE))
banco3$`RSG.2015/2` <- as.numeric(gsub(",", ".", banco3$`RSG.2015/2`, fixed = TRUE))
banco3$`RSG.2016/1` <- as.numeric(gsub(",", ".", banco3$`RSG.2016/1`, fixed = TRUE))
banco3$`RSG.2016/2` <- as.numeric(gsub(",", ".", banco3$`RSG.2016/2`, fixed = TRUE))
banco3$`RSG.2017/1` <- as.numeric(gsub(",", ".", banco3$`RSG.2017/1`, fixed = TRUE))
banco3$`RSG.2017/2` <- as.numeric(gsub(",", ".", banco3$`RSG.2017/2`, fixed = TRUE))
banco3$`RSG.2018/1` <- as.numeric(gsub(",", ".", banco3$`RSG.2018/1`, fixed = TRUE))
banco3$`RSG.2018/2` <- as.numeric(gsub(",", ".", banco3$`RSG.2018/2`, fixed = TRUE))
banco3$`RSG.2019/1` <- as.numeric(gsub(",", ".", banco3$`RSG.2019/1`, fixed = TRUE))
banco3$`RSG.2019/2` <- as.numeric(gsub(",", ".", banco3$`RSG.2019/2`, fixed = TRUE))

banco3$RSG_total <- rowSums(banco3[,c(66:81)], na.rm = TRUE)
banco3$RSG_total_media <- rowMeans(banco3[,c(66:81)], na.rm = TRUE)
banco3$RSG_total_media_dp <- (banco3$RSG_total_media - mean(banco3$RSG_total_media)) / sd(banco3$RSG_total_media)

scale_this <- function(x){ (x - mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE) }

banco3 <- banco3 %>% group_by(NOMECURSO) %>% mutate(RSG_total_curso = scale_this(RSG_total))

rm(scale_this)

banco3$RSG_total_media_curso <- standardize::scale_by(RSG_total_media ~ NOMECURSO, banco3)
banco3$RSG_total_media_curso <- as.numeric(banco3$RSG_total_media_curso)

It looks like I entered a whirl of confusion!

Anyway, thanks!

EDIT

So I just noticed that the function group_by() was causing the problem. One of the scalings was produced using group_by(), and once I ungrouped the data, dfSummary() worked.
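For anyone hitting the same thing, the fix described above amounts to ending the pipeline with ungroup(). The sketch below uses a tiny made-up data frame in place of banco3 (only the column names are borrowed from the code above) and assumes dplyr is installed.

```r
library(dplyr)

# Toy data standing in for banco3 (values are made up)
banco_toy <- data.frame(
  NOMECURSO = c("A", "A", "A", "B", "B", "B"),
  RSG_total = c(10, 12, 14, 20, 25, 30)
)

# Same helper as in the original chunk
scale_this <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

banco_toy <- banco_toy %>%
  group_by(NOMECURSO) %>%
  mutate(RSG_total_curso = scale_this(RSG_total)) %>%
  ungroup()  # drop the grouping so later steps see an ordinary tibble

is_grouped_df(banco_toy)  # FALSE once ungroup() has been applied
```

Without the final ungroup(), the result stays a grouped tibble, which is the state in which the dfSummary() call above was reported to hang.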

I tried again following your suggestion, and selecting a few variables, and I got a better table. Thanks a lot!