benjann / violinplot

Stata module to draw violin plots
MIT License
11 stars 1 forks source link

28nov2022 (1.0.9) #2

Closed ericmelse closed 1 year ago

ericmelse commented 1 year ago

How beautiful indeed are the new options!

benjann commented 1 year ago

thanks

ericmelse commented 1 year ago

Dear Ben,

I am exploring your violinplot option rag for split (binary) violin plots as well as for unary plots. Although it works just fine there might be room for some improvement. First, I create a split (binary) violin plot using this code:

* Setup - single outcome 
sysuse nlsw88 , clear
gen ln_wage = ln(wage)
egen z_ln_wage= std(ln_wage)
* split violin
violinplot z_ln_wage, split(south, offset(.2)) box(type(l) stat(p20 p80)) ///
    mean(type(m) msize(*1.5) ms(oh)) rag(mc(%10) off(.6))  ///
    ytit("ln(wage/GNP deflator) (standardized)", s(*1.1) margin(l-2 r+2)) ///
    xsca(r(0 5)) name(sth, replace)

Which results in: Example_violinplot_nlsw88_z_ln_wage_south no issues with this plot but just to appreciate my next step.

Next, I use code to create a unary violin plot but together with a rag indicating the number of cases along the scale. Note that I have added graphreg(margin(b+9)) to have the same layout as with the previous plot (i.e. for a lack of the legend).

* violin
violinplot z_ln_wage, col("40 165 204") fill box(type(l) stat(p20 p80)) ///
    med(type(l)) mean(type(m) msize(*1.5) ms(oh)) rag(mc(%10) off(.6))  ///
    ytit("ln(wage/GNP deflator) (standardized)", s(*1.1) margin(l-2 r+2)) ///
    xsca(r(0 5)) vert name(s_left, replace) graphreg(margin(b+9))

Which results in: Example_violinplot_nlsw88_z_ln_wage_blue

Next is an alternative visualisation of the previous unary violin plot but as a 'half-violin' plot. I am forced to create a utility variable for this (count) that is coded 1 for all cases. Next I have to expand the data set by one case and code that to 0 in count as to be able to use it to split the violin plot. (Possibly I have not understood all options well enough to be able to do this without my trickery.) Obviously there is no scale data (z_ln_wage) in the single count case with the code 0. Thus, violinplot is now 'tricked' as to visualize only one half of the violin (basically the kernel density). The advantage is that now I can put the rag to the left of the violin, like:

* Setup - single outcome - half violinplot example
* Requires a little trick coding
gen count=1
set obs `=_N+1'
replace count=0 if count==.
violinplot z_ln_wage, sp(count) col("40 165 204") fill box(type(l) stat(p20 p80)) ///
    med(type(l)) mean(type(m) msize(*1.5) ms(oh)) rag(mc(%10) off(-.1))  ///
    ytit("ln(wage/GNP deflator) (standardized)", s(*1.1) margin(l-2 r+2)) ///
    xsca(r(0 5)) name(s_half, replace)

Which results in: Example_violinplot_nlsw88_z_ln_wage_blue_half

When you agree with my approach to create such a half-violin plot, maybe you can add an option (e.g. half) to reduce the effort of working with a facilitating binary variable by automating that in your ado file (should that now not be possible).

Maybe you wonder why I want to use such half-violin plot. I will explain that in my next post.

benjann commented 1 year ago

No need to trick split(), as there is left and right to print only half of the violin (modified/added options on last line):

violinplot z_ln_wage, col("40 165 204") fill box(type(l) stat(p20 p80)) ///
    med(type(l)) mean(type(m) msize(*1.5) ms(oh)) ///
    ytit("ln(wage/GNP deflator) (standardized)", s(*1.1) margin(l-2 r+2)) ///
    xsca(r(0 5)) vert name(s_half, replace) graphreg(margin(b+9)) ///
    rag(mc(%10) off(.1)) right nowhisk

s_half

ben

ericmelse commented 1 year ago

Very nice Ben!

ericmelse commented 1 year ago

Dear Ben,

Now I provide code for an example of how to use violinplot to create a combination of several plots of the same outcome variable as a right-violin plot of all cases together with four half-violin plots (in this example). The objective is to compare the unary distribution pattern with the binary distribution patterns. The main purpose of a regular violin plot is to compare differences in the distribution. However, using the option absolute enables absolute scaling of each group of a binary (coded) variable. This makes it possible to create multiple violin plots of the same outcome variable and compare the distributions while being aware of group size differences between these distributions. My code below uses the help file data example to this purpose (sysuse nlsw88). Note that I employ Stata's graph combine command to create the end result graph of the violin plots. But be aware that the graph combine command (in my experience) has its own rather strange way of dealing with the size and relative proportion between the various components that make up the graph (source graphs as well as the end result graph). It might be possible that I have not understood well enough how to go about coding the source graphs and what Stata's graph combine command does to them to create the result graph. Nevertheless, I am very much 'result' driven and satisfied that my result graph meets the following criteria:

The example code:

* Setup
sysuse nlsw88 , clear
* Standardize outcome variable 
gen ln_wage = ln(wage)
egen z_ln_wage= std(ln_wage)
* Apply prefix b_
foreach var in collgrad c_city south union {
    ren `var' b_`var'
    label values b_`var' lbl_`var'
    *label var b_`var' "by `var'"
}

* Unary or right-violin plot
graph drop _all
local xtit "all cases"
violinplot z_ln_wage, fill box(type(l) stat(p20 p80)) ///
    mean(type(m) msize(*1.7) ms(dh)) med(type(l)) ///
    c(gs8) bc(gs15) medc(gs16) graphr(c(gs15)) vert ///
    xsca(r(0 0)) ysca(noextend) ylab(-4(1)4, angle(none) glc(gs15)) ///
    ytit("ln(wage/GNP deflator) (standardized)", s(*1.1) m(r+2)) ///
    xtit("`xtit'", m(t+3 b+2.4)) rag(mc(%10) off(.1)) right nowhisk ///
    saving(unary, replace)

* Binary or half-violin plots * Absolute scaling by groups (demographic)
local xtit "graduation"
violinplot z_ln_wage, split(b_collgrad, offset(.2)) box(type(l) stat(p20 p80)) ///
    abs mean(type(m) msize(*1.7) ms(dh)) rag(mc(%10) off(.6)) ///
    c("255 134 53" "1 150 239") bc(gs15) medc(gs16) graphr(c(gs15)) xsca(r(0 0)) ///
    ytit("") ysca(noline) ylab(-4(1)4, nolab notick glc(gs15)) legend(off) ///
    xtit("`xtit'", m(t+5.5 b+0)) text(-4.32 -.59 "not so", s(*.9) placement(e)) ///
    text(-4.32 .59 "college", s(*.9) placement(w)) saving(binary1, replace)

local xtit "city"
violinplot z_ln_wage, split(b_c_city, offset(.2)) box(type(l) stat(p20 p80)) ///
    abs mean(type(m) msize(*1.7) ms(dh)) rag(mc(%10) off(.6)) ///
    c("teal*1" "teal*1.7") bc(gs15) medc(gs16) graphr(c(gs15)) xsca(r(0 0)) ///
    ytit("") ysca(noline) ylab(-4(1)4, nolab notick glc(gs15)) legend(off) ///
    xtit("`xtit'", m(t+5.5 b+0)) text(-4.32 -.59 "not so", s(*.9) placement(e)) ///
    text(-4.32 .59 "lives there", s(*.9) placement(w)) saving(binary2, replace)

local xtit "south"
violinplot z_ln_wage, split(b_south, offset(.2)) box(type(l) stat(p20 p80)) ///
    abs mean(type(m) msize(*1.7) ms(dh)) rag(mc(%10) off(.6)) ///
    c("blue*.7" "blue*.9") bc(gs15) medc(gs16) graphr(c(gs15)) xsca(r(0 0)) ///
    ytit("") ysca(noline) ylab(-4(1)4, nolab notick glc(gs15)) legend(off) ///
    xtit("`xtit'", m(t+5.5 b+0)) text(-4.32 -.59 "not so", s(*.9) placement(e)) ///
    text(-4.32 .59 "lives there", s(*.9) placement(w)) saving(binary3, replace)

local xtit "union"
violinplot z_ln_wage, split(b_union, offset(.2)) box(type(l) stat(p20 p80)) ///
    abs mean(type(m) msize(*1.7) ms(dh)) rag(mc(%10) off(.6)) ///
    c("maroon*.8" "maroon*1.5") bc(gs15) medc(gs16) graphr(c(gs15)) xsca(r(0 0)) ///
    ytit("") ysca(noline) ylab(-4(1)4, nolab notick glc(gs15)) legend(off) ///
    xtit("`xtit'", m(t+5.5 b+0)) text(-4.32 -.59 "not so", s(*.9) placement(e)) ///
    text(-4.32 .59 "worker", s(*.9) placement(w)) saving(binary4, replace)

* Finally combine the above into the result graph
graph drop _all
graph combine "unary.gph" "binary1.gph" "binary2.gph" "binary3.gph" "binary4.gph" , col(5) ycommon ///
note("The mean is indicated by the {fontface Verdana:{&loz}} symbol in the plots. IQR lines are at the 20{sup:th}, 50{sup:th} and 80{sup:th} percentile. Binary groups are absolute. ", span s(*.8) m(l+1 t-0 b+2)) ///
graphr(c(gs15) ilc(none) m(l-2 r-2 t-2 b-4)) imargin(small)

Which results in: Example_violinplot_nlsw88_z_ln_wage_combined_abs

One last note, about all of the above I have been corresponding earlier this year with Fernando Rios-Avila while using his package joy_plot with which a very similar result can be achieved (but admittedly, that requires a little more code).