benjann / violinplot

Stata module to draw violin plots

Drawing of the distribution curve #3

Open ericmelse opened 1 year ago

ericmelse commented 1 year ago

Dear Ben, using violinplot I observe an unwanted extension of the distribution curve beyond the minimum and the maximum of the measured values. It is not possible to 'chop' that part off the distribution line unless you (as a user) delete the data points used to draw the line beyond the real data values, which is most likely not something you want me to do. So, I will try to explain my point with the following example, for which I use my own data: the annual salary of 30,000 working professionals in The Netherlands. For this example the data were log transformed, centered, and bounded between 0 and 1. GitHub will not allow me to upload a dta file, so I will email that to you.

The code to draw the distribution using tw kdensity is:

tw kdensity bd_zs_lnMsalYer21 , n(200) ysc(off) xsc(noex) ylab(none) ytit("") xtit("") legend(off) name(kden, replace)

which results in: [figure: Distribution_line_manual_kden]

The code to draw the distribution using violinplot is:

violinplot bd_zs_lnMsalYer21 , n(200) left nowhisk nobox nomed xtit("") ysc(off) xsc(noex) name(violin, replace)

which results in: [figure: Distribution_line_violinplot]

My 'problem' is that the distribution curve now extends beyond the minimum and the maximum of the measured values. I suppose that this is due to the way the estimation works (but notice the alternative that joy_plot offers below).

The code to draw the distribution using joy_plot is:

joy_plot bd_zs_lnMsalYer21 , ysc(off) xsc(noex) fc(none) ytit("") xtit("") legend(off) cap("") name(joy, replace)

which results in: [figure: Distribution_line_joy_plot]

Fernando's joy_plot draws a 'closed' object in which the extremes are gradually thinned. I suppose that this signals, in a way, that there are fewer and fewer data points in the tails of the distribution, comparable to your 'rags'. I am not asking you to do something similar, although it might work very well when drawing split-violin plots, as the upper and lower end of the distribution would then be 'sharp'. You would have to read his ado file to see how he implemented this (certainly you can write to him too). But what I would most want for now is that violinplot does not draw a distribution line beyond the minimum and the maximum of the measured values.

benjann commented 1 year ago

Hi Eric, try option tight. It is quite standard to add some padding in density estimation and this is also how violinplot behaves by default. However, you can specify tight to enforce an evaluation range from the observed min to max.
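
A minimal sketch of the contrast (the variable name y is an assumption):

* default: padded evaluation grid that extends beyond the observed min and max
violinplot y
* tight: evaluation grid restricted to the observed min and max
violinplot y, tight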

ericmelse commented 1 year ago

Dear Ben, after installing your latest update (version 1.1.1), I added tight to my code, like:

use "NSO21_Example_Salary.dta" , clear
violinplot bd_zs_lnMsalYer21 , n(200) tight left rag(unique mc(%20) off(-.03)) nowhisk nobox nomed ///
   ysc(off) xsc(noex) xtit("") name(violin, replace)

but an error message is returned:

option tight not allowed
r(198);

Possibly something needs correction.

benjann commented 1 year ago

Hi Eric, I forgot to mention: you need to update dstat and moremata. That is, type

ssc install dstat, replace
ssc install moremata, replace

violinplot calls dstat, which calls mm_density() from moremata. The default in mm_density() is to use a padded evaluation grid, but in the updated version there is now also an option to use a tight grid (using a tight grid was always possible, but the limiting values had to be provided manually; the new option automates this). I then added a tight option to dstat and finally to violinplot.
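
To verify that the updated versions are found on the ado-path, one can use standard Stata/Mata commands, for example:

which dstat
which violinplot
mata: mata which mm_density()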

benjann commented 1 year ago

To be precise, option tight was available in violinplot already since version 1.0.4 (25oct2022), but in version 1.0.7 (20nov2022) I moved this functionality into mm_density(). At this point I also added ltight and rtight to use a tight grid only on one side of the distribution.
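
A minimal sketch (the variable name y is an assumption, and it is assumed that ltight/rtight are passed through in the same way as tight):

* tight grid on the lower side only; padded grid on the upper side
violinplot y, ltight
* tight grid on the upper side only
violinplot y, rtight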

ericmelse commented 1 year ago

You are right Ben, the system I was using did not have the latest versions of dstat and moremata. I should have checked that. So, now indeed I get the desired result: [figure: Rag_violin_new] I am very happy with this option. Then, about your note that 'it is quite standard to add some padding in density estimation': well, yes, I am aware that this is usual practice, and possibly sensible when dealing with measurements on a continuous ratio scale that supposedly has no limits. Or is the notion that we should understand the density curve as some formulation that reaches beyond the values present in the set? My understanding is that the kernel density curve sums to a (dimensionless) total of 1. If so, my view is that such a curve should be visualized in between the minimum and maximum of the measurement. Certainly that would make sense (to me) when we have a measurement that is bounded between 0 and 1 (or any other values). Hence my inclination to use a 'tight' distribution curve when drawing a violin plot.

benjann commented 1 year ago

Exactly, density curves are constructed in such a way that the area under the curve integrates to 1. Technically, this is no longer true if you truncate the curve at the observed min and max, unless you apply boundary correction; see options ll() and ul() in pdfopts(). Sometimes variables have known bounds, e.g. "number of hours worked" cannot be negative; in such cases it makes sense to provide this information to the density estimation to avoid an artificial drop of the curve in the region of the boundary. The purpose of the padding is to allow the density curve to go to zero (at least approximately) below the min and above the max, that is, to display the full curve. In essence, the idea is that your data is a sample from a population and that you want to get an idea of how the population is distributed (and the data in the population may go beyond the observed min and max). However, a violin plot can also be used to illustrate the features of the data, not necessarily the population, and in this case I agree it makes sense to truncate the curve at the observed min and max, so that one does not get a wrong impression and sees where the actual min and max are.
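
As a hedged sketch of the boundary-correction case mentioned above (the variable hours is hypothetical; ll() is passed through pdfopts()):

* hours worked cannot be negative: supply the known lower bound so the
* curve is not artificially pulled down near zero
violinplot hours, pdfopts(ll(0))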

ericmelse commented 1 year ago

Dear Ben, thank you for your helpful further explanation. I have read the relevant sections in the dstat and mm_density() help files. I try to understand what I can, but I apologize beforehand if that is insufficient. I compared the above example using pdfopts(ll(0) ul(1)) and that does not make a difference in the visualisation (maybe in related statistics; I did not explore that). So, next, I try my hand at another example of a bounded scale. In this case it is about my students, who might earn between 0 and 60 ects during their first year at our university of applied sciences. This is part of a more involved study of their demographic and psycho-social profiles in relation to educational outcome as measured by ects. With this data I do get a difference. I show you my code so you can compare the use of options. First, the violinplot code using tight is:

* tight
local constraint "pl_ptnaherk>=0 & degreeVts18!=. & c_cohort==7"
violinplot pl_ptnaherk if `constraint' , n(200) pdfopts(k(e)) left tight ///
nowhisk nobox nomed xsc(noex) xlabel(0(5)60) xmtick(1(1)59) xtit("ects", m(t+2 b-2)) name(violinplot1, replace)

which results in: [figure: HVA_FBE_18_ects_violin_tight] I tried to get the density scale visualized on the y-axis as well, but somehow I failed to set the right option for that to happen. Maybe you can tell me how to do that. Because dstat is used to do the calculations, I replicate using this code:

local constraint "pl_ptnaherk>=0 & degreeVts18!=. & c_cohort==7"
dstat density pl_ptnaherk if `constraint' , n(200) k(e) tight ///
gr(p1(ysc(noex) ylab(, labs(*1) angle(none))) xsc(noex) xmt(1(1)59) ///
xlabel(0(5)60) xtit("ects", m(t+2 b-2)) name(dstat1, replace))

which results in: [figure: HVA_FBE_18_ects_dstat_tight] Next, the same, but using ll(0) ul(60) instead of tight:

local constraint "pl_ptnaherk>=0 & degreeVts18!=. & c_cohort==7"
dstat density pl_ptnaherk if `constraint' , n(200) k(e) ll(0) ul(60) ///
gr(p1(ysc(noex) ylab(, labs(*1) angle(none))) xsc(noex) xmt(1(1)59) ///
xlabel(0(5)60) xtit("ects", m(t+2 b-2)) name(dstat2, replace))

which results in: [figure: HVA_FBE_18_ects_dstat_llul] I suppose that this is a more trustworthy visualisation, because at either end of the density curve we see a straight line pointing to the high density value at 0 and 60 (which is indeed the case, as these are the drop-outs and the most successful students). I think this underpins your point about avoiding 'an artificial drop of the curve in the region of the boundary'.

Then, before all of this I have used Stata's kdensity, like:

local constraint "pl_ptnaherk>=0 & degreeVts18!=. & c_cohort==7"
tw kdensity pl_ptnaherk if `constraint' , n(200) k(ep) ///
xsc(noex) ysc(noex) ylab(, labs(*1) angle(none)) ///
ytit("") xtit("ects", m(t+2 b-2)) xlab(0(5)60 , labsize(*1.1) tlwidth(*1)) ///
xmtick(1(1)59) name(kden, replace)

which results in: [figure: HVA_FBE_18_ects_kdensity] I suppose that I now have to conclude that this visualization is horribly wrong (I have been using this for years!). What puzzles me somewhat is that in all of the above I set the kernel to Epanechnikov, yet the results are so different even without using the ll() and ul() options. Can you explain to me why this is? And maybe how dstat could replicate Stata's kdensity result?

benjann commented 1 year ago

Hi Eric, this seems to be a difficult case for density estimation. Kernel density estimation works well for distributions that are smooth, at least to a certain degree. Strong heaping (in your case at 0 and 60) creates discontinuities that are difficult to capture because, fundamentally, kernel density estimation rests on the assumption that there are no discontinuities. So, in data such as the ones above it is always good to also take a look at a histogram with a sufficient number of bins.
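
As an illustrative companion check (variable name taken from the examples above; the bin width is an assumption):

* a histogram with one-ects bins reveals the heaping at 0 and 60
* that a smoothed density curve can hide
histogram pl_ptnaherk, width(1) start(0) freq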

The reason why kdensity and violinplot or dstat provide different estimates is that kdensity uses a different method to determine the bandwidth (i.e. the smoothing parameter). The method used by kdensity is called the "optimal of Silverman"; it is a rule-of-thumb method that works well only for rather smooth distributions. violinplot and dstat, by default, use a 2nd-generation bandwidth selector that is supposed to be more adaptive (i.e. a small bandwidth if the distribution is jumpy, a large bandwidth if it is smooth). You can specify bwidth(silverman) to use the optimal of Silverman instead. This will lead to a similar result as kdensity. (The bandwidth computed by kdensity will be about 1% larger than the Silverman bandwidth computed by dstat if the kernel is epanechnikov. In dstat the bandwidth is adjusted to the canonical bandwidth of the chosen kernel, as it should be; kdensity, however, always uses the bandwidth formula appropriate for a Gaussian kernel, irrespective of the choice of the kernel. The mentioned 1% difference arises because the canonical bandwidth of the Gaussian kernel is about 1% larger than the canonical bandwidth of the Epanechnikov kernel.)
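
For instance, a sketch that should come close to kdensity's result using dstat (variable and option values taken from the examples above; the roughly 1% bandwidth difference described above will remain):

* Silverman rule-of-thumb bandwidth with the epanechnikov kernel,
* mimicking kdensity's default bandwidth selection
dstat density pl_ptnaherk, n(200) kernel(epanechnikov) bwidth(silverman) graph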

I compared the above example using pdfopts(ll(0) ul(1)) and that does not make a difference in the visualisation

Since in this example the density curve is already practically zero at the min and at the max, boundary correction will be too small to be visible.

I tried to get the density scale visualized on the y-axis as well, but somehow I failed to set the right option for that to happen. Maybe you can tell me how to do that.

Short answer: specify dscale(.).

Long answer: violinplot displays a categorical axis by default because this is what one usually would want. It also rescales the density estimates so that all estimates are within 0 and 0.5 (if multiple violins are plotted, the default gap between violins is 1 unit; the rescaling makes sure that the violins do not overlap), so printing the scale on the axis would be misleading. You can change how rescaling is done using option dscale(). In particular, you can specify dscale(.) to omit rescaling and use the original values of the density estimate. In this case, as long as only one violin (or only one set of overlaid violins) is plotted, the scale will be shown on the axis because it is now meaningful.
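
A minimal sketch (assuming a single variable named y):

* omit rescaling so that the original density values appear on the axis;
* meaningful as long as only one violin (or one set of overlaid violins) is plotted
violinplot y, dscale(.)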

ericmelse commented 1 year ago

Dear Ben, my douze points go to the '2nd generation bandwidth selector'! Your further explanation about the differences in methodology between kdensity and dstat is most helpful. The dscale(.) option works just fine; my result is: [figure: Rag_violin_new] I take note of your cautionary remark in the long answer.

Next, I have to exercise further with my ects case, reconciling the result of dstat/violinplot with that from your (updated) reldist (graph). So, please keep this issue open, as I want to share my upcoming results with you (with some questions).

ericmelse commented 1 year ago

Dear Ben, first, I can confirm that it is entirely possible to arrive at more or less the same distributional pattern as Stata's kdensity, using the option pdfopts(bwidth(silverman)), like:

local constraint "pl_ptnaherk>=0 & degreeVts18!=. & c_cohort==7"
violinplot pl_ptnaherk if `constraint' , n(200) pdfopts(bwidth(silverman)) left tight ///
nowhisk nobox nomed dscale(.) ysc(noex) ylabel(,angle(none)) ///
xsc(noex) xlabel(0(5)60) xmtick(1(1)59) xtit("ects", m(t+2 b-2))

which results in: [figure: HVA_FBE_18_ects_violin_silverman_kdensity] So, that is comforting to know, and at the same time a 'big' warning to no longer use that method when the case at hand involves a bounded scale and/or heaps of data at either or both ends of the scale.

I suppose the following is a summary of my thoughts about working with violinplot/dstat and reldist. The distribution of ects, as you noted, is particular for its extremes at either end of the bounded scale [0,60]. Notable also is that 0 ects has substantive meaning, as it concerns students who enrolled but either did not take any exam or failed them all and subsequently did not continue (drop-outs). Students who have 50 or more ects are allowed to continue with their second year of study (hence the dashed line at 50 ects in the following figures). From a managerial perspective, of interest are students who have between 40 and 49 ects, because we hypothesize that they might achieve a better result once some form of educational 'treatment' is applied (students are forced to leave when they have fewer than 50 ects). I will not venture here into what educational 'treatment' could be, but just explain that the distributional pattern between the short-dashed line at 40 ects and the dashed line at 50 ects is of interest in the following figures.

Most studies would analyse mean or median differences between groups (or cohorts). But my focus is also on the distributional profiles and how these compare between groups or cohorts. That is why the violinplot visualization is so important (for me). Having said that, the dstat and reldist visualizations also show important aspects of the distribution of the measurement between the bounds of the scale [0,60], or part of it [40,49]. My objective is, besides drawing state-of-the-art statistical figures, to confirm my first (visual) interpretation of the data, using violinplot/dstat, as well as to develop working hypotheses in the process. I suppose it is this last point that drives me to write here about this case, because there are subtle differences between the dstat and the reldist cdf and pdf plots. So, employing your recommendations, this is a dstat visualization of the ects scores of two cohorts (2018 and 2017): [figure: HVA_FBE_18vs17_ects_dstat_llul]

My objective is to detect whether the performance in 2018 is notably better than in 2017 (possibly statistically substantive too). Indeed, this appears to be the case, as from 45 ects onwards the 2018 distribution line wiggles above the 2017 line, except between [59,60] ects. Also note that below 8 ects the 2018 distribution line wiggles below the 2017 line, which also implies an improved performance at the lower end of the scale, albeit a rather small one. Next, we inspect the relative cumulative distribution (cdf) of the same data: [figure: HVA_FBE_18vs17_ects_reldist_cdf]

The cdf plot appears to confirm my interpretation of the distributional differences between cohorts in the dstat plot. The cdf line, visualizing the cumulative probability of the differences between the 2018 and 2017 cohorts, crosses the equilibrium diagonal at about 50 ects. For values in the [0,50] ects range it lies below the diagonal, and for values in the [50,60] ects range it lies above the diagonal. This is good news for either part of the scale, as both results are desirable. Relatively, comparing the 2018 cohort with the 2017 cohort, fewer students score below 50 ects and more students score above 50 ects. Note that between 40 and 49 ects it appears that in 2018 fewer students score at the lower end, but this is statistically not a substantive effect (indicated by the confidence interval, which covers the equilibrium diagonal over this value range).

Next, I proceed with the plot of the density function of the relative distribution: [figure: HVA_FBE_18vs17_ects_reldist_pdf]

I suppose that the pdf plot forces me to rethink my conclusions based on the cdf plot. We do not have a 'clear cut' distributional difference below and above 50 ects here. Although between [50,58] ects the relative distribution is advantageous for the 2018 cohort, between [58,60] this appears not to be the case. My first thought was that I might have done something wrong, or that maybe there is some error in your code. But that is not the case, which becomes clear after inspecting the data (or the pdf result table):

SomVanptna |       c18cohort
      herk |      2017       2018 |     Total
-----------+----------------------+----------
        50 |       104        100 |       204 
           |      6.39       6.31 |      6.35 
-----------+----------------------+----------
        51 |        71         95 |       166 
           |      4.36       5.99 |      5.17 
-----------+----------------------+----------
        52 |       135        140 |       275 
           |      8.30       8.83 |      8.56 
-----------+----------------------+----------
        53 |        30         41 |        71 
           |      1.84       2.59 |      2.21 
-----------+----------------------+----------
        54 |        16         10 |        26 
           |      0.98       0.63 |      0.81 
-----------+----------------------+----------
        55 |       154        147 |       301 
           |      9.47       9.27 |      9.37 
-----------+----------------------+----------
        56 |       212        230 |       442 
           |     13.03      14.51 |     13.76 
-----------+----------------------+----------
        57 |        25         36 |        61 
           |      1.54       2.27 |      1.90 
-----------+----------------------+----------
        58 |        17          0 |        17 
           |      1.04       0.00 |      0.53 
-----------+----------------------+----------
        59 |         1          0 |         1 
           |      0.06       0.00 |      0.03 
-----------+----------------------+----------
        60 |       862        786 |     1,648 
           |     52.98      49.59 |     51.31 
-----------+----------------------+----------
     Total |     1,627      1,585 |     3,212 
           |    100.00     100.00 |    100.00 

At 58 and 59 ects there is no data in 2018, and at 60 ects the result is lower than in 2017 (nominally as well as relatively). Thus, the pdf plot is clearly accurate in its visualization. So, although the 'positive impression' that the students perform better in 2018, which I get from the violinplot/dstat plot as well as from the cdf plot, might not be wrong, the pdf plot does signal that we have to be cautious, as in particular segments of the scale students did underperform. My final conclusion is that the 'distributional analyst' had better inspect all of these plots in order to question or corroborate results.

[I do have a request about the cdf plot but I will post about it in your reldist Github page.]