briandk / granovaGG

Bob Pruzek and Jim Helmreich's implementation of Elemental Graphics for Analysis of Variance

Granovagg1w jitter interferes with overplotting logic #135

Closed briandk closed 3 years ago

briandk commented 13 years ago

The default value for jitter (jj) in granova.1w was 1:

granova.1w(data, group = NULL, dg = 2, h.rng = 1.25, v.rng = 0.2, 
   box = FALSE, jj = 1, kx = 1, px = 1, size.line = -2.5, 
   top.dot = 0.15, trmean = FALSE, resid = FALSE, dosqrs = TRUE, 
   ident = FALSE, pt.lab = NULL, xlab = NULL, ylab = NULL, 
   main = NULL, ...)

For granovagg.1w, we introduced overplotting logic that would recalibrate jitter if groups were too close together. The idea was to protect the user: if groups were too close together, the jittering amount would be very conservative so adjacently plotted groups could still be individually resolved. But, introducing that logic involved a key departure from the classic granova.1w API. With the new overplotting logic:

If we want to change the default value of jj to 1, I fear we'll have to go one of three routes:

  1. Remove the jittering overplotting logic entirely.
  2. Keep the overplotting logic, but conditionally ignore any user-supplied jj values if the group means are in danger of overplotting.
  3. Keep the jittering overplotting logic, but add a new parameter to granovagg.1w called something like safe.jittering that would turn the overplotting logic on or off, thus allowing the user to bypass overplotting protection.

Option 1 is simple and possible. Combined with our current logic for marking likely overplotted groups in red, it lets the user simply tweak and shrink jittering until they can safely resolve groups. But it's risky if users don't recognize when their data is overplotted.

Option 2 strikes me as undesirable: it gives users the illusion of having control over jittering when really we can be like overprotective parents and override their supplied value if we think it's too dangerous.

Option 3 is kludgy, inelegant, and adds an additional burden to the user for remembering parameters.
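To make the difference between options 2 and 3 concrete, here is a rough sketch of what a safe.jittering switch might look like. Everything in it is hypothetical: choose_jitter, min.gap, and safe.jj are names and constants made up for illustration, and this is not how granovagg.1w currently computes its jitter.

```r
# Hypothetical sketch of option 3; none of these names exist in granovaGG.
# min.gap and safe.jj are assumed tuning constants, not package values.
choose_jitter <- function(jj, group.means, safe.jittering = TRUE,
                          min.gap = 0.1, safe.jj = 0.1) {
  # Groups count as crowded when any two adjacent means are closer than min.gap.
  crowded <- length(group.means) > 1 &&
    min(diff(sort(group.means))) < min.gap
  # Cap the user's jj only when protection is on and the groups are crowded.
  if (safe.jittering && crowded) min(jj, safe.jj) else jj
}

choose_jitter(1, c(1, 1.05, 3))                          # crowded: jj is capped at safe.jj
choose_jitter(1, c(1, 1.05, 3), safe.jittering = FALSE)  # user opts out: jj stays 1
```

Option 2 would amount to the same function with safe.jittering fixed at TRUE, which is exactly why it takes control away from the user.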

I need both @rmpruzek and @wildoane to weigh in on this issue before I can go forward with any changes.

rmpruzek commented 12 years ago

Overplotting is clearly something we want to avoid, but there are several ways to deal with this, and after a good deal of thought, I want to recommend the following (and this does remove the overplotting logic as per Brian's narrative above): Suppose a subset of k means, and hence the effects for the corresponding groups [m < j > - grandmean], are 'sufficiently close' to one another that the case-data points run into one another. (Okay to be 'liberal' about choosing k; better to overestimate than underestimate.)

Now proceed to alter the .1w graphic (so it will lack complete fidelity wrt the initial means), as follows. Compute the median (md) of these k means. Now compute 'pseudo effects' by subtracting the median md from each of the k means; these will be of the form m < j > - md. Multiply all pseudo effects by a constant W, say, where W exceeds unity (or one) by a positive constant w, and w is a function of the range R of all means (e.g., w = R/25 seems reasonable, which leads to W = 1 + R/25). Finally, add the midrange mr to each of these revised pseudo effects; these will be of the form W*(m < j > - md) + md. That's it.

These k values will serve as replacements for the original group means. They will necessarily be more widely separated from one another than the original k means simply because W exceeds unity. The average of all 'means' will not be changed (more than trivially) from the original grand mean, and the printed table should probably just ignore these adjusted means (except in trials?) so it will only be the graphic that has been altered. Again, perfect fidelity w/ the original data will of course have been lost in the graphic, but the gain will more than compensate for the rather minor changes in the data-to-be-plotted.
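A minimal R sketch of the adjustment described above, for anyone who wants to try it on real data. The function name spread_means and its arguments are hypothetical (nothing like it exists in granova or granovaGG), the example means are made up, and the last step adds back the median md so that the result matches the stated form W*(m < j > - md) + md.

```r
# Hypothetical helper, not part of granova or granovaGG.
# means: all group means; idx: indices of the k means judged too close.
spread_means <- function(means, idx) {
  R  <- diff(range(means))          # range R of all group means
  W  <- 1 + R / 25                  # W = 1 + w, with w = R/25 as suggested
  md <- median(means[idx])          # median of the k crowded means
  adjusted <- means
  # Stretch the pseudo effects by W, then re-center them on the median.
  adjusted[idx] <- W * (means[idx] - md) + md
  adjusted
}

means <- c(2, 10, 10.5, 11, 27)     # made-up means; R = 25, so W = 2
spread_means(means, idx = 2:4)      # means 2-4 become 9.5, 10.5, 11.5
```

Only the values used for plotting would change; the printed table could keep reporting the original means.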

rmpruzek commented 12 years ago

NB: I had written m sub j using < and > to index j, but these have been lost here. I shall put the original in a Word document, which I'll email or post (? where on github), so that the details are not lost. b

NB2: My edits, now w/ spacing, seem to have fixed the problems. Let me know what is unclear.

briandk commented 12 years ago

@rmpruzek - Based on my understanding of your post, I'm not convinced your method is general enough to be safe.

Consider an example where we plot some group of means, but we're interested in means 1 - 4. Your method would identify a subset of k means (viz. 2 - 4), then apply pseudo-effects to them and visually alter their position. The potential problem I see is that in introducing pseudo-effects, you might also produce a situation where a new overplotting results from adjusting the old data. In the image below, adding pseudo-effects to means 2 - 4 actually results in means 1 and 2 now being overplotted:

[Image: how new overplotting can result from applying pseudo-effects]

So, in sum, I'm not convinced that your proposal "solves" the overplotting problem unless you can convince me that it will never introduce new overplotting.
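A made-up numeric case shows the same effect. The five means and the 0.12 closeness threshold below are illustrative assumptions, not values from granovaGG; W = 1 + R/25 is taken from the proposal above.

```r
means <- c(10.4, 10.6, 10.7, 10.8, 35.4)  # made-up group means
k     <- 2:4                               # means 2-4 are judged too close
W     <- 1 + diff(range(means)) / 25       # R is 25, so W is about 2
md    <- median(means[k])

adjusted    <- means
adjusted[k] <- W * (means[k] - md) + md    # spread means 2-4 about their median

gaps_before <- diff(sort(means))           # roughly 0.2, 0.1, 0.1, 24.6
gaps_after  <- diff(sort(adjusted))        # roughly 0.1, 0.2, 0.2, 24.5
min_gap     <- 0.12                        # assumed closeness threshold
which(gaps_before < min_gap)               # pairs (2,3) and (3,4) are tight
which(gaps_after  < min_gap)               # now pair (1,2) is tight instead
```

Spreading means 2 - 4 pushes mean 2 down toward mean 1, so the crowding moves rather than disappears.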

rmpruzek commented 12 years ago

This merely says that there may be situations where the method might have to be applied iteratively. In this case, k = 2 for pair(1,2). Apply the method again, and as long as the W is reasonably chosen, all should be well after the second cycle has been completed. As to a general 'proof' that the method will never, can never, fail, let's remember the old dictum: the best (or perfect) can be the enemy of the good. I do not seek universal perfection, and recommend we get on w/ our lives after making a reasonable try for a good fix. (And I do not seek anyone's approval, nor should you mine.) bob
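For trials, the iterative version might look something like the sketch below. It is only a sketch under stated assumptions: spread_until_clear, min_gap, and max_passes are hypothetical names, "too close" is taken to mean adjacent sorted means closer than min_gap, and R is recomputed from the current means on each pass.

```r
# Hypothetical sketch, not granovaGG code: repeat the spreading step until no
# adjacent pair of sorted means is closer than min_gap, or max_passes is hit.
spread_until_clear <- function(means, min_gap, max_passes = 10) {
  passes <- 0
  repeat {
    ord   <- order(means)
    tight <- which(diff(means[ord]) < min_gap)  # i: sorted means i and i+1 are too close
    if (length(tight) == 0 || passes >= max_passes) break
    # Take the first run of consecutive tight gaps as the subset of k means.
    last <- tight[1]
    while ((last + 1) %in% tight) last <- last + 1
    idx <- ord[tight[1]:(last + 1)]
    W   <- 1 + diff(range(means)) / 25          # W = 1 + R/25, R from the current means
    md  <- median(means[idx])
    means[idx] <- W * (means[idx] - md) + md    # spread the subset about its median
    passes <- passes + 1
  }
  list(means = means, passes = passes, converged = length(tight) == 0)
}

res <- spread_until_clear(c(10.4, 10.6, 10.7, 10.8, 35.4), min_gap = 0.12)
res$passes      # 2: first means 2-4 are spread, then the pair (1, 2)
```

The max_passes cap matters because means that are exactly equal have zero pseudo effects and are never separated by this step, so the loop needs a way to stop.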

rmpruzek commented 12 years ago

Correcting an error in my post of 4 days ago: My sentence (near the middle) "Finally, add the midrange mr to each of these revised pseudo effects; these will be of the form W*(m < j > - md) + md." should say, "Finally, add the median md to each of these revised pseudo effects; these will be of the form W*(m < j > - md) + md." There is another issue here too, tho' this one really should be discussed synchronously (ichat?): When the no. of cases (N) becomes 'quite large' we might transition to something along the lines of boxplots (w/ jittered interior points), or violin plots for the respective groups. That the issue involves various questions of judgment, programming complexity, etc., plus the fact that it is so basic, is why it deserves discussion among several of us. We might also look at some real-data trials w/ the k (subset of) means idea implemented if you, Brian, would be willing to write the code for this to facilitate trials. b

briandk commented 12 years ago

Refs #134