improve top 10 domain range

sgratzl commented 4 years ago

@RoniRos

The domain range for the color map is kind of okay because the color legend says e.g. "964.0+". But using it for the "Top 10" is a bit problematic because the user's focus is on the highest values, but the scale used does not show the highest values. We need a better long-term solution. In the meantime, is there a visual way to highlight the clipping/clamping? Maybe a "lightening" icon?

sgratzl commented 4 years ago

@RoniRos can you clarify the 'Maybe a "lightening" icon?' part? I don't understand it. Where should it be shown and what should it say?

RoniRos commented 4 years ago

Currently, the clipped values render as a straight horizontal line across the top of the clipped range. This suggests that something is wrong, but doesn't highlight the fact that the actual values are often much higher than what the chart can display.

I am not sure what would be a good visual cue that significant clipping has occurred. I was suggesting a "break" icon (looks a little like a "Z") that is often used in histograms for that purpose (e.g. in charts in The Economist), and I mistakenly called it a "lightening" icon, sorry.

sgratzl commented 4 years ago

Re making the fact of the clipping more noticeable, perhaps you could color the clipped signal a different color? Red would have been appropriate, but it is also used for the vertical bar. Another possibility is to indicate clipped values with tiny vertical bars at the top of the chart that are solid on their lower half but dotted on their top half, to indicate that they are continuing upwards.
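To color or dash only the clipped parts, the renderer first needs to know where they are. A minimal sketch of that step, assuming each history is a plain list of numbers and `y_max` is the top of the visible scale (the function name `clipped_runs` is made up for illustration, not something in the codebase):

```python
from typing import List, Tuple


def clipped_runs(values: List[float], y_max: float) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of consecutive values above y_max."""
    runs: List[Tuple[int, int]] = []
    start = None  # index where the current clipped run began, if any
    for i, v in enumerate(values):
        if v > y_max:
            if start is None:
                start = i
        elif start is not None:
            # the run ended on the previous index
            runs.append((start, i - 1))
            start = None
    if start is not None:
        # the series ends while still clipped
        runs.append((start, len(values) - 1))
    return runs
```

Each returned pair could then be drawn in a different color or with a dashed stroke, while the rest of the line stays as it is.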

sgratzl commented 4 years ago

The API would also provide us with the max_value for each signal, so we could also just use the maximum for the whole signal: https://api.covidcast.cmu.edu/epidata/api.php?source=covidcast_meta&cached=true

RoniRos commented 4 years ago

This may be too late for the current release, but why not adjust the visible Y scale to be the maximum value present in the data being displayed (across all 10+ locations)?

sgratzl commented 4 years ago

In most cases that is similar to using the overall max value, since by default the table is sorted by value in decreasing order. Yesterday I tested it quickly, but got a lot of white space because of some outliers in the data.

Regarding the maximum of the current visible subset: I can see the following drawbacks:

The user has to wait until all the plot data is loaded before the first plot can be rendered, since we need all the data to compute the maximum value. At the moment the plots load independently.
When the user sorts by a different column or shows 10 more locations, the maximum value will most likely change, which might be confusing to the user.

for example, when using the global maximum for doctor visits:

the top 10 look pretty empty (max around 40%), because the global maximum is around 80%. A local maximum for the shown data would fix it. However, as soon as the user shows 10 additional rows, the scale would jump to 80% for all charts, since the location in 12th place has such a high value in its history.
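The jump described here can be reproduced with a tiny sketch. The location names and numbers below are invented for illustration, and `visible_max` is a hypothetical helper; it assumes every visible history is already loaded and non-empty:

```python
from typing import Dict, List


def visible_max(histories: Dict[str, List[float]], visible: List[str]) -> float:
    """Largest value across the histories of the currently visible rows."""
    return max(max(histories[name]) for name in visible)


# Invented data: two rows that peak near 40, and a lower-ranked row
# with a single historic spike at 80.
histories = {
    "row 1": [30.0, 40.0, 38.0],
    "row 2": [20.0, 35.0, 33.0],
    "row 12": [10.0, 80.0, 12.0],
}

top_rows = visible_max(histories, ["row 1", "row 2"])             # 40.0
more_rows = visible_max(histories, ["row 1", "row 2", "row 12"])  # 80.0
```

Showing the extra row doubles the shared scale, which is the behavior that might confuse the user.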

RoniRos commented 4 years ago

I see your points. The problem with using the global maximum is that if there is even a single very high value anywhere, even in the far past, it will "squash" all the graphs without the user understanding why. We could use a "trimmed max", or the 90th percentile of the values, so that we avoid the worst extremes, and those will render as clipped. Alternatively, we could stick with the "maximum among visible data", and accept that if the user asks for another 10 locations, the scale may change. I personally find it acceptable. Even here, we can apply a "95% trimmed max", so in your example, the peak in Pettis County in late April will be clipped.
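A rough sketch of what a "trimmed max" could look like, using a nearest-rank percentile. The helper name and the choice of nearest-rank (rather than an interpolated percentile) are assumptions made for illustration:

```python
from typing import Iterable, List


def trimmed_max(values: Iterable[float], percent: int = 95) -> float:
    """Nearest-rank percentile: the smallest value that at least
    `percent` percent of the values do not exceed."""
    ordered: List[float] = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    # ceil(percent * n / 100) in integer arithmetic, at least 1
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Invented data: 19 ordinary values and one extreme outlier.
data = list(range(1, 20)) + [1000]
cap = trimmed_max(data, 95)            # 19, the outlier is ignored
clipped = [min(v, cap) for v in data]  # the outlier renders at the cap
```

With the plain maximum the scale would go up to 1000 and squash everything else; with the trimmed max only the single outlier is clipped.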

sgratzl commented 4 years ago

We could use a "trimmed max", or the 90% percentile of the values, so that we avoid the worst extremes, and those will render as clipped.

re percentile: see also https://github.com/cmu-delphi/delphi-epidata/issues/227

re among visible: the extreme case is if the user sorts first descending and then switches to ascending for a specific signal.

tildechris commented 4 years ago

Thoughts here:

We could do worse than picking bounds that do not clip the 1st entry
We could experiment with picking bounds at the currently selected date and dynamically changing the scale
We could experiment with making it more obvious when data is being clipped so that a constant trend does not appear similar to a clipped value.

@sgratzl does this give you enough to experiment? Also, I notice that the performance is a bit laggy on my machine when moving the date cursor around. We should be careful that we do not adversely affect performance while figuring out an improvement here.

sgratzl commented 4 years ago

I created a version which uses the local maximum in the current visible subset: https://github.com/cmu-delphi/www-covidcast/pull/552

* We could experiment with picking bounds at the currently selected date and dynamically changing the scale

I don't get how the scale is related to the current date since the chart shows histories

* We could experiment with making it more obvious when data is being clipped so that a constant trend does not appear similar to a clipped value.

There was the discussion on using a different color for clamped values (such that a fraction of the line appears in a different color). However, I don't see how it would be obvious to the user that a horizontal colored line at the top means that the values are clamped. When using bar charts you can see (as also noted by Roni) that people sometimes use an indicator within the bar that something was cut out, e.g.:

(https://vdl.sci.utah.edu/upset2/)

It uses two encodings: first, bars are wrapped up to two times, and if the value is still too large, an indicator is used to show that it doesn't fit anymore. However, I am not aware of any method that does a similar thing for line charts.

sgratzl commented 4 years ago

for more existing discussion see slack channel: covid-19-visualization with messages/threads from Sep 27 and Sep 29

dlaliberte commented 3 years ago

I did some searching for examples of how to deal with the clipping of lines in a line chart, and didn't find much, mostly just complaints about undesired clipping. Here are some more ideas for how we could render something to indicate the clipping:

Roni's suggestion of drawing a dashed line at the ends, both exiting and entering.
Different background area (gray, red...) between the exiting and entering, maybe just a band at the extreme.
Dashed/dotted line (or gray, red...) at the extreme between exiting and entering.
Use Horizon chart mode, which is an Area chart where the clipped portion wraps around and is drawn in a darker color. More than 2 layers may be necessary.
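For the Horizon option, the wrapping boils down to splitting each value into stacked bands of equal height. A small sketch of that arithmetic, with a made-up function name, assuming non-negative values and a fixed number of bands:

```python
from typing import List


def horizon_layers(value: float, band_height: float, n_bands: int) -> List[float]:
    """Height drawn in each band; band k covers [k * band_height, (k + 1) * band_height]."""
    return [
        min(max(value - k * band_height, 0.0), band_height)
        for k in range(n_bands)
    ]
```

For example, with a band height of 10 and 3 bands, a value of 25 fills the first two bands and half of the third. A value of 45 fills all three bands and is still cut off, which is why more than 2 layers may be necessary.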

Assuming there might be some clipping, we can reduce how much clipping we have to do by using the data in the neighborhood of the "current day" (before and after) to determine the maximum across all the displayed charts in that neighborhood. This way, we show the neighborhood around the current day without clipping, but may require clipping elsewhere.

dlaliberte commented 3 years ago

Regarding the Horizon graph idea, if we use a vertical gradient fill instead of the solid fill, then the color of the clipped parts could exactly match. I found no examples of this anywhere, surprisingly, but that just makes me more curious to try it.

Another alternative is to use a Ridgeline type of chart, in which we allow the overflow to be drawn. This requires that all the charts be drawn in the same space (as part of one chart), and it works better if the series of charts reflects another continuous or ordinal dimension of the data, rather than an arbitrary collection of discrete (nominal) data sets. Here is an example, that also uses vertical gradients.

Clipped from: https://i2.wp.com/vizfromthesix.com/wp-content/uploads/2019/10/mm2019_w43_orig.png

sgratzl commented 3 years ago

interesting ideas, do you have an idea how we handle it when we have multiple regions (like during a comparison) at the same time?

dlaliberte commented 3 years ago

If each chart is showing multiple series, then they ought to use different colors, either for points, lines, or areas. The same ideas would apply in that case, though we would probably need transparency for the area charts. Probably wouldn't be able to combine very many series without becoming too confusing.

dlaliberte commented 3 years ago

I suggest we do the following:

Use a variation of Sam's #552 for determining the maximum y value. We should only look in a neighborhood around the current selected date, not the entire domain. The neighborhood could be a couple weeks or months.
Change the rendering of the clipped line range to something fuzzy or dashed. Something that suggests it is not real data but represents a threshold that was crossed.
Allow the user to click to select a different 'current selected date', which causes the charts to rerender using the new neighborhood. Thus the user could click in the clipped line range to see what was clipped, if they are interested.
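The first step could be sketched roughly as follows. This is not the code from #552; the function name, the `radius` parameter, and the assumption that all histories are lists aligned to the same day index are mine:

```python
from typing import List


def neighborhood_max(histories: List[List[float]], selected: int, radius: int) -> float:
    """Maximum over all charts, looking only at days within `radius`
    of the selected day index (window is truncated at the axis ends)."""
    lo = max(0, selected - radius)
    hi = selected + radius + 1  # slice end is exclusive
    return max(max(series[lo:hi]) for series in histories)


# Invented data: an old spike at day 0 that is outside the neighborhood.
histories = [
    [90.0, 5.0, 6.0, 7.0, 8.0],
    [1.0, 2.0, 3.0, 4.0, 5.0],
]
y_max = neighborhood_max(histories, selected=3, radius=1)  # 8.0
```

The old spike of 90 no longer drives the scale; it would render as clipped until the user selects a date near it.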

philmcguinn commented 3 years ago

Not to derail this conversation, but I wonder if we may be able to choose a default bound in advance, rather than trying to work it out dynamically when the charts are created. For the doctor's visits example above, how long would it take to calculate the outliers in that whole dataset and make a determination on the bounds? Then we could have a system of showing the default view, with a possible user action to adjust it in the rare cases it's needed.

If that's more work because of the number of permutations, or likelihood of future data messing up our bounds, feel free to disregard.

Back to the discussion at hand...I think a visible cue showing where it's clipped makes sense, maybe a dashed line like suggested, and a Z shape on the top of the Y axis to show that there may be data above. I like the idea of allowing a user to click to see more info in the (hopefully) rare cases where we clip something important to them.

tildechris commented 3 years ago

We can see if there are performance considerations, but just inspecting the data for bounding/outliers is linear, so it's not too expensive to compute.

I think it's more appropriate to choose a good range for the "local neighborhood" than to focus on determining outliers and then excluding them from the range, since it should give us a bit more fidelity with the most recent data, which I would consider more interesting.

The "local neighborhood" approach would also allow us to handle a situation where you had a very gradual change downwards. In this case you probably don't have any outliers in a statistical sense but we can still show an appropriate range for recent values.

dlaliberte commented 3 years ago

Another way to deal with the clipping is to split the vertical axis near the top and use a piecewise axis where the lower part is the same linear scale we are using now, and the upper part uses a log scale that ranges from the clipped value to the max. This would therefore actually show the full data unclipped, though the upper part is very compacted. User interaction could drag the split point to see more (or less) of the upper part.
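A sketch of such a piecewise mapping from a data value to a vertical position in [0, 1]. The `split` parameter (the share of the axis given to the linear part) and the function name are assumptions for illustration; it requires 0 < clip_value < max_value and non-negative values:

```python
import math


def piecewise_position(value: float, clip_value: float, max_value: float,
                       split: float = 0.8) -> float:
    """Linear from 0 to clip_value on the lower `split` share of the axis,
    logarithmic from clip_value to max_value on the remaining share."""
    if value <= clip_value:
        return split * value / clip_value
    value = min(value, max_value)  # never draw above the top of the axis
    frac = math.log(value / clip_value) / math.log(max_value / clip_value)
    return split + (1.0 - split) * frac
```

With clip_value=100 and max_value=10000, a value of 50 sits at about 0.4, a value of 1000 at about 0.9, and 10000 at the very top. Dragging the split point would correspond to changing `split`.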

dlaliberte commented 3 years ago

Rather than picking a specific amount of time (e.g. 2 weeks, or 2 months), another reasonable heuristic is to use a fraction of the total time from the first to last date, say 1/3. Also, display the range before and after the selected date (where the red line is) without clipping, by dividing the non-clipping window in equal halves if possible, or by bumping the window back if we are too close to the end or beginning of the axis.
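Roughly, the window computation could look like this. It is only a sketch of the heuristic described above, with day indices instead of dates and a made-up function name:

```python
from typing import Tuple


def unclipped_window(n_days: int, selected: int, fraction: float = 1 / 3) -> Tuple[int, int]:
    """Inclusive (start, end) day indices of the window that is shown unclipped."""
    width = min(n_days, max(1, round(n_days * fraction)))
    # center the window on the selected day ...
    start = selected - width // 2
    # ... then bump it back inside the axis if it runs past either end
    start = max(0, min(start, n_days - width))
    return start, start + width - 1
```

For a 90-day axis this gives a 30-day window: day 45 yields (30, 59), while day 88 is bumped back to (60, 89) and day 2 to (0, 29).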

I'm working on this issue as part of fixing #589, and perhaps #615 too, since they are closely related.

dlaliberte commented 3 years ago

I think we can consider this closed now, given the changes made for #618.

cmu-delphi / www-covidcast

improve top 10 domain range #507