[Discuss] Rendering partial results

stacey-gammon commented 4 years ago

Discussion started in https://github.com/elastic/kibana/issues/53336, but I decided to open a new ticket to discuss partial results in general, not the expression implementation.

Partial results are part of the Make it Slow effort, but there has been a lot of discussion about whether it makes sense to show partial results when they may not be a true reflection of the final results.

I personally think there is value in showing partial results even in the case of a pie chart with an average aggregation. I think we need to make it clear to the user that these are not the final results, like having the visualization be greyed out, but we should show these partial results and leave it up to the user to decide whether or not any information can be gleaned from these partial results.

I argue it is possible to gain valuable information from partial results even in the case of a pie chart and an aggregation like Sum or Avg, or Top hit, assuming the user knows something about their data. For instance if the data is numeric and always >= 0, then as soon as any slice has data in it, the user knows they found a hit where the number is > 0. With the SUM agg, they can know even more information because as long as the number is always positive, they know the numbers they are looking at might still grow higher, but they will never shrink.
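A minimal sketch of that argument, assuming hypothetical per-shard values that are all >= 0 (the `shards` data and the function name are made up for illustration): the partial sum never shrinks as shards report, while the partial average can move in either direction.

```python
def running_sum_and_avg(shard_values):
    """Yield (partial_sum, partial_avg) after each shard's values arrive."""
    total = 0.0
    count = 0
    for values in shard_values:
        total += sum(values)
        count += len(values)
        yield total, (total / count if count else None)


# Hypothetical per-shard values, all >= 0.
shards = [[10, 20], [0, 2], [50]]
partials = list(running_sum_and_avg(shards))

sums = [s for s, _ in partials]  # only ever grows: a lower bound on the final sum
avgs = [a for _, a in partials]  # drops after shard 2, rises again after shard 3
```

So a partial Sum carries a guarantee (a lower bound) for non-negative data, whereas a partial Avg carries none without further knowledge of the data.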

Even if this is not a very common situation, I think it may be easier, technically, to always show partial results by default. The only time I think we need to be careful is when a secondary query is sent out based on data from the first. If that query is a slow query, we need to cancel it as soon as the first query sends us new values. Or perhaps detect this situation and not show partial results to avoid the extra querying overhead.
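One way to picture the cancellation is the sketch below. It uses `asyncio` with stand-in coroutines rather than any real Kibana or Elasticsearch API: `slow_secondary_query`, `run`, and the update values are all hypothetical. Each new partial value from the first query cancels the in-flight secondary query and starts a fresh one.

```python
import asyncio


async def slow_secondary_query(value, completed):
    """Stand-in for a slow follow-up query derived from a primary result."""
    await asyncio.sleep(0.05)
    completed.append(value)  # only reached if the query was not cancelled
    return value


async def run(primary_updates, gap=0.01):
    """Restart the secondary query each time the primary query updates."""
    completed = []
    task = None
    for value in primary_updates:
        if task is not None and not task.done():
            task.cancel()  # its input is stale, drop the in-flight query
        task = asyncio.create_task(slow_secondary_query(value, completed))
        await asyncio.sleep(gap)  # time until the next partial result arrives
    if task is not None:
        await task  # let the query for the last value finish
    return completed


completed = asyncio.run(run(["partial-1", "partial-2", "final"]))
```

In this toy run only the secondary query for the last value completes. The wasted work is the cost of the cancelled queries, which is the overhead the alternative (not showing partial results in this situation) would avoid.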

It was also my takeaway from the Make it Slow PR that we should be showing partial results for nearly all visualizations, but not everyone had that same takeaway, so let's use this issue to reach a documented decision.

cc @AlonaNadler @peterschretlen @ppisljar @timroes @alexh97

elasticmachine commented 4 years ago

Pinging @elastic/kibana-app-arch (Team:AppArch)

dsmith001 commented 4 years ago

Zoomdata called this capability "data sharpening" and used it as a differentiator for doing BI against massive Hadoop data sets (leveraging Apache Impala).

Nice little 3 min. overview: https://www.youtube.com/watch?v=zZs-SIkwJ-g

rayafratkina commented 4 years ago

I have 2 questions:

Do we know how it's implemented in the back end? What does the sharpening process actually do at the query level?
Do we know of specific use cases for this feature? It makes for a nice demo, but I really don't understand why I would choose to refine my visualization before knowing what it shows. Zoomdata is sort of designed to deliver real-time data changes, so I feel like they are never done loading, and therefore it's sort of a moot point that you can refine before it's done: you are never done.

peterschretlen commented 4 years ago

I think the challenge is that, as a user, it's hard to look at an incremental visualization alone and judge whether the result is useful or not. Imagine a dashboard where some panels show useful, converging results based on partial data and some do not, and you cannot tell which is which.

So unless you deeply understand the data and computation being done (or can communicate the degree of uncertainty to the viewer), I think partial results in general are unhelpful at best and misleading at worst.

That said, I think there are a few cases where partial results can help:

You are doing exploratory analysis where partial and not fully correct data is sufficient to take the next step (the Zoomdata example shows this kind of exploration).
You are building and designing visualizations, where you iterate/experiment and don't necessarily need the full data.
Individual documents are shown, like in Discover or the logs viewer. Often the individual documents can tell you something even if you don't have all of them yet.

There are a lot of academic papers that investigate incremental/progressive visualization, particularly for exploratory analysis. It might be worth doing a review to see the challenges and findings, and how they might apply here. (Just one example: Trust Me, I’m Partially Right: Incremental Visualization Lets Analysts Explore Large Datasets Faster)

monfera commented 4 years ago

Trust, but Verify: Optimistic Visualizations of Approximate Queries for Exploring Big Data from Dominik, and Danyel, Ding, Wang at MS is a favorite.

There can be value in getting an approximate response even if the sample distribution is not representative of the population distribution. For example, it helps establish magnitudes, units of measure, and the approximate extent of the data. If we're lucky, or we can tilt the odds in our favor, then even the distribution of the first 1% of the data will be similar to that of the rest.

A related concept is level of detail (LoD). For example, the initial query can do super coarse binning, or rely on an aggregate index even if it is not of the ideal resolution, as it may still give a decent histogram or heatmap; the view then evolves into a more granular histogram, or into a scatterplot, respectively. So the topic of "visualizing incomplete data" may cover not just missing documents, buckets or bucket contents, but also preemptive strategies, e.g. intentionally "missing out" on some level of detail in favor of retaining both low latency and representative results. Examples of enriching coarse visualizations:

1D barchart to 2D heatmap
coarse heatmap to fine heatmap
heatmap to scatterplot (Cartesian and geographic alike)
sparkline (from few points) to detailed line chart
singular barchart, heatmap etc. to trellised barchart, heatmap etc. for the ultimately required breakdown
coarse histogram to finer grained histogram (though for normal dist, bin count is limited by Sturges' formula)
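As a rough illustration of the last item, the sketch below bins the same data first coarsely and then with the bin count from Sturges' formula, ceil(log2(n)) + 1. The `values` list and the helper names are hypothetical, and the fixed [0, 16] range is an assumption made for the example.

```python
import math


def sturges_bins(n):
    """Sturges' formula: ceil(log2(n)) + 1 bins for n observations."""
    return math.ceil(math.log2(n)) + 1


def histogram(values, lo, hi, bins):
    """Count values into `bins` equal-width buckets over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # clamp v == hi into the last bin
        counts[i] += 1
    return counts


# Hypothetical data set of 16 observations.
values = [1, 2, 2, 3, 5, 6, 6, 7, 8, 9, 9, 10, 12, 13, 15, 16]

coarse = histogram(values, 0, 16, 2)                         # cheap first pass
fine = histogram(values, 0, 16, sturges_bins(len(values)))   # refined pass, 5 bins
```

The coarse pass already conveys the extent and rough balance of the data; the finer pass adds shape without exceeding the bin count that the sample size supports.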

Other known techniques, sometimes called adaptive sampling, weight bin sizes by importance: for example, bin the last 24 hours by minute, the prior 30 days by hour, the prior 12 months by day, and so on.
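A small sketch of that tiering, under the assumption that "30 days" and "12 months" can be approximated by fixed cutoffs of 30 and 365 days (the `TIERS` table and function names are illustrative, not an existing API):

```python
from datetime import timedelta

# (maximum age, bin width) per tier. The 30-day and 365-day cutoffs are
# approximations chosen for this example.
TIERS = [
    (timedelta(hours=24), timedelta(minutes=1)),
    (timedelta(days=30), timedelta(hours=1)),
    (timedelta(days=365), timedelta(days=1)),
]


def bin_width_for_age(age):
    """Return the bin width for data of the given age, or None if too old."""
    for max_age, width in TIERS:
        if age <= max_age:
            return width
    return None


def total_bins():
    """Number of bins needed to cover all tiers back to the oldest cutoff."""
    bins, covered = 0, timedelta(0)
    for max_age, width in TIERS:
        bins += (max_age - covered) // width
        covered = max_age
    return bins
```

With these cutoffs a full year is covered by `total_bins()` = 2471 buckets, compared with 525,600 if every minute of the year had its own bin.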

AlonaNadler commented 4 years ago

The partial results concern is something we discussed multiple times during the Make it Slow work, with and without Elasticsearch folks.

The simple use case is Discover or any other application that shows raw documents as a result. In these cases, getting results streamed while the query is in progress, instead of waiting for the query to return the entire result set, is a better experience and can be useful on multiple occasions.

When it comes to aggregations, I'll use the dashboard as an example. Please assume that we will provide a UI that shows clearly that the panel is still in progress (we will share it here soon, @mdefazio).
If the dashboard loading, for example, takes 2 minutes, getting intermediate results after 10-15 seconds, and getting those results updated frequently while viewing a progress bar, can be a really powerful experience, one that makes people perceive the speed of Elasticsearch while also getting a good indication of the progress. If the dashboard loading takes 5 hours, for example, intermediate results can give people a glance at partial results while waiting for the full results.

A lot of how this feature will be perceived is on us and how we make it clear that these are partial results that we intend to do.

timroes commented 4 years ago

> So unless you deeply understand the data and computation being done (or can communicate the degree of uncertainty to the viewer), I think partial results in general are unhelpful at best and misleading at worst.

I share @peterschretlen's concern here. I want to leave aside partial results for Discover for now, because I think they might provide some advantage there, and focus purely on Visualizations and Dashboards, where they are, as Peter put it, unhelpful at best and misleading at worst. I think the effect gets worse for really long-running queries.

Let me use a couple of examples to demonstrate this. Assume you create a line chart, it starts loading, and you sit there for a couple of minutes watching it evolve:

Now it needs another 10 minutes to finish loading all data (you might head for a small coffee). So what would the final chart look like?

Here it is:

![photo5873086831040508412](https://user-images.githubusercontent.com/877229/73254642-15e35380-41bf-11ea-84ac-dff8e69f11df.jpg)

This is a totally valid and likely scenario. Until we have the full data, we're basically working in an uncertainty cone of 100%. Some aggregations might converge quickly towards the final result, others can flip constantly. No matter how carefully we explain to the user that this is just partial data, I am pretty sure they'll nevertheless build up expectations about how the final chart will look. So showing them the partial data has mainly done one thing: built up false expectations, without providing any value in that case, beyond showing that the chart is still loading (and that is nothing a loading spinner couldn't convey).

There are more examples. When running a terms aggregation on the x-axis of a bar chart, the bars will constantly reorder and potentially disappear, since we don't have the final order until we have the final data. So a scenario like the following is quite likely:

[Images: chart, chart (1), chart (2), chart (3)]

So the only thing we've done, besides creating some rendering artifacts and showing a "loading spinner" in the form of a chart, is mislead the user about the final results.
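The reordering effect can be reproduced with a toy model. This is a simplified sketch that merges hypothetical per-shard term counts one shard at a time; it is not how Elasticsearch actually reduces a terms aggregation, and the shard contents are invented for the example.

```python
from collections import Counter


def top_terms_over_time(shard_counts, size):
    """Yield the top-`size` terms after merging each shard's counts."""
    merged = Counter()
    for counts in shard_counts:
        merged.update(counts)
        # Sort by count descending, then by term for a stable tie-break.
        ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
        yield [term for term, _ in ranked[:size]]


# Hypothetical per-shard document counts per term.
shards = [
    {"nginx": 50, "redis": 40, "kafka": 5},
    {"kafka": 60, "mysql": 45},
    {"mysql": 30, "redis": 5},
]
snapshots = list(top_terms_over_time(shards, size=2))
```

Here the top two bars go from nginx/redis to kafka/nginx to mysql/kafka: none of the bars from the first snapshot survives into the final chart.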

There actually is one very special niche where partial results do make sense: when we know the actual buckets in advance, and buckets only come in once they are complete, so their metrics won't change afterwards. This could potentially happen in two cases: a date histogram, and a histogram with a fixed min/max value. In those cases, if we can guarantee the metrics won't change anymore, we'll actually see proper progress in the chart. As of the discussion we had yesterday, this is not the behavior of ES at the moment, and the recommendation there was not to use partial results loading for that, but to do multiple requests from Kibana spanning increasingly larger time ranges instead.

Since the usefulness of loading partial results in general is limited to some very special use cases, I would highly recommend we not go for a generic solution that will convey misleading information for the sake of showing the user that we're still loading data. Instead, we could consider building a specific solution for those narrow date_histogram and histogram use cases, where we know we're not showing basically "random" data to the user (until we have all the data, we don't know whether the deviation from the final result is actually smaller than that of random data we could show).

> A lot of how this feature will be perceived is on us and how we make it clear that these are partial results that we intend to do.

As shown above, I don't think that's entirely true, since I don't think we'll have a way to design it that will disable the pattern matching in people's brains, so that the partial information doesn't mislead them.

If we want to nevertheless go the route of partial results for generic cases, I would second Peter's suggestion here, and we should have a couple of people with good data scientist experience work through some of the research on that topic and end up with a good recommendation on what and how we should address this.

ppisljar commented 4 years ago

@monfera I like the document you posted, but that goes beyond just showing the partial results Elasticsearch would return. If we had a way to show the uncertainty, that would be great, but I think that is a big project on its own. And as Tim mentions, with aggregations we have no way to measure the uncertainty, and it could theoretically be close to 100% until the last shard returns data.

Maybe we should look in the direction of handling this with multiple queries to ES, as suggested by the ES team in yesterday's meeting (requesting the last day, last 5, last 15, last 30), when it comes to aggregations.
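A sketch of what those widening requests could look like, assuming the spans are meant as days (the unit is not spelled out above) and using a made-up `expanding_ranges` helper and a fixed "now" for reproducibility:

```python
from datetime import datetime, timedelta


def expanding_ranges(now, day_spans=(1, 5, 15, 30)):
    """Return (start, end) pairs for the last 1, 5, 15 and 30 days."""
    return [(now - timedelta(days=d), now) for d in day_spans]


now = datetime(2020, 1, 28, 12, 0, 0)  # hypothetical "now"
ranges = expanding_ranges(now)

for start, end in ranges:
    # Each range would be sent as its own complete query; the chart is
    # redrawn when that query finishes, so every frame shows final numbers
    # for the range it covers.
    print(start.isoformat(), "->", end.isoformat())
```

The difference from shard-level partial results is that each intermediate frame is exact for its own time range, at the cost of querying the most recent data several times.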

rayafratkina commented 4 years ago

Can we explicitly separate discussion of timeseries data from other types? I think there are clear useful ways to load incrementally or display partial data for time series. I am not sure about other visualizations...

AlonaNadler commented 4 years ago

Great discussion. I understand both sides, and this was discussed multiple times within the extended team with ES. At this point the decision is to implement partial results. We believe it is a feature that will benefit our users in most cases. We will make sure to explicitly emphasize that the results are partial when showing them. Thanks for raising all the concerns here and sharing great links. We will use some of the ideas here when creating the design and share them soon. cc: @mdefazio @VijayDoshi

peterschretlen commented 4 years ago

I don’t have a problem going ahead with design and engineering assuming we want to pass in-progress data through to the visualization.

I assume the ability for a visualization to accept in-progress data vs how (or if) it renders that data can be treated separately? If that's true, a final decision of how in-progress results are shown to users doesn't need to be made now.

We should move forward, but let’s revisit after thinking through the design, trying to account for concerns raised here. I do think it's time to stop discussion in the abstract - we surfaced some valid concerns but we're not going to progress much further in this issue without something concrete to discuss.

peterschretlen commented 4 years ago

Summary of where we are today:

There seems to be consensus that:

We can improve perceived speed/responsiveness by providing feedback on progress, and this is beneficial to users.
In some cases partial results have utility. Raw documents and timeseries are examples.

And I think the concerns are:

Partial-results risk being misinterpreted. In general (there are exceptions) you can’t guarantee correctness and results can be misleading. We don't have a way to measure uncertainty or confidence, just completeness in terms of # of shards.
If there’s no utility in the partial result, it becomes just a fancy way of providing feedback on the progress. Perhaps the focus should be on the progress, not the partial result.
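To make the gap between completeness and confidence concrete, here is a toy calculation. The shard sums are hypothetical and chosen so that the last shard dominates; the helper names are illustrative only.

```python
def completeness(reported_shards, total_shards):
    """Share of shards that have responded; says nothing about accuracy."""
    return reported_shards / total_shards if total_shards else 1.0


def relative_error(partial, final):
    """Distance of a partial metric from the final one, known only in hindsight."""
    return abs(partial - final) / abs(final)


# Hypothetical per-shard sums: the last shard dominates the total.
shard_sums = [10, 10, 10, 970]
final = sum(shard_sums)
partial = sum(shard_sums[:3])

done = completeness(3, len(shard_sums))   # 0.75
error = relative_error(partial, final)    # 0.97
```

With three of four shards reported the result is 75% complete, yet the partial sum is off by 97%. The first number is available while loading; the second is not.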

AlonaNadler commented 4 years ago

Thanks for the summary, Peter. As mentioned above, @mdefazio will share the designs once ready, and we would love to get this team's feedback.

lizozom commented 3 years ago

As we progress with https://github.com/elastic/dev/issues/1209 and as queries potentially become longer, the importance of showing partial results is increasing.

While, as @peterschretlen mentions in his summary, partial results might be misleading in some cases, in others, especially when looking at normally distributed or time-based data, partial results might be useful and allow users to optimize their workflow.

The current focus is making sure that the UX of core applications (Discover, Visualizations, Lens, Dashboard) makes it very clear whether a user is viewing partial results. Solutions integrating partial results should be aware of this and consult the @elastic/kibana-design team.

ppisljar commented 2 years ago

Thank you for contributing to this issue, however, we are closing this issue due to inactivity as part of a backlog grooming effort. If you believe this feature/bug should still be considered, please reopen with a comment.

elastic / kibana

[Discuss] Rendering partial results #55408