Quansight / omnisci

Explorations on using MapD and Jupyter together.

Vega, Datashader, and Holoviews Collaboration #67

Open saulshanabrook opened 4 years ago

saulshanabrook commented 4 years ago

We had a call a few weeks ago with @jbednar @tonyfast @dharhas @philippjfr to discuss different ways Datashader and HoloViews could be useful to the work we are doing with OmniSci. I was particularly interested in whether all the work that has been put into creating these interactive rasterized geospatial plots (the NYC Taxi example) could be reused for our current work getting interactive Vega visualizations to execute on a Python backend.

My takeaway from the conversation is that Datashader is all about taking some data and rasterizing it. If we want to think of this in terms of transformations on the data, it is like doing a groupby by pixel and then displaying some aggregate.
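To make that "groupby by pixel" framing concrete, here is a conceptual sketch in plain pandas. This is not Datashader's actual implementation (which is heavily optimized), just the idea: bin each point to a pixel coordinate, group by pixel, and aggregate with a count.

```python
import numpy as np
import pandas as pd

# Made-up point data standing in for a large scatterplot dataset.
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.normal(size=10_000), "y": rng.normal(size=10_000)})

width, height = 100, 100

# Bin each point into an integer pixel index along each axis...
px = ((df["x"] - df["x"].min()) / (df["x"].max() - df["x"].min()) * (width - 1)).astype(int)
py = ((df["y"] - df["y"].min()) / (df["y"].max() - df["y"].min()) * (height - 1)).astype(int)

# ...then "groupby by pixel" and aggregate (here: a count per pixel,
# i.e. the raster image Datashader would hand to the plotting layer).
counts = df.groupby([px, py]).size()
```

The same shape of computation supports other per-pixel aggregates (mean, sum, etc.) by swapping the final reduction.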

And Holoviews, despite its name, is at its core not about viewing data, but about transforming it. The key idea is to maintain enough semantic knowledge about the data as we transform it so that appropriate visualizations are implicit in the data encoding.

So if we think about HoloViews as a way of transforming data, with Datashader being one particularly heavily optimized transform, then we can see where this fits in our current pipeline. What do we currently use for transforming data? We take Vega transforms and map them to Ibis expressions. Instead, we could take Vega transforms and map them to HoloViews calls. HoloViews wouldn't be used on the frontend for visualizing at all; it would just be a backend library that performs the appropriate transforms, which Vega would call out to whenever it needed to transform data. If we wanted to use our existing pipeline directly, we could try to write an Ibis backend for HoloViews. However, there might be too much impedance mismatch between the grammar of Ibis and that of HoloViews, so instead we could write a different Python backend for Vega, one that translates directly to HoloViews instead of Ibis.
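As a hypothetical sketch of what "map Vega transforms to backend calls" could look like, the function below interprets a Vega-style aggregate transform spec, with pandas standing in for the Ibis or HoloViews backend. The function name and the translation are assumptions for illustration, not part of any existing pipeline.

```python
import pandas as pd

def apply_vega_aggregate(df, transform):
    """Interpret a Vega-style "aggregate" transform against a DataFrame.

    `transform` mirrors Vega's aggregate shape: parallel `ops`, `fields`,
    and `as` lists, plus a `groupby` list of column names.
    """
    aggs = {
        out: pd.NamedAgg(column=field, aggfunc=op)
        for field, op, out in zip(transform["fields"], transform["ops"], transform["as"])
    }
    return df.groupby(transform["groupby"], as_index=False).agg(**aggs)

# Toy data and a Vega-like aggregate spec (made-up values).
df = pd.DataFrame({"cat": ["a", "a", "b"], "val": [1, 2, 3]})
spec = {"groupby": ["cat"], "ops": ["mean"], "fields": ["val"], "as": ["avg_val"]}
result = apply_vega_aggregate(df, spec)
```

A real backend would emit Ibis expressions (and hence SQL) or HoloViews operations instead of executing eagerly in pandas, but the shape of the translation is the same.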

What would be the payoff here? Well, users would get to use the Altair API to construct interactive visualizations, and they would get the efficiency Datashader has built in for rasterizing data.

The next steps here would be to explore how vega transforms like groupbys and aggregates could be mapped to datashader. Before that, we should come up with a particular use case for interactive visualization with datashader and holoviews, try to replicate it with altair, and then see how we would map the vega transforms to the holoviews expressions.

Taking a step back, what we are doing here is mapping one domain specific language, Vega transforms, to another, Holoviews operations.

cc @ian-r-rose @domoritz

domoritz commented 4 years ago

That sounds pretty interesting. I don't know enough about datashader and holoviews to say anything smart before I take a closer look at their models.

saulshanabrook commented 4 years ago

I don't know enough about datashader and holoviews to say anything smart before I take a closer look at their models.

I am just starting to look into them, so if I got any of my summary wrong I would appreciate being corrected by any of the authors.

domoritz commented 4 years ago

Could you clarify how you think holoviews could be translated to SQL for omnisci?

saulshanabrook commented 4 years ago

Could you clarify how you think holoviews could be translated to SQL for omnisci?

I don't think it would be. For this to help OmniSci directly, there would have to be an OmniSci backend for Datashader. I am not sure how that would work; possibly as UDFs that run on their server, but I would defer to the Datashader devs on whether they have done any work running against existing databases.

It would be helpful for Datashader's other backends, though. In the example above, it runs off of a Parquet file.

jbednar commented 4 years ago

Based on that very interesting meeting (which has already started to fade from my memory, alas!), I think there were a few things that were clear about how Vega, Datashader, and HoloViews could relate:

  1. Q: Should HoloViews have a Vega plotting backend (adding to the current Matplotlib, Bokeh, and Plotly plotting backends)?

  2. A: Strictly in terms of plotting, no, there does not seem to be any reason to do so. Adding one via Altair seems reasonably straightforward, but the set of plots covered by Altair is already covered by the existing backends. Each of the other existing backends provides unique functionality (Matplotlib offers full SVG export for layouts, Bokeh offers fully supported interactivity, and Plotly offers 3D and a few other unique plot types), but Vega doesn't appear to enlarge this total capability, just provide a slightly different look and feel. Thus I don't currently see any particular reason to spend the effort to develop a Vega plotting backend, for plotting.

  3. A: Even so, there could still be a very good reason to provide a Vega plotting backend, if you think not in terms of generating plots, but in terms of generating a Vega JSON spec and grabbing that before the actual plot is generated. If there were a Vega plotting backend, then people could use the HoloViews API to create Vega JSON specifications that could be passed to any Vega rendering tool, which could fit nicely into the deployment workflows of various organizations. Getting Vega JSON specs out (not just completed plots) does seem like a potentially compelling approach, and is something we could follow up on if there were sufficient interest. But note that such a Vega spec would normally just be for the final plot in some transformation pipeline, e.g. a Datashader-rasterized scatterplot would show up as an image or heatmap plot specification at the Vega level (assuming Vega has such plot types nowadays!).

  4. The plotting backend is about output, but input is also pluggable for HoloViews. HoloViews supports many "data interfaces" (for numpy, pandas, xarray, etc. data sources). Adding Ibis+SQLAlchemy as a data interface would allow HoloViews to work with SQL databases about as transparently as we currently work with the local data sources. We're excited about this possibility, which Philipp has estimated as a 1-2 week job for him to do, but do not currently have any funding for it. With such a data interface, HoloViews users could specify a set of data lazily to be retrieved as needed for plotting, including a Datashader aggregation/reduction step before plotting. Such an interface would work the same with any plotting backend, including generating Vega specs for the final (transformed/rasterized/etc.) data as export if there were a Vega backend as contemplated above. Note that Datashader doesn't necessarily need to know about any of this; HoloViews can prepare the data into a format consumable by Datashader for any supported HoloViews data interface.

  5. As @saulshanabrook is describing above, it is also possible to use HoloViews without any plotting backend at all, just to implement data transformations. Until the actual display, HoloViews objects are always data containers, not plots, and so HoloViews can capture a series of transformations into whatever form you eventually want to display or analyze, without actually ever displaying it. Splitting HoloViews from its plotting has been on our to-do list for several years, and can happen as soon as we're ready to take a couple of weeks dealing with the admin burden of doing it, so we generally just pretend that's already happened. :-) With this approach and an Ibis/SQLAlchemy data interface, we can use HoloViews to start with an SQL data source, then transform it to a different representation (e.g. rasterized with Datashader, or just sliced/sampled/aggregated in general), then read out the data to be consumed by some external system (whether that's about plotting or not).

  6. Finally, what I think @saulshanabrook is really getting at is even more ambitious than all these options: to start with a Vega specification for a set of transformations that Vega could do with small data, but then to use HoloViews+Datashader to implement the actual transformations, taking the result and displaying it with Vega. Doing so depends on item 4 and possibly item 3, plus being able to translate Vega transformations into HoloViews operations (Vega aggregations to HoloViews aggregate(), etc.).

  7. Option 6 is conceivable, if that's what you have in mind, but to me it seems like it would be more straightforward to simply add a Datashader interface to your own system, bypassing HoloViews entirely. In that case, yes, it might be appropriate to add an Ibis/SQLAlchemy (or Omnisci? not sure) data backend to Datashader, then switch to Datashader to implement the aggregation specified in the Vega spec. I'm not sure about this, and have to run now before being able to think about it deeply, but it's certainly worth considering as an alternative.
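The "image or heatmap plot specification at the Vega level" mentioned in item 3 might look something like the hand-written Vega-Lite sketch below: pre-aggregated per-pixel counts rendered as a `rect` (heatmap) mark. The data values are made up, and this is not output from any existing backend, just an illustration of the kind of spec such a backend could emit.

```python
import json

# A minimal Vega-Lite spec for a rasterized result: the backend has
# already reduced the raw points to per-pixel counts, so the spec only
# carries the (small) aggregate, not the original data.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {"values": [
        {"px": 0, "py": 0, "count": 12},
        {"px": 1, "py": 0, "count": 7},
    ]},
    "mark": "rect",
    "encoding": {
        "x": {"field": "px", "type": "ordinal"},
        "y": {"field": "py", "type": "ordinal"},
        "color": {"field": "count", "type": "quantitative"},
    },
}

# Any Vega/Vega-Lite renderer could consume this JSON directly.
vega_json = json.dumps(spec, indent=2)
```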

jbednar commented 4 years ago

Note that for option 6, this recent addition to HoloViews may be relevant, https://github.com/pyviz/holoviews/pull/3967. It supports storing a HoloViews transformation pipeline in a re-playable semi-declarative form. (Only semi-declarative because even though it's a text-based spec, it's really just a recipe for function calls, but at least it's constrained and introspectable and thus potentially mappable between different declarative systems...)
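This is not HoloViews' actual API, but a toy sketch of the "re-playable semi-declarative" idea: the pipeline is stored as data (a recipe of named operations plus arguments) rather than as opaque callables, so it can be introspected, potentially mapped to another declarative system like Vega transforms, or replayed against new or filtered data.

```python
import pandas as pd

# Registry of named operations; each pipeline step is (name, kwargs),
# so the whole recipe is plain introspectable data.
OPS = {
    "filter": lambda df, expr: df.query(expr),
    "aggregate": lambda df, by, col, fn: df.groupby(by, as_index=False)[col].agg(fn),
}

def replay(df, pipeline):
    """Re-apply a recorded pipeline of (op_name, kwargs) steps to df."""
    for name, kwargs in pipeline:
        df = OPS[name](df, **kwargs)
    return df

# Made-up data and a recorded two-step recipe.
df = pd.DataFrame({"cat": ["a", "a", "b"], "val": [1, 5, 3]})
pipeline = [
    ("filter", {"expr": "val > 1"}),
    ("aggregate", {"by": "cat", "col": "val", "fn": "sum"}),
]
out = replay(df, pipeline)
```

Because the recipe is data, replaying it with a different initial `filter` step is exactly the linked-selection use case described below in this thread.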

saulshanabrook commented 4 years ago

It supports storing a HoloViews transformation pipeline in a re-playable semi-declarative form.

Ah great, yeah that is very useful. Just curious, what was the impetus for this addition? Is it a similar use case?

jbednar commented 4 years ago

Pipeline capturing was added to support replaying the data transformations behind a visible plot, specifically in the case of selecting a subset of the data in one plot and wanting that same subset reflected in various other plots derived from the same data. See https://github.com/pyviz/holoviews/pull/3951 . E.g. if you have 6 columns and some Datashaded plots of various dimensions against other dimensions and see something interesting in one specific plot, the original data points leading to that plot are no longer available (having been rasterized away). But you can still select a region of that plot and replay the full pipeline to update each of the other plots to show only the points that fall in that region for those dimensions, without ever having to send the full data down to the browser, and whether or not those dimensions are actually shown in the other plots.

Moreover, linked selections like this can simply be enabled without any user-written callback code; they will simply be available if someone wants their plots to work like that. I think this support will cover many of the reasons that people want to set up a custom dashboard in the first place, with essentially zero code.

But in general, having the full provenance and reproducible recipe for each plot from a source dataset is likely to be valuable for lots of other purposes we haven't even contemplated yet. E.g. I'm hoping it can be extended to cover "drilling down" use cases with almost no coding as well, which is the second big reason people write custom dashboards (after linked selections). @jonmmease can comment on that one...

jbednar commented 4 years ago

Oh, I guess the third big reason people write dashboards has always been covered by HoloViews already, which is to show a plot that shows a slice of a multidimensional dataset, with values for the dimensions not shown in the plot being selected by widgets. That's just always worked in HV but otherwise would require writing widget code, so I tend to forget about that even more common case.

jonmmease commented 4 years ago

I'm hoping it can be extended to cover "drilling down" use cases with almost no coding as well

Yeah, this should be possible for many use cases. Here's a good overview of how Spotfire handles configuring custom drill-down dashboards using their GUI menus (https://www.youtube.com/watch?v=a5FMokQ2CR0). The machinery we'll have in place when https://github.com/pyviz/holoviews/pull/3951 is finished should be a suitable foundation for these more flexible workflows. Of course we'll need to work out a reasonable API for the user to provide the kind of marking/limiting/combining options that Spotfire's menus provide.

jbednar commented 4 years ago

@jonmmease , sounds great!

saulshanabrook commented 4 years ago

Hey datashader folks! Just wanted to point you to this new issue where some discussion of adding rasterization primitives to Vega Lite is taking place: https://github.com/vega/vega-lite/issues/6043

Since you all have a lot of experience designing this kind of API, I would be curious if you have any feedback on the proposal there.