bokeh / bokeh

Interactive Data Visualization in the browser, from Python
https://bokeh.org
BSD 3-Clause "New" or "Revised" License

Extend crossfilter plugins/chart builder concept for quick general-purpose dashboards #1811

Closed rothnic closed 9 years ago

rothnic commented 9 years ago

This is something I noticed after reviewing #1735. I think there are a lot of similarities between the idea of the chart builder (added by #1735) and the crossfilter plugin (added by #1774). The builder is just marshalling between the user and the plot; the crossfilter plugin is marshalling between the user, the UI, and the plot.

You could think of this more generically, that the builder knows the requirements of the plot it is generating, so a more specialized builder could auto-generate a simple interactive UI for a given data source. This feature would be a step towards being able to quickly piece together multiple high level charts into a dashboard that shares the same data source.

This would be possibly less efficient than using the low level interface for things like crossfilter, but would provide a method for more quickly building interactive dashboards.

Below is a diagram showing the concept: [image: concept diagram]

fpliger commented 9 years ago

@rothnic great feedback and interesting ideas so far! I'll take a better look at #1774, but I think your thoughts about the marshalling are right, and the same goes for the broader idea of the dashboard.

One thing that is not being shipped with #1735, but was discussed and is going to be shipped soon, is a charts composer object that basically lets you use different builders that share the same chart.

IMHO the idea of injecting better control over the source, and how it can be shared by charts to compose dashboards, is GREAT (and it probably shouldn't require too much work).

damianavila commented 9 years ago

I agree @rothnic, these are really great ideas... I would love to see "multiple high level charts into a dashboard that shares the same data source" and I think this would be a big hit: "provide a method for more quickly building interactive dashboards", and I'd add, "powered by charts", just for completeness :wink:.

bryevdv commented 9 years ago

@rothnic first I just want to reiterate, your thoughts and inputs here are incredibly valuable and are very much appreciated. Moving towards better and easier ways for everyday users to stand up dashboards is probably one of the most important goals for the next 3-6 months. I think right after the 0.8 release in a few weeks will be a good time to hash out a plan to factor and consolidate in ways like this and really push on the dashboard front. This might even include a UI "dashboard builder", but to make that work will require clean factorization in ways like you suggest here. I am really excited for all the ideas and help you are bringing, thanks again!

bryevdv commented 9 years ago

BTW if you are planning to keep contributing to Bokeh we should probably discuss adding you to @bokeh/dev (if you are interested in that).

rothnic commented 9 years ago

@fpliger yeah, I factored the actual plot-building functionality of crossfilter out into a plugin. I didn't spend a great deal of time on the design, expecting the charts changes might impact it. I think the implementation of the builder is great, so I'm going to take a look at how it might be applied. The composer is also a great concept for this kind of thing.

If we exposed the potential configuration of the chart, either from the chart, or via a builder associated with the chart, we could do something along the lines of this pseudocode:

class SpecialChart(ParentClass):
    x = DataSourceField('continuous')
    y = DataSourceField('continuous', 'discrete')
    color = DataSourceField('discrete')
    transparency = Bool
    shape = Enum('circle', 'square')

    required_fields = ['x', 'y']
    optional_fields = ['color']
    options = ['transparency', 'shape']

The result of exposing the requirements of this chart would be that you could generalize how to build the control elements. This could include validating the type of field accepted per control, and knowing which fields actually need to be set before initially plotting. You'd get an interactive chart with a generic layout similar to this:

[image: example of a generated control layout]
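
A minimal sketch of how such a declarative spec could drive generic control generation. All class names, fields, and the `build_controls` helper below are hypothetical illustrations of the idea, not part of the Bokeh API:

```python
# Hypothetical sketch: a declarative field spec that a generic UI layer
# can introspect to build selectors and validate required inputs.

class DataSourceField:
    """Declares which kinds of columns a chart aesthetic accepts."""
    def __init__(self, *accepted_kinds):
        self.accepted_kinds = accepted_kinds

class SpecialChart:
    x = DataSourceField('continuous')
    y = DataSourceField('continuous', 'discrete')
    color = DataSourceField('discrete')

    required_fields = ['x', 'y']
    optional_fields = ['color']

def build_controls(chart_cls, columns):
    """Map each declared field to the columns it may legally select.

    `columns` is a dict of column name -> inferred kind.
    """
    controls = {}
    for name in chart_cls.required_fields + chart_cls.optional_fields:
        spec = getattr(chart_cls, name)
        controls[name] = {
            'options': [c for c, kind in columns.items()
                        if kind in spec.accepted_kinds],
            'required': name in chart_cls.required_fields,
        }
    return controls

columns = {'mpg': 'continuous', 'hp': 'continuous', 'cyl': 'discrete'}
controls = build_controls(SpecialChart, columns)
```

A generic UI layer could then render one selector per field, restricted to the listed options, and only enable plotting once every required field has a selection.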

pzwang commented 9 years ago

So, here's a concept: what about using Vega as the declarative expression for the Builder? That would then enable us to use Lyra (https://github.com/uwdata/lyra) as a graphical builder, and we could focus on making BokehJS a server-side-data-aware runtime for Vega and Bokeh JSON.

rothnic commented 9 years ago

Here is one example of a Python-based approach for Shiny-like quick dashboards: Spyre. It uses a configuration-based approach, which is similar to what I used in the past, but that can be difficult from the user's standpoint. I like the way that HasProps objects declare types via class properties, which is similar to Django or other ORMs. This seems effective since you have a bunch of pre-built types available through the API to work with.

If you take something like the BoxPlotBuilder, I think more explicitly defining those properties would provide enough information to key off of. For example, if another property was added as:

groups = Either(DiscreteField, List(ContinuousField))

Then you could both identify the UI elements to add and support less-processed data. The boxplot example shows the data being pre-processed before it is input into the builder, and appears to assume you want to plot each field as a box.

For something like a box plot or histogram there are two key requirements: a field containing the values (continuous), and an optional field containing categorical/discrete data. So, if you have a non-preprocessed data set, you can derive the groups from a single field of discrete data, as an alternative to having the values already separated into different fields. This is ideal for more interactive analysis, since you can modify the values field and/or grouping field interactively.
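
As a sketch of that idea in plain Python (the `group_values` helper and the column names are hypothetical; a real implementation would likely sit behind the builder's data adapter):

```python
from collections import defaultdict

def group_values(data, values, by=None):
    """Derive box-plot groups from columnar data (a dict of lists).

    With `by=None` the values column is returned as a single group;
    with a discrete `by` column, values are split per category, so the
    caller never has to pre-separate the data into one field per box.
    """
    if by is None:
        return {values: list(data[values])}
    groups = defaultdict(list)
    for category, value in zip(data[by], data[values]):
        groups[category].append(value)
    return dict(groups)

# Tiny autompg-style sample (columns are keys, as described above).
autompg = {
    'hp':  [130, 165, 150, 95, 97, 85],
    'cyl': [8, 8, 8, 4, 4, 4],
}
print(group_values(autompg, values='hp', by='cyl'))
# {8: [130, 165, 150], 4: [95, 97, 85]}
```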

My question is, would this more explicit requirements specification better exist as part of some wrapper around the Builder, or be added to the Builder?

rothnic commented 9 years ago

Some initial work towards the described concept (pulled out of crossfilter) is inferring the column type from the column's data, so that we can provide a mechanism to easily communicate which kinds of columns should be included in the input widget. This functionality would mostly be exercised on initialization of an interactive chart widget, removing the need for a lot of edge-case handling logic. I searched around and couldn't find anyone really trying to classify data in a way that affects plotting without already knowing the contents of the data.

In my use case, we have many tables that might have 50+ columns, with varying mixes of numbers stored as strings, alongside nulls, categorical data with a single category, references to "ids" that are kind of discrete/categorical, but with far too many unique values for effective plotting. I just want to load up the entire thing and play around with it in a generalized tool, where I don't need to classify each column ahead of time, and without worrying I'll select a column with "bad" data.

For example, you may want to say that a bar chart's y axis will always be mapped to Continuous or Discrete data, but not Categorical (to aggregate values). Or, you may support grouping in a Box Plot by any column that has categorical data, but only if the number of categories isn't beyond some limit. These correspond to the respective configurations:

# bar chart - y selector of only continuous/discrete columns from ColumnDataSource
y = Either(ContinuousColumn, DiscreteColumn)

# box plot - group selector only contains columns that are categorical and have at most 10 categories
group = CategoricalColumn(max_cats=10)

This initial hierarchy is shown below: [image: column type hierarchy]
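
A rough sketch of the kind of inference described above; the function name, heuristics, and threshold are purely illustrative:

```python
def infer_column_type(values, max_discrete=10):
    """Classify a column as 'continuous', 'discrete', or 'categorical'.

    Heuristics only: numbers stored as strings are coerced, None values
    are skipped, and a numeric column with few unique values is treated
    as discrete. The `max_discrete` threshold is an illustrative
    assumption, not a recommendation.
    """
    cleaned = []
    for v in values:
        if v is None:
            continue  # tolerate nulls instead of failing
        if isinstance(v, str):
            try:
                v = float(v)  # handle numbers stored as strings
            except ValueError:
                return 'categorical'  # any non-numeric string decides it
        cleaned.append(v)
    if not cleaned:
        return 'categorical'
    if len(set(cleaned)) <= max_discrete:
        return 'discrete'
    return 'continuous'
```

A widget could run this once per column on initialization, then populate each selector only with columns whose inferred type the chart declares it accepts.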

fpliger commented 9 years ago

I like the way that hasprops objects declare the types via class properties, which is similar to Django or other ORMs.

Yes, indeed. That's something we have discussed recently; @bryevdv has already added some of this in the recent PRs merged to charts. There are many benefits to adding the HasProps dependency, both on the Chart and the Builder(s) layers.

My question is, would this more explicit requirements specification better exist as part of some wrapper around the Builder, or be added to the Builder?

You mean the input pre-processing? If I got your question right, I think (at least for now) this should live in the builder as part of its internals, because builders are pretty much self-contained now. That said... I feel I'm not getting all your intentions. So please correct me if you mean something more than just preprocessing the builder's data input.

In my use case, we have many tables that might have 50+ columns, with varying mixes of numbers stored as strings, alongside nulls, categorical data with a single category, references to "ids" that are kind of discrete/categorical, but with far too many unique values for effective plotting. I just want to load up the entire thing and play around with it in a generalized tool, where I don't need to classify each column ahead of time, and without worrying I'll select a column with "bad" data.

Wow, that's a very illustrative use case and a great goal, actually.

Recently I've been thinking about what we should add or change in the charts interface to build these kinds of things. I think the recent changes gave us many advantages in this direction. The next steps I see are:

I'm very excited about this conversation and feel like it's the right direction. Would love to hear other @bokeh/dev opinions.

pzwang commented 9 years ago

@rothnic Spyre looks like a neat little project, and I don't think I've seen it before. The idea of a declarative data structure for expressing the possible datasets over which a visualization/dashboard can be built is squarely in the domain of what Vega is trying to solve. My general feeling on that side of things is to be very cognizant of the goals of Vega (and Lyra), and interface with them if/when appropriate. But Bokeh's goals are definitely a bit different. I'm glad to hear that you appreciate the ORM-like model for building reactive objects which comprise the plot. This scenegraph approach is definitely a key differentiator for Bokeh, although it's a little geeky and low-level.

In terms of type inferencing on columns, I'd like to ping @cpcloud, @jreback, and @jayvius to see if they have any input based on their pandas and IOPro experience.

Regarding the column type hierarchy (nice graphic btw), I think that it would be good to keep in mind the four Stevens measurement types: nominal, ordinal, interval, ratio. We don't have to name them exactly that, but we should make sure that e.g. the definition of DiscreteColumn directly maps to the definition of a Nominal measure. Also, from a measure theory perspective, Categorical measures generally tend to be nominal or ordinal, but this changes dramatically when those categories represent group-by or binnings of underlying real number scales. In those cases, the categories (e.g. month of the year) can actually map to an Interval measure, and in the case of binned timestamps, one can accurately compute Ratios.

Just to clarify: My point in bringing this up is not to overthink or needlessly geek out on the problem, but rather to make sure that we are all aware of the broader data/info visualization context that should inform a taxonomic approach to structuring data types.
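
For reference, the four Stevens levels could be sketched as a small taxonomy next to the proposed column hierarchy. The mapping below is illustrative only (and, as noted above, binned or grouped categories can be promoted to interval measures):

```python
from enum import Enum

class Measure(Enum):
    """Stevens' four levels of measurement."""
    NOMINAL = 1   # unordered categories (e.g. car origin)
    ORDINAL = 2   # ordered categories (e.g. small/medium/large)
    INTERVAL = 3  # ordered, equal spacing, no true zero (e.g. month bins)
    RATIO = 4     # true zero, ratios meaningful (e.g. horsepower)

# Illustrative mapping from the proposed column kinds to measures;
# the column class names follow the hierarchy discussed above.
COLUMN_MEASURE = {
    'CategoricalColumn': Measure.NOMINAL,
    'DiscreteColumn': Measure.NOMINAL,
    'TimeColumn': Measure.INTERVAL,
    'ContinuousColumn': Measure.RATIO,
}
```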

aschneiderman commented 9 years ago

@bryevdv When you gear up after 0.8's release to work on dashboards, you might want to check out the new version of Microsoft's dashboard/data visualization tool. What it can do very quickly with Salesforce data, for example, is pretty impressive.

It might be a useful source of inspiration -- and it's got a good chance of becoming one of the tools your work will get compared against, given that Microsoft has finally decided to really push it, using a more aggressive pricing strategy.

rothnic commented 9 years ago

@pzwang thanks for the references. Yeah, I'd like to reuse others' work as much as possible.

The impression I got from Vega was that it is more in line with part of what Bokeh is doing (serializing chart objects) than with tackling the dashboard problem. It does have a basic data model and a pretty extensive data transform language defined, which should be useful. Lyra seems focused on providing an interactive layer for linking data to aspects of the visualization, which would be useful for (non-programmatically) generating the operations the builder performs.

What I was looking at was how the builder could describe its requirements so that another tool could automatically generate basic UI controls for it. A problem unique to dashboards is that certain configurations of columns can put you into "bad" states that you must handle gracefully. If you handle this internal to the app, it can become a spaghetti of conditional statements. However, if we more explicitly define which fields should or shouldn't map to an "aesthetic" (a la ggplot aes) of a plot, then the interactive app becomes much simpler.

I would mention that this capability mostly makes sense for columnar data sources and plots that edge towards the more general side. Then, once that works for one plot, I can set up a grid of plots that share controls or filtering on a shared data source.

rothnic commented 9 years ago

@aschneiderman thanks, I'd never heard of that. It does seem to follow a similar approach to Tableau, which is a big source of inspiration for me. Microsoft does have an interesting summary view that you can drill down from into a more detailed version.

One of the things that both have is a "get data" function. This is a general dashboard component that I'd like to work on at some point if I can get more time. Quickly connecting to data whether it is csv, hdf5, a sql server, or maybe a blaze server would be nice to be able to utilize in any custom app. Blaze would provide a nice abstraction layer to getting at all of those things.

damianavila commented 9 years ago

Quickly connecting to data whether it is csv, hdf5, a sql server, or maybe a blaze server would be nice to be able to utilize in any custom app. Blaze would provide a nice abstraction layer to getting at all of those things.

@rothnic, regarding Blaze, I don't know if you are aware that yesterday we merged #1713 (see #1635 too) :wink:

rothnic commented 9 years ago

@damianavila thanks, I saw it was in the works, but didn't know it had recently been merged in.

aschneiderman commented 9 years ago

@rothnic that would be really nice.

If you end up building it, you might want to consider setting it up so that you have the option of either charting the data as is or doing a simple aggregation of it, essentially creating a pivot chart. If you're already using Blaze that should be pretty straightforward, and it would greatly expand the component's power without adding much complexity for the user.

Re: Power BI: the reason you and most people haven't heard of it is that until recently, Microsoft's business model for it was nuts. You could only use it through SharePoint or some other Microsoft products, it was expensive, and the licensing was complex. Even Microsoft shops like mine stayed away from it. Apparently somebody woke up and realized this was not a winning strategy. Now that they seem to have decided they're competing with Tableau, you'll be hearing a lot more about it in the future -- especially since apparently they're going to seamlessly integrate R into it (part of why they acquired Revolution Analytics). The reason I suggested checking it out is that in addition to what it does well now, in the next year or two I think they're going to be focusing on doing what Tableau does, only easier, so there may be some UI ideas/tricks worth borrowing. Personally I'm a little bummed about it, because I'm afraid it's going to make it considerably harder to convince nonprofits to try an open source alternative like Bokeh -- another reason why I'm rooting for the dashboard ideas in this issue thread to succeed.

pzwang commented 9 years ago

@rothnic I absolutely agree that it's much better to take the Grammar of Graphics approach of describing "what" one wants to see rather than "how" one wants to see it. Some of the research work from Chris Stolte & Pat Hanrahan before and during the early years of Tableau provide some conceptual groundwork here: look for their papers on Polaris and VizQL. These things formed some of the foundation for the "ShowMe" capability in Tableau, which I think was more of a focus before Tableau decided it should just eat all the low-hanging fruit that Excel was leaving lying around.

We might also want to take this discussion to email, where quoting is easier and it's more of a forum for open-ended discussion.

rothnic commented 9 years ago

@fpliger one of the things more specific to the difference between the builder and the crossfilter plugins is that it looks like you want any grouping performed ahead of time with the builder. It looks like it isn't required to be that way, but it has been the approach.

For example, if you take the autompg data, you could look at it as a DataFrame, or a dict of lists of values, where the columns are the keys. If we wanted a boxplot of horsepower as the values, with a box for each cylinder count, then we'd need to group the data before it goes to the builder. You'd want to see labeled data, whether a df, dict, etc., that has keys directly into the data to plot. Where do you think an interface for working with the non-grouped data would fit? Built into the data adapter as an option, at a higher level wrapping the builder, or somewhere else?

I think a data agnostic grouping, possibly via the data adapter, would reduce the up front work that has to be done.

fpliger commented 9 years ago

@rothnic we seem to be on the same wave of thoughts frequently :-)

This weekend I've started collecting ideas about the status of Charts, this discussion, the #1841 discussion, and recent discussions I had with @bryevdv while working on the builders. I think the recent big change to charts is a huge step in the right direction, but as you've mentioned there are still several tweaks and improvements needed.

To be more specific, we recently discussed the issue you brought up regarding charts' eager data computation/grouping and data source creation, and an idea from @bryevdv regarding a way to make charts build named data sources that could make sharing sources between charts a lot easier.

I find your idea very interesting, but I'm afraid I'm missing a few pieces (or maybe it needs some thought to make it match the charts design) and need some time to elaborate... Finding a good (and possibly simple) way to be data agnostic, accept "any data", and group/consume that data on demand is an awesome idea. This is basically the foundation for building higher-level, rich, extensible tools that are expressive enough to allow users to say what they need instead of how (which I feel is a shared view here).

Maybe a design that allows us to have shared named data adapters with the ability to:

combined with shared named sources, would allow us to just accept a bag of data from the user, who can then tell the Adapters/Charts what they need on specific subsets of the data provided.

Thanks for the food for thought (that's a great breakfast for a Monday morning :wink: ).

rothnic commented 9 years ago

@fpliger the thing that makes this challenging is supporting multiple data formats. While ggplot has a strange interface, the thing that simplifies its charting interface for quick plotting is that it chooses one data input format. I think here we'd prefer not to marry ourselves to the Python equivalent, the pandas DataFrame, but still support it. This eventually led me down the path of looking at Blaze, or extending ColumnDataSource with grouping functionality. Yesterday I went looking into data-agnostic grouping and posted this to Blaze. You'll see Matt's response.

I looked at using toolz/cytoolz groupby for generic grouping. I then wondered if Blaze would be able to do that grouping for us. You give the chart a dict, DataFrame, or database, and you'd group and reduce the data to the format needed for plotting. As Matt points out, because you don't know the size of the data set, you have to do the grouping and reduction in one step. With the boxplot I see three different layers of formatting:

  1. Standard column data source, rows are records, columns are dimensions
  2. Indexed (grouped), where you have direct access to the different things to plot
  3. Reduced, where you have the iqr, max, min, mean, etc for each group.

Right now, the input is layer 2, which is ok to support. I just think most people using a charting library at this point would also like it to support converting from 1 to 2 as well. What Matt gets at is that if the data is very large, you really need to go directly from 1 to 3. I suppose you may not want to marry yourself to Blaze either, but offering it as an option would make things both data-source and size agnostic.

If we don't want to utilize Blaze or pandas grouping, toolz/cytoolz groupby looks like it would do the job. You'll see in the post how I get from a DataFrame to a list of dicts via pandas' to_dict(orient='records').
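
A data-agnostic sketch of going straight from layer 1 (records) to layer 3 (reduced box statistics) in one pass, using only the standard library in place of toolz or Blaze (the function and field names are illustrative; `statistics.quantiles` requires Python 3.8+):

```python
from collections import defaultdict
import statistics

def box_stats_by(records, values, by):
    """Group a list of record dicts (layer 1) and reduce each group to
    box-plot statistics (layer 3), without exposing the intermediate
    grouped form (layer 2) as a separate user-facing step."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[by]].append(rec[values])
    stats = {}
    for key, vals in groups.items():
        q1, median, q3 = statistics.quantiles(vals, n=4)
        stats[key] = {
            'min': min(vals), 'max': max(vals),
            'q1': q1, 'median': median, 'q3': q3,
            'iqr': q3 - q1,
        }
    return stats

# Layer 1: rows are records, keys are dimensions (autompg-style sample).
records = [
    {'cyl': 4, 'hp': 95}, {'cyl': 4, 'hp': 97}, {'cyl': 4, 'hp': 85},
    {'cyl': 8, 'hp': 130}, {'cyl': 8, 'hp': 165}, {'cyl': 8, 'hp': 150},
]
stats = box_stats_by(records, values='hp', by='cyl')
```

For truly large data, the same group-and-reduce step would be pushed down to Blaze or the database instead of materializing all values per group in memory.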

pzwang commented 9 years ago

@aschneiderman PowerBI, PowerMap, and PowerPivot are very powerful and nice tools. A year ago I was teaching some classes to researchers on how to use Azure and I demoed these things, and was surprised how few people even knew of their existence.

You're right that PowerBI is Microsoft's "Tableau-killer", but it is very Azure-centric (which makes total sense, since Azure is Microsoft's new platform strategy).

Personally I'm a little bummed about it, because I'm afraid it's going to make it considerably harder to convince nonprofits to try an open source alternative like Bokeh -- another reason why I'm rooting for the dashboard ideas in this issue thread to succeed.

I appreciate that you appreciate the value and necessity of open source tooling here. But I think we should never be bummed to see the bar getting set higher - it helps wipe out legacy technologies in rent-seeking mode, thereby opening doors for alternative innovation (both paid and open). My particular hope is that Bokeh gets to a point of maturity where people can contribute and use it as a base to tackle visualization problems in various domains. Given how much data is going to be drowning the world, I don't think we're going to have a lack of opportunity, nor is the world in danger of having too many accessible and insightful exploratory viz tools. ;-)

pzwang commented 9 years ago

@rothnic I'm glad you ended up looking at Blaze when considering the "multiple formats" problem. This is one of the original motivations for it. I wanted to go beyond merely the dataframe that ggplot uses, and the simple "list of tuples"+custom JavaScript that d3 uses. Instead, the hope was to have a more flexible structured data descriptor that could serve the purposes of distributed and out-of-core computing as well as specifying novel visualization.

In the concrete terms of your boxplot example: you're exactly right that with larger data, #3 is a query or transformation on #1 that needs to be handled by something outside the visualization problem, because it's needed by more things than viz. However, you want the visualization system to know that it's looking at a dataset that is the result of some aggregation/transformation, because then various tools can be smarter about how they present information, drill down, etc. So, it's very nice to have a Blaze expression graph that Bokeh can reason about.

@fpliger For context, I think it's entirely appropriate and wonderful for Bokeh to rely on Blaze for communicating about metadata, datashape, and the like. In the short term, we also need to provide some mechanisms so people without Blaze installed can still use the library. However, longer term, we will want to make sure that everyone can easily get Bokeh and Blaze, and if we have to trade off a little adoption friction for more power, we can bias for the latter.

rothnic commented 9 years ago

I did some work towards breaking the faceting part out of crossfilter into what might need to be refactored into a builder. I spent some time looking at the grammar of graphics to try to get the faceting language correct. I think I still need to fix it up some, but wanted to get things working to start. Two files to take a look at:

1. facets.py
2. test_facets.py

You'll see I played around with using the grammar of graphics notation for multiplying/dividing facets, but I'm not sure that would make sense as an easy-to-use method. Instead, I have a FacetGroup base class that represents some generic layout for x and/or y faceting. FacetGrid extends that concept to place the plots into a grid. Right now it is a little verbose, but I think we should have something like this:

scatter_builder = ScatterBuilder(x='mpg', y='displ')
show(FacetGrid(data, x='cyl', y='yr', builder=scatter_builder))

This creates a builder configured to plot the data the same way each time, but we leave it to the FacetGrid to handle enumerating the facets, filtering the data, generating each plot and placing it into the grid.
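
A rough sketch of that enumerate/filter/build loop, independent of the real Builder API (here `facet_grid` is a hypothetical helper and `builder` is any callable taking a filtered columnar dict, standing in for a configured chart builder):

```python
def facet_grid(data, x, y, builder):
    """Enumerate facet levels, filter the columnar data for each (x, y)
    combination, and call the builder on each subset.

    `data` is a dict of column name -> list. Returns a row-major grid
    of the builder's outputs, rows for `y` levels, columns for `x` levels.
    """
    x_levels = sorted(set(data[x]))
    y_levels = sorted(set(data[y]))
    n = len(next(iter(data.values())))

    def subset(xv, yv):
        # Filter every column to the rows matching this facet cell.
        rows = [i for i in range(n) if data[x][i] == xv and data[y][i] == yv]
        return {col: [vals[i] for i in rows] for col, vals in data.items()}

    return [[builder(subset(xv, yv)) for xv in x_levels] for yv in y_levels]

# Tiny illustrative data; the "builder" just extracts the mpg column.
data = {'cyl': [4, 4, 8, 8], 'yr': [70, 71, 70, 71], 'mpg': [30, 31, 14, 15]}
grid = facet_grid(data, x='cyl', y='yr', builder=lambda d: d['mpg'])
```

The builder stays configured once and is applied per cell, which matches the idea of passing a pre-configured (or partial) builder into the grid.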

fpliger commented 9 years ago

Very cool!! Very happy you are working on this. This week I've started putting down some ideas in this direction and hopefully will continue next week.

I'm not sure about the syntax, especially having the x, y on both the Builder and the FacetGrid. Also, is the ScatterBuilder you are using here the same as we have in the current charts implementation (so it's actually returning a Chart instance), or is it some different implementation (so it actually behaves more like scatter_builder = partial(ScatterBuilder, x='mpg', y='displ'))?

rothnic commented 9 years ago

@fpliger I was just writing pseudocode, so likely not consistent with where you are currently at with charts.

The reason there is an x and y on both (which could be named differently to avoid confusion) is that the plot itself will use an x and y for mapping to the aesthetics of the plot. For scatter, it is literally x and y, but for a bar chart it would likely be different. However, faceting also uses an x and/or y, which typically differ from the plot's x and y. If you look at ggplot, you might see something like this as an equivalent of what I outlined:

p <- ggplot(mtcars, aes(mpg, disp)) + geom_point()
p + facet_grid(cyl ~ yr)

I'm glad you reminded me about using partial, since I think that would be a good approach. Maybe this would allow the facet grid to grab the full dataset from the builder, then pass in the filtered data on each call to the partial function. I'm not sure whether it would be better to give the full dataset to the facet grid or to the builder.

The example I created was much simpler, so I need to try using a builder. If you look at test_facets.py (I linked to an old version, sorry), you'll see that I'm just creating a function that returns a figure with scatter called on it.

rothnic commented 9 years ago

One other thing is that having x and y in the FacetGroup is probably too specific. For example, if I created a facet tab, then it would really just be one-dimensional, where the dimension is the tabs.

With crossfilter, you can combine a facet tab with a facet grid, so that you facet by one dimension across tabs of plots, then facet by x and/or y within each tab. This gives you three dimensions in which to place plots, if that makes sense, on top of the two dimensions used for actually scattering the points. So I'll think through how to compose the faceting operations.

"Facets are embeddings. A facet of a facet specifies a frame embedded within a frame. A facet of a facet of a facet specifies a frame embedded within a frame embedded within a frame. Each frame is a bounded set which is assigned to its own coordinate system ... a four-dimensional graph can be realized in four 1D frames, a 1D frame embedded in a 2D frame embedded in a 1D frame, two 2D frames, a 3D frame embedded in a 1D frame, and so on."

  • Wilkinson, L. The Grammar of Graphics
fpliger commented 9 years ago

@rothnic Thanks for clarifying. I agree we need to be careful about how we manage levels. It's also important to define how to share the data between the original charts/source and the created facet frames, in order to enable nice interactivity. My plan is to start working on some basic concepts of this shared_source on charts this week.

birdsarah commented 9 years ago

Discussion is now more than 3 months old. Can re-open if renewed interest.