DataBrewery / cubes

[NOT MAINTAINED] Light-weight Python OLAP framework for multi-dimensional data analysis
http://cubes.databrewery.org
Other
1.49k stars 312 forks source link

Provide drill across #245

Open grugnog opened 9 years ago

grugnog commented 9 years ago

Right now to drill across, users need to be implement this logic as part of the analytics/reporting system, which is fine - but can be quite a bit of work in practice. It would be really convenient if there was a function that could assist with this.

In terms of architecture, it seems like this could perhaps be implemented as an additional layer on or beside the backend that could perform multiple queries and aggregate the results. This would make reuse easier (compared with having cubes viewer make multiple queries directly, for example).

Rough sketch of how this might work: /cubes/<cube1>+<cube2>+.../<aggr-options>/<query>

The would potentially be any query that is accepted by the "cube" endpoint. The logic would probably first need to parse the query and validate that each cube has each of the dimensions used for that query (it would be nice if cubes didn't need to have identical dimensions, as in #169).

Then the backend would perform the queries on each cube and aggregate the results. Assuming there are more than one way to aggregate the results (beyond just extra columns for each cube) having argument might be useful to allow the user to tweak that.

I can see more complex stuff like cross-tables that wouldn't fit the above pattern, but am not sure if it makes sense to try and include them (and if it does, if they could sensibly use this format without making it more complex for simple stuff).

Stiivi commented 9 years ago

This feature can be indeed added to the SQL query generator, however it has to be refactored first. See reasons in this post.

As for the API, my suggestion would be just to extend the original interface with one more argument: across. The + in list of cubes is interesting idea, but what would server say on /cube/foo+bar+baz/model instead of .../aggregate?

My suggestion would be: /cube/foo/aggregate?across=bar,baz&drilldown=something and in Python: browser.aggregate(foo, drilldown=something, across=[bar,baz]).

How does that interface sound?

As with joining, I'm not quite sure yet how difficult it would be to implement, but I'll just throw few pointers and notes here for future reference:

Stiivi commented 9 years ago

Few notes based on current work at branch sqlrefactor:

grugnog commented 9 years ago

Thanks for looking into this. I think the interface sounds OK.

One question I have is if there is some semantic or functional meaning in having a "primary" vs "secondary" cube in a drill-across query? The cube (in the path element) feels more "primary" relative to the ones in the "across" argument, and it feels like it might suggest that it is treating them somehow differently. So if there is semantic or functional meaning between these, I think that distinction is a benefit - if not, then I think it complicates the mental model a bit, since it suggests there might be.

Also, to clarify - my suggestion was to use a route "/cubes" (plural) that only handles drill across queries, and perhaps other operations involving >1 cubes. This would be in addition to the "/cube" endpoint. Hence I don't think the /cube/foo+bar+baz/model issue would come up, since that would only be an available operation on /cube/*/model.

Stiivi commented 9 years ago

As for "primary" vs "secondary" – I'm just following current interface which might also impose some limitations on the "secondary" cubes. This distinction is definitely not a benefit and as you say, it complicates the mental model a bit. I'm not in favor for those limitations, but I take it as an intermediate step.

I was thinking briefly what would have to be changed to accommodate queries across multiple cubes. The current AggregationBrowser (in browser.py and its subclasses) is based on a premise that there will always be only one cube to be queried at a time. It holds a reference to a localized cube being browsed, cube's mapper and a reference cube's store. Based on this assumption, the Cell has also reference to a cube. There is quite a lot of historical tight coupling in there that has to be untangled.

Browser

As for Browser, we can either add cubes instead of cube and have the browser tied to the multi-cube schema. Another option would be to use browser as a pure functional/controller, tied only to a store, that will receive cube(s) as an argument to every query. I have no opinion on either of the two right now. If you do, please, let me know.

We need to have an object which knows about all joins and tables. Currently called StarSchema on the sqlrefactor branch provides that functionality (uncoupled from Cubes model complexity). This object can be easily changed to accommodate multiple cubes. But then we would need to create (lazily) one star schema object per cube combination.

Cell

As for Cell – there are some methods that were meant to be of "explorative" nature, such as drill-down. which are using the Cell-linked cube. We can either:

A. make Cell dumb and use only string based references, move the explorative functions to some other utility object keeping a state (†). No cube reference in a cell. Cell would be "portable" to any other cube with the same dimensions. B. have Cell include multiple cubes. Keep the explorative functions. C. keep the cell single-cube. No changes needed, but introduces another mental model complexity.

(A) feels lighter, but will require additional methods for checking validity of the dimensions in the cell.

Model

Another radical option would be to bring back concept of a "model" as a closed set of physically related cubes. There would be one Schema object with all tables and joins pre-analysed that would generate underlying star JOIN compositions. Every Cube will belong to a Model and Model will be tied to a `Store. Browser, regardless of design decision from the options above, would use the store to provide the appropriate schema composition for given query.

This might reduce some complexity, but require more work as it will affect the Workspace as well.

Summary

We have a design chain reaction here, for historical reasons.

Besides SQL backend query generator lacking the ability to include multiple fact tables, there is a bit more in the library that was based on single-cube query assumption. This definitely needs to be redesigned.

I will keep this in mind during this refactoring. I don't think that any of the above will happen in this round, if I want to deliver more understandable query generator, but it has to happen soon or later. For the time being we will have to stick with primary/secondary cubes in the drill-across.

Anyway, any input – thoughts, pics, questions – on this matter is appreciated.


(†) Off-topic note: Here I kind of regret calling Browser a Browser as it is in current implementation. It would be nice to have Browser as explorative class holding together cube and currently viewed cell. Useable from within environments such as iPython Notebook. The aggregation/query providing class would have a different name and functionality of current AggregationBrowser.

Stiivi commented 9 years ago

@grugnog (forgot to respond) I see your point with /cubes/foo+bar+baz. Agreed.

dtheodor commented 9 years ago

Imo, this project is young enough that you should push class name and API specification changes sooner (in version 2) rather than later (= closer to never).

Stiivi commented 9 years ago

@dtheodor good point.

How to change the concept of AggregationBrowser though?

Suggestion A

The usage would be:

cube = workspace.cube("sales")
# or
cube = provider.cube("sales")

result = workspace.aggregate(cube.aggregate["value"])

# or as a string `cube.aggregate`
result = workspace.aggregate(["sales.value"])

# or from a store:
result = store.aggregate(cube.aggregate["value"])

Workspace can work with string references, since it can resolve them into namespaced cubes. Store would require Aggregate objects from a cube.

All attribute objects would have to be back-linked to their owners (model objects) such as cubes or dimensions.

HTTP API

With this change the HTTP API will not have to be changed, cube-oriented API will stay as: GET /cube/sales/aggregate and the aggregate oriented API (implicit drill-across from cubes within the same store) would be: GET /aggregate?aggregates=sales.agg1,sales.agg2

To list all aggergates: GET /aggregates which will make aggregates to appear as top-level objects to the front-ends.

Therefore instead of GET /cube/cube1+cube2/aggregate I would go with GET /aggregate.

Suggestion B

The first suggestion has several issues. The main is, that we can't have multiple models using the same store (database). Therefore we need to have an object that will contain the original model together:

Side effects:

Not going to evolve this part deeper here, more should be discussed in #256.

Cube can be requested in a similar way as in the suggestion A. Aggregation done through workspace or browser – similar principle as in A.

HTTP API would be the same as in A.

Conclusion

Regardless of implementation, the best way to implement drill-across is to hide concept of cubes completely in the aggregation call and use just aggregate list. Cubes would be resolved in the aggregating objects and the aggregation would be computed or rejected based on the possibility of the given combination of aggregates and dimensions (same store, shared dimensions, other rules ...).

Thoughts?

Stiivi commented 7 years ago

See also #273