[SIP-94] Support star schemas

wernerdaehn commented 3 years ago

[SIP] Support star schemas

Motivation

While I am absolutely fine with the decision to base a diagram on a single table/view only, most databases cannot cope with that. A typical view would join 100 tables and have 5000 columns. Take sales orders as example: For a proper self service BI you would join order header, order line item, three times the customer mater table for soldto-shipto-billto, the material master table, tons of tiny lookup tables with e.g. statusid, statustext etc. If the joined table has a foreign key defined and the column is not-null, then the database optimizer can ignore that join in case none of the columns are used. Only a few databases are that advanced. Further more, for many visual clues like slice/dice and filters a distinct list of attributes is needed. What is faster? Finding the three order states in the state text table or a select on the text table itself?

Hence I would suggest to support star schema data models. A chart is based on a fact table. The list of columns shown are all columns of the fact table plus all columns of all tables this fact table has a FK relationship with.

Proposed Change

In addition to tables there are virtual objects FACT_TABLE and DIMENSION_TABLE. These simply point to physical tables. Note: A fact table in one data model can be a dimension table for another. Example: ORDER is a fact table with the measure ORDER_AMOUNT but is a referenced dimension table for the BILLINGS fact.
There is a relationship between facts and dimensions. Only trivial PK based equal conditions are allowed.
Charts can use tables or FACT_TABLEs.

zhaoyongjie commented 3 years ago

@wernerdaehn Thank you for your comments. I think stars or snowflake model is OLAP Dataware house concept, for more modern BI prefer to use single-table model, like PBI/Tableau.

What do you think?

wernerdaehn commented 3 years ago

@zhaoyongjie From a technical side, if all databases would support views with a virtual unlimited number of columns and do a perfect job from the SQL optimizer point of view, then yes. But the opposite is the case. Only Oracle And SAP Hana (Calculation Views) are even close to that and far from perfect. From a logical point of view, Superset wants to sum up measures and group by attributes. That is a dimensional concept. Further more, if the tool knows what a fact table is and what a dimension, we have much more options later. For example you might have an ORDER and DELIVERY table, both have a shared dimension CUSTOMER. In a single-table model the customer table with its 500 columns must be defined multiple times whereas with a shared dimension just once. And in the dashboard, when you want to view the order amount, shipped amount and order backlog, a shared dimension allows to filter all measures at once. You know what is shared and how. Also for the users the selection of columns is much easier when they are grouped. The customer master dimension obviously provides all columns that are customer related, the material all material related fields. In a single table model you would have 300 columns from the one and 300 columns from the other. And if customer can be filtered on sold-to, ship-to and bill-to basis, you would have the 300 customer fields three times. And finally, while today the semantic layer of Superset is very thin, it will grow. And then specifying the same information multiple times get even more of a burden.

So no, this has nothing to do with OLAP, it is related to being a business intelligence tool.

I see that as something fairly straight forward to implement and with lots of potential initially and later.

villebro commented 3 years ago

Thanks for the comprehensive explanation @wernerdaehn ! Ping @amitmiran137 , here's a good motivation for adding support for join semantics.

wernerdaehn commented 3 years ago

@villebro A generic join feature I would advise against. Things like a theta join, snowflake-like joins with intermediate tables etc. These can all be built in the database using views. So why implementing these things a second time? I would really support only the most trivial case with a single fact in the middle and directly linked dimension tables with PKs. This gives you the biggest bang for your bucks.

As with the other proposals, let me know if I should create a comprehensive write-up. I would be willing to invest the time, but only if it is worth it.

villebro commented 3 years ago

@wernerdaehn I was speaking in general terms, i.e. supporting joining tables beyond the current table/virtual table functionality.

One thing that sticks out: if we have 100 FKs and thousands of columns, this can quickly become very burdensome, both for the application but even for the end user. So this needs to be properly scoped to make sure it improves the user experience rather than overwhelm it.

wernerdaehn commented 3 years ago

Understood. My fear is just that when you implement that, you start with the SQL join clause in mind. I have seen multiple products fail because the join clause is extremely powerful and frankly, not suited for graphical tools. Use the SQL Editor of Superset for those. Anyway, you got my point, that is the important part.

Regarding the second paragraph, while your statement is certainly correct, it is not related to the star schema feature as such. If all needs to be in a single table, then this single table has equally many columns and must not be overwhelming as well.

Please let me know if I can do any work in that regards, in case it makes sense.

ducchetrongminh commented 3 years ago

I would love to see this feature go live, not sure if there is any progress. The description above is good enough, I just want to share my case. Currently, I am using dbt for transformation, and I am following the star schema practice. Then I will visualize on Metabase. As metabase supports star schema model, this approach works well. If I go with Superset, I will need to create 1 flat table. Some issues with the flat table are:

storage: the dimension data will be duplicated and redundancy
no benefit from incremental processing: when there is change in dimension table, I will need to refresh the whole flat table. If we support star schema, the fact table can remain unchanged
slow on filter values: it is faster to fetch 30k customer names compared to 100mil customer names (but only 30k distinct name)

djouallah commented 2 years ago

@ducchetrongminh you can build a virtual dateset based on a view which join the fact to the dimension tables ?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

jomach commented 2 years ago

I think this feature is very important. For example Microsoft Power BI can do this and metabase can do this too. Example how it works on Metabase:

IMHO This feature is a must

zhaoyongjie commented 2 years ago

@jomach Thanks for the reply. The key point that distinguishes Superset from PBI/Tableau is SQL modeling. The Superset is dependent on SQLLab manually writing SQL to achieve the snowflake or star model.

djouallah commented 2 years ago

@zhaoyongjie that's fair point, but it is very hard to model multiple fact tables with different grains using just one flat table ?

zhaoyongjie commented 2 years ago

Hi @djouallah, the star modeling seems like 1 fact table and some dimension tables. Do you have more use cases for multiple fact tables?

djouallah commented 2 years ago

@zhaoyongjie thanks for Asking, all my workload since I started this BI thing was always multiple fact tables.

wernerdaehn commented 2 years ago

Regarding the multiple fact tables: I would ask to support a single fact table with multiple dimensions only. First because this is significant improvement over the current situation already and second multiple fact tables is the same as a single fact table if a view is used.

Imagine this example: Fact CO contains the cost center data and has date, costcenter, costelement, period, year as dimensions. Fact P contains the plan data per year, costcenter group plus a plan version as dimensions. If you allow the user to simply join those two tables, result will be a disaster. The same plan revenue is duplicated n times because CO is more fine grained. And CO data is duplicated because of the multiple plan versions. And if the user drills down to a single date, no plan data can be shown as planning does not happen for each day, same for the deepest level of a cost center or as soon as a filter on cost element is applied.

So what you must do instead is aggregate the CO data to year and cost center group level, pick from the Fact P the most recent plan version and show just that data. Now both intermediate data sets have a 1:1 relationship and can be joined. This is a complex operation, best implemented as view. Then you would have a third Fact CO_vs_Plan, the created view, that takes care of the correctness of the comparison.

Yes, supporting such out of the box would be the next evolution of this enhancement request but then you are at a full blown semantic layer, something superset tries to stay away from if I am reading correctly between the lines.

jomach commented 2 years ago

Hey Guys, multiple fact tables is not a big issues imho. As people can work around it. Important would be that we can slice and dice data as the user needs. Like fact table And multiple dimentions. (The star schema)

I just build a datawarehouse and I end up using powerbi because there you can define the join dependencies. Imho this is crucial as feature. Otherwise if you datalake with multiple TB and dimensions with a couple of GB, you would have to join that into a single table which is huge.

Thanks

djouallah commented 2 years ago

Star schema is not a showstopper, it can be replaced by a flat, it is a pain if your dimensions changes a lot but there is a solution, multi fact is the real problem, because there is no alternative, Look at the long term solution not a temporary fix.
see Tableau example, they implemented Start Schema, then what happen users keep asking for multi fact.

jomach commented 2 years ago

Star schema is not a showstopper, it can be replaced by a flat, it is a pain if your dimensions changes a lot but there is a solution, multi fact is the real problem, because there is no alternative, Look at the long term solution not a temporary fix.

see Tableau example, they implemented Start Schema, then what happen users keep asking for multi fact.

Do you have a link for that Tableu example ?

Can you explain how you solve ghe star to flat without views ?

djouallah commented 2 years ago

https://help.tableau.com/current/online/en-us/datasource_datamodel.htm

of course, you need a view or a materialized table what I meant is, there is a workaround, but for multi fact, there are none !!!, unless you support nested table which is a pain !!! or merin multiple fact with different grains which is not good IMO

wernerdaehn commented 2 years ago

@djouallah Interesting, I am saying the exact opposite. Let's figure out: You have a fact table for 1m rows, 20 foreign key columns. Each of those 20 dimensions has 10 rows and 100 columns. For example the Material table has material name, material group, material price, material cost, material vendor,...

Combining that into one table means adding 20 times 100 columns times 1m rows each. That is 2bn values, instead of 2010100 = 20k values.

If your propose to store all in a physical table, what was a large table before becomes enormous. If you join that in a view, it is an extreme overhead.

The only option would be a view with join pruning, meaning the SQL optimizer removes joins from the view that are not used by any selected column. This is possible but only under a very strict set of conditions often not met.

The other aspect is the UI. What do you prefer? Selecting the columns from a list with 2000 column names or having a tree where you first select the type, e.g. Material, and then select the material related properties?

So no, Star schema is the simpler thing to do and the by far more important. Combining multiple facts is harder and can be done with views as a workaround. It would be nice if superset supports both but let's be reasonable and start with the by far more important one.

Agree?

djouallah commented 2 years ago

@wernerdaehn I am not disagreeing with you, of course, simple star is better, all what I am saying, people will ask for Galaxy schema from day 1.

wernerdaehn commented 2 years ago

but for multi fact, there are none !!!, unless you support nested table which is a pain !!! or merin multiple fact with different grains which is not good IMO

Why is there no workaround for multiple facts? It is not that difficult. And all that makes that SQL complicated are logical decisions, which the UI must provide as switches as well. So it won't be that much easier.

create view actual_vs_plan as
select year, cost_center_grp, sum(actual), sum(plan) from (
   select year, cost_center_grp, actual, null as plan from fact_co
  union all
  select year, cost_center_grp, null as actual, plan from fact_p where version = 1
)
group by year, cost_center_grp;

djouallah commented 2 years ago

Amit from thoughtspot has a great blog, and he explain better what I am trying to say https://prakasha.substack.com/p/the-metrics-layer-has-growing-up?s=r

jomach commented 2 years ago

but for multi fact, there are none !!!, unless you support nested table which is a pain !!! or merin multiple fact with different grains which is not good IMO

Why is there no workaround for multiple facts? It is not that difficult. And all that makes that SQL complicated are logical decisions, which the UI must provide as switches as well. So it won't be that much easier.
create view actual_vs_plan as

select year, cost_center_grp, sum(actual), sum(plan) from (

   select year, cost_center_grp, actual, null as plan from fact_co

  union all

  select year, cost_center_grp, null as actual, plan from fact_p where version = 1

)

group by year, cost_center_grp;

With this approach you need to read most of the things ans perform a distinct.

Further more you are not leaving the user the option to play with your dashboard. It's kind of hardcoded

wernerdaehn commented 2 years ago

No, there is no distinct clause in the SQL. No, it combines the two measures at the deepest level. Yes you cannot see the plan value e.g. per day but it is not planned per day, hence the data is not available at that level.

Let me ask the question differently: How would you suggest the UI should look like to support combining multiple facts and what would happen under the cover, using my example?

rusackas commented 1 year ago

It seems this SIP proposal was never given an official number, nor was it taken to the Apache Superset dev mailing list for the official discussion as part of the SIP procedure. I'll number it as 94 (even if that's a little strange at this juncture), and request that it either get proposed officially as a step leading toward a vote, or that we close it if there's no interest in pursuing it further. If you want to attend the next Superset Town Hall, that would be a great place to gather a bit of sentiment around it, as well.

rusackas commented 11 months ago

Since there has been no substantive discussion here in about a year and a half, I'm going to close this issue and move it to discarded. If anyone wants to rekindle the effort and carry it forward in the SIP process (i.e. starting a DISCUSS thread on the Apache Superset dev mailing list), I'm more than happy to re-open it and adjust the status.

squalou commented 7 months ago

Hi, I'm wondering how to best use superset with a start model. I found this discussion, unfortunatley closed.

More generally speaking, I wonder how to prevent an issue from being "closed due to inactivity" when it looks like everything that needed to be said was said, and waiting for superset side feedback on feasability.

Should I for instance re-open a new issue in this case referencing this issue ?

(and maybe add a "up!" commet every 6 months ;-) )

zhaoyongjie commented 7 months ago

@squalou , some years ago I replied the issue. For now, I'm also thinking the initial design of Superset never consider data modeling, e.g. snowflake or star modeling. The workaround for data modeling you should consider other sematic/modeling tools e.g. DBT, and expose a "big flat table" to Superset.

squalou commented 7 months ago

@zhaoyongjie thanks !

leonliuaptean commented 5 months ago

we decided not to use Superset because it doesn't support Star Schema

zhaoyongjie commented 5 months ago

@leonliuaptean To be honest, there are lots of workarounds for the Star/Snowflake/and so on data modeling. Superset is a viz tool, and its original design didn't focus too much on data modeling.

jomach commented 5 months ago

@zhaoyongjie could you point us to the examples that you mean? I could not find a very good workaround for it. Take power bi as an example, the support this and it is very good why ? On larger teams where I'm working there are dedicated teams to building dashboards, which actually do not know it works in the background. Something like power bi modeling would be a huge step forward on superset

wernerdaehn commented 5 months ago

@zhaoyongjie Superset should not allow data modeling but be able to consume data from query optimized tables.

Let's see what the current method is, if you want to visualize the data from a data warehouse:

You create a view sales_revenue_per_year_region as select s.year, g.region_name, sum(s.revenue) from sales_orders s join geography g on (g.location = s.location) group by s.year, g.region_name and now you can view this years or last years revenue per region. Somebody else wants to report per month, so another view must be created: sales_revenue_per_month_region as select s.year_month, g.region_name, sum(s.revenue) from sales_orders s join geography g on (g.location = s.location) group by s.year_month, g.region_name. Another wants to report per country, another per material group, another per country and material_group, and .... you get the point.

All to workaround the fact that these queries can be built fully automatically by looking at the columns picked. Columns from the dimension go into the group by, dimensions that are not used are not added, columns with numbers from the fact table are aggregated.

apache / superset