apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0
61.88k stars 13.55k forks source link

Support of JOIN statement for queries #875

Closed ololobus closed 8 years ago

ololobus commented 8 years ago

Are there any plans to add JOINs support?

Here is a simple case: there are two tables documents and languages, every document has language_id column, every language has name and code columns. Thus, now I can only display distribution by language_id which is not too informative. document_languages

It could be solved andding a simple join statement with select of e.g. languages.code instead of documents.language_id.

I would be glad to help somehow and even started reading the project code, but it's hard enough to navigate through all this code. I found that main functions for queries performing are placed here https://github.com/airbnb/caravel/blob/master/caravel/models.py

P.S. Sorry, if I missed something and there is already a possibility to do it somehow.

mistercrunch commented 8 years ago

http://airbnb.io/caravel/faq.html#can-i-query-join-multiple-tables-at-one-time

Can I query/join multiple tables at one time?

Not directly no. A Caravel SQLAlchemy datasource can only be a single table or a view.

When working with tables, the solution would be to materialize a table that contains all the fields needed for your analysis, most likely through some scheduled batch process.

A view is a simple logical layer that abstract an arbitrary SQL queries as a virtual table. This can allow you to join and union multiple tables, and to apply some transformation using arbitrary SQL expressions. The limitation there is your database performance as Caravel effectively will run a query on top of your query (view). A good practice may be to limit yourself to joining your main large table to one or many small tables only, and avoid using GROUP BY where possible as Caravel will do its own GROUP BY and doing the work twice might slow down performance.

Whether you use a table or a view, the important factor is whether your database is fast enough to serve it in an interactive fashion to provide a good user experience in Caravel.

mistercrunch commented 8 years ago

Though we may add support for joinable tables in the future it's not on the short term roadmap, it would only be for many to one left joins on static joining criteria.

ololobus commented 8 years ago

@mistercrunch I've seen this FAQ answer, of course. But it looks more confusing than helpful for me, because I understood it in a way, that I can create new views inside Caravel somehow rather than directly inside DB. OK, thank you.

alanmcruickshank commented 8 years ago

@ololobus @mistercrunch This is actually possible at the moment through some recently added functionality added for the work on additional datetimes. This method is way faster than using views (unless you're able to materialise), due to how sqlalchemy executes the query.

Head into your table setup and click the edit button: screenshot-a

Go to the "List Table Columns" Tab and click add screenshot-b

give your column a name (in your case, language_code or something like that), select group by and filter by and then in expression add something like:

(select langauges.code from languages where language_id = documents.languages_id)

Save and then you should be able to group by this column.

What you're effectively doing it added a composite column into the sqlalchemy model which is then injected directly into the query at runtime.

ololobus commented 8 years ago

@alanmcruickshank It worked, thank you!

I've take a look on the produced SQL query in this case:

SELECT (select languages.code from languages where languages.id =
    documents.language_id) AS language_code, COUNT(*) AS count 
    FROM documents JOIN (SELECT (select languages.code from languages where languages.id =
    documents.language_id) AS language_code__ 
    FROM documents 
    WHERE created_at >= '2016-07-26 23:09:12.000000' AND created_at <= '2016-08-02 23:09:12.000000' GROUP BY (select languages.code from languages where languages.id =
    documents.language_id) ORDER BY COUNT(*) DESC 
     LIMIT 50) AS anon_1 ON (select languages.code from languages where languages.id =
    documents.language_id) = language_code__ 
    WHERE created_at >= '2016-07-26 23:09:12.000000' AND created_at <= '2016-08-02 23:09:12.000000' GROUP BY (select languages.code from languages where languages.id =
    documents.language_id) ORDER BY count DESC 
     LIMIT 50000

It results in a very tricky JOIN, so created inside db view with joined tables, e.g. documents_language will perform relatively the same even if it's not materialized.

Anyway, I'm glad that I can achieve the same result directly from Caravel. Thank you again.

P.S. It would be great to add such an example inside this FAQ answer, because it's frequent case, I guess :)

QiXuanWang commented 7 years ago

I found this issue when I was searching for multiple tables support. In my application, I need to perform data analysis based on raw tables and then display it through superset (hopefully). My raw database used many small tables, while new table/view to be displayed could be stored in a materialized table for view purpose if needed. But I'm not sure if it's a proper use case for superset anymore. If anyone has comment, I'd like to see how this could be solved.

mistercrunch commented 7 years ago

Using a view, materialized or not, is how you'd make that happen in Superset. It is a proper use case.

QiXuanWang commented 7 years ago

Thanks a lot @mistercrunch I was wondering since I didn't fully get your comment above: "A view is a simple logical layer that abstract an arbitrary SQL queries as a virtual table. This can allow you to join and union multiple tables, and to apply some transformation using arbitrary SQL expressions. "

Not sure I fully understand this. By "view" do you mean a class in views.py or just a SQL expression? For me, I'll make some complicated data manipulating on raw tables so a simple SQL expression should not work. I'm not quite familiar with views.py(FAB) yet so I can't say too much about this.

mistercrunch commented 7 years ago

I meant it in the database/SQL sense of the term, as in CREATE VIEW foo AS SELECT...

QiXuanWang commented 7 years ago

Appreciate it. Now I understand. My application will call python functions to do post-processing so I guess a materialized table(though it should be called data-source too) is necessary.

gclsoft commented 7 years ago

@alanmcruickshank Not work at all.

SELECT cfname,
       orgname FROM (SELECT cfname AS cfname,
                           (select org.fname_l2
                              from t_org_ctrlunit org
                             where fid = org.fcontrolunitid) AS orgname
                      FROM ct_crm_customerinfo
                     WHERE fcreatetime >=
                           TO_TIMESTAMP('2017-08-12T08:21:17\',
                                        'YYYY-MM-DD"T"HH24:MI:SS.ff6\')
                       AND fcreatetime <=
                           TO_TIMESTAMP('2017-08-19T08:21:17\',
                                        'YYYY-MM-DD"T"HH24:MI:SS.ff6\')) WHERE ROWNUM <= 50000

I added (select org.fname_l2 from t_org_ctrlunit org where fid = org.fcontrolunitid) in the Expression. Supderset can't even left join, the sql is wrong. How to solve it?

mistercrunch commented 7 years ago

@gclsoft SELECT in expressions as IN

SELECT (SELECT...) ...

isn't supported by all database engines.

Just create a view!

alanmcruickshank commented 7 years ago

@gclsoft - what database engine are you using?

The other thing that might help you is since this original question, the release of the SQL Lab feature has made it much easier to create tables in superset which point at joined tables within your database.

lilloraffa commented 7 years ago

@alanmcruickshank Hi there, I'm trying to explore what you are saying (use SQL Lab to create tables based on sql statement, and so also table made of joins), but I've not been very successful to find a well defined process to do that. Only thing I figured out is to use SQL Lab -> run the query -> click on "visualize" ->this create "magically" a new datasource, which seems more a workaround than a proper process to create a new datasource from a query. Would you be able to give us more details in case there actually is a more "structural" way to do the job? Thanks a lot!

xrmx commented 7 years ago

@lilloraffa that's it, what do you have in mind for a more structured way to handle it?

lilloraffa commented 7 years ago

@xrmx well, since tables are created and managed in Sources->Tables, I would expect a functionality there to let me create tables based on sql statements. I mean, letting people create table using sql statement from a "visualize" function in SQL Lab seems more like a workaround to me.

xrmx commented 7 years ago

@lilloraffa i think the idea is that the people will explore the sql they need as datasource as they do for the data. And they don't have access to to a sql shell :)

Your use case is that you already know the sql and have another shell to try it and you just want to copy and paste it in a new table right? That should be doable.

lilloraffa commented 7 years ago

@xrmx that is correct, in general I would like to have the flexibility to define table with sql statement so to join more table, even coming from different database. My general use case is to provide a general purpose analytics framework (we are evaluating superset for this) to be connected to a Big Data Infrastructure, and user can choose which dataset (or join of dataset) to use in their analysis. Thanks a lot!

xrmx commented 7 years ago

@lilloraffa it shouldn't take much to implement, feel free to open a PR :) cough If it's challenging enough looks more useful than a theme ;) cough

mistercrunch commented 7 years ago

The tricky part is getting the column metadata. In the Visualize button flow in SQL Lab, we already know the columns and have an idea of the data types as we have the result set on hand when the button is clicked.

To do this in the "Add Table" flow, we'd have to run a quick query to do this, which may time out or whatever and again we're just guessing the data types based on the result set which isn't ideal.

Seems like it would be much easier to have another flow that would run a CREATE VIEW on the user's behalf (assuming the db user in the connection string has the permission to do that), and then just use the view as if it were a table (then it's easy to get the actual data types from the DB).

Though then we may need some way to allow users to ALTER VIEW as well. All of this is kind of a can of worms...

olokedeomotayo commented 6 years ago

Dear All,

I have a table i joined in SQL Lab in Superset but only one of the Table is displaying. I have used LEFT and Right Join but same thing.. Any help please.

xrmx commented 6 years ago

@olokedeomotayo you already filed an issue, no need to repeat yourself here please.

gclsoft commented 6 years ago

Does superset support left join now? Instead of using views or only one table?

mistercrunch commented 6 years ago

Still no support for joins at the moment and in the near future. We still tell people to create a view and use that as a source. Many more advanced dbs with good query optmizers will in fact omit the join from the plan when left-joining on a field that has a unique constraint and no fields from the remote table are referenced in the query. My point being that there may be no extra cost for unused joins in views in some cases.

gclsoft commented 4 years ago

Does superset support left join now? Instead of using views or only one table?

JayGuAtGitHub commented 4 years ago

Does superset support left join now? Instead of using views or only one table?

seems still not support after my 1 hour research

blackhawk28 commented 3 years ago

@gclsoft SELECT in expressions as IN

SELECT (SELECT...) ...

isn't supported by all database engines.

Just create a view!

by this can we join table from multiple databases ?

blackhawk28 commented 3 years ago

@xrmx that is correct, in general I would like to have the flexibility to define table with sql statement so to join more table, even coming from different database. My general use case is to provide a general purpose analytics framework (we are evaluating superset for this) to be connected to a Big Data Infrastructure, and user can choose which dataset (or join of dataset) to use in their analysis. Thanks a lot!

hey were you able to do that join table from multiple databases?

benrii22 commented 3 years ago

@blackhawk28 Hey, did you find out how to join tables from different databases?