apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0

Spark SQL backend (to support Elasticsearch, Cassandra, etc) #241

Closed sscarduzio closed 3 years ago

sscarduzio commented 8 years ago

I can't resist saying Caravel looks much neater than Kibana, plus the user management doesn't cost money and it's not an afterthought. It would be amazing to see Caravel replacing my Kibana dashboard, using the data I've got currently in Elasticsearch.

You use an SQL interface to query the data store, is there any chance Caravel can speak to Elasticsearch through Spark SQL? Spark has a mature Elasticsearch connector, so it should be OK.

And wait.. If you support Spark SQL, you'll be immediately able to support HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source!

Is this a path worth exploring for this project? I think it's quite exciting.

gbrian commented 8 years ago

+1 I'm looking for Apache Drill connector, as well

ariepratama commented 8 years ago

+1 on this feature too

mistercrunch commented 8 years ago

Totally worth doing. There are two paths for it: either create a SqlAlchemy dialect (might not be possible if Spark SQL is funky), or create a new datasource and implement the query interface. For now we have two datasources: sqlalchemy and druid. It's totally doable to add a third one; it just needs to implement something like: https://github.com/airbnb/caravel/blob/master/caravel/models.py#L460

Basically you need to receive these parameters and return a pandas dataframe.
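That interface can be sketched roughly as follows. This is a minimal sketch: the function and parameter names are assumptions loosely modeled on the linked models.py, not Caravel's actual API, and the real method returns a pandas DataFrame built from the query results rather than a SQL string.

```python
# Minimal sketch of a datasource query interface: take Superset-style query
# parameters and produce the SQL a new backend would run. All names here are
# illustrative assumptions, not Caravel's actual signatures.
from datetime import datetime

def build_query(table, groupby, metrics, from_dttm, to_dttm, row_limit=5000):
    """Translate query parameters into a SQL string for the backend."""
    select_cols = ", ".join(list(groupby) + list(metrics))
    return (
        f"SELECT {select_cols} FROM {table} "
        f"WHERE ds >= '{from_dttm:%Y-%m-%d}' AND ds < '{to_dttm:%Y-%m-%d}' "
        f"GROUP BY {', '.join(groupby)} "
        f"LIMIT {row_limit}"
    )

sql = build_query(
    "events",
    groupby=["country"],
    metrics=["COUNT(*) AS cnt"],
    from_dttm=datetime(2016, 1, 1),
    to_dttm=datetime(2016, 2, 1),
)
```

A real connector would then hand the rows back as a DataFrame for the slice to render.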

mistercrunch commented 8 years ago

We use Spark at Airbnb and have some SparkSql in places, we might have use cases for it internally, but I'm not sure where it fits in the priority list.

sscarduzio commented 8 years ago

Cool thanks for the pointers! This new connector would surely unlock a wealth of valuable contributions from other businesses which happen to not use Druid or a plain RDBMS.

Sounds like a good investment to me :)

joshwalters commented 8 years ago

I am really interested in adding Hive support, I may take a crack at it sometime in the next few weeks. Dropbox has a Python/Hive project that I was looking at: https://github.com/dropbox/PyHive

gbrian commented 8 years ago

Does it mean Impala as well? Thanks

guang commented 8 years ago

+1

csalperwyck commented 8 years ago

+1 for Hive

joshwalters commented 8 years ago

@gbrian Yes, the package I am looking at would add support for Hive and Impala. I opened an issue to track this: https://github.com/airbnb/caravel/issues/339

OElesin commented 8 years ago

Great work guys, but can I load data from Elasticsearch?

rahulgagrani commented 8 years ago

+1 to addition of Elasticsearch support.

philippfrenzel commented 8 years ago

+1

povilasb commented 8 years ago

+1

nabilblk commented 8 years ago

+1 for Hive

bwboy commented 8 years ago

+1 for Hive and Elasticsearch

JohnOmernik commented 8 years ago

I am working on an Apache Drill SQLAlchemy dialect. I have some basic things working and have been collaborating with others on the Drill mailing list. There has been talk of plugging Drill into Elasticsearch, which seems a bit convoluted. However, since Elasticsearch doesn't have a SQL interface, Drill works really nicely here: if we get a dialect working for Drill, then its other storage plugins will (hopefully) just work. Some of the work can be found here:

Docker container with pyodbc, unixodbc, Drill ODBC, and caravel all working:

https://github.com/JohnOmernik/caraveldrill

Drill dialect (work in progress; feel free to play with it and try it, and please report issues as you find them, since this is iterative brute-force programming at this point!): https://github.com/JohnOmernik/sqlalchemy-drill

sathieu commented 8 years ago

I've taken a different approach and started a native backend.

WIP is at https://github.com/sathieu/caravel/tree/elasticsearch (beware: I'll squash commits and force push).

Not much is working yet, and I don't have dedicated time on it. We'll see what comes.

tninja commented 8 years ago

+1 to sparksql

bolkedebruin commented 8 years ago

For what it's worth: Spark 2 will be SQL compliant, so a SQLAlchemy dialect will then be feasible.

benvogan commented 8 years ago

+1 for spark SQL. That will get you connected to most data sources these days.

shkr commented 8 years ago

You can connect it to Spark SQL. If it uses a Hive back-end, then refer to this documentation page for instructions on how to connect to Spark SQL via a jdbc+hive connector: https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#01%20Databricks%20Overview/14%20Third%20Party%20Integrations/05%20Beeline.html. The one I prefer is dropbox/pyhive for connecting to Spark SQL in my Python projects. For Scala or Java, jdbc+hive is preferable.

sbookworm commented 8 years ago

+1 for spark sql

mistercrunch commented 8 years ago

Sweet! Can others confirm that SparkSQL works for them through SQLAlchemy?!

mistercrunch commented 8 years ago

Giving hints about how to use SparkSQL in the docs: https://github.com/airbnb/caravel/pull/803

giaosudau commented 8 years ago

@mistercrunch Right now it does. But on long queries it stalls the whole process. I think it's related to threading.

maver1ck commented 7 years ago

I used Spark Thrift Server with PyHive and it almost works (I needed to change one line in the Hive dialect).

kaiosama commented 7 years ago

@shkr Hi, I am trying to achieve the same thing with PyHive and have not been able to make it work. What is the URI you are using for setting up the Superset data source? I am trying something like jdbc+hive://localhost:10000/, and it gives an error: "Can't load plugin: sqlalchemy.dialects:jdbc.hive". I'm sure I must be missing something here. Thanks in advance for any instructions.

-- update -- Looks like I had a hiveserver2 problem; I restarted it and was then able to use this URI: hive://user@localhost:10000/database. However, I can't get what is listed on the wiki (jdbc+hive://) to work; the error message is "Can't load plugin: sqlalchemy.dialects:jdbc.hive".

I have another question: what do you mean when you say use SparkSQL as a backend? I am fairly new to this, but AFAIK I can save DataFrames in Spark SQL to a Hive table, from which I can then create a Superset table/slice using the above connector. But is there more I can do to make this process better? My overall goal is to be able to create tables/slices from parquet files on HDFS.
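One plausible route to that goal (an assumption on my part, not something confirmed in this thread) is to register the parquet files as an external Hive table and then point Superset at it through the hive:// connection. The helper below only builds the DDL string; the table name, columns, and HDFS path are placeholders.

```python
# Build Hive DDL that exposes parquet files on HDFS as an external table.
# Once the table exists in the metastore, Superset can query it through
# the hive:// connector like any other table. All names are placeholders.
def external_parquet_table_ddl(table, columns, hdfs_path):
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"STORED AS PARQUET LOCATION '{hdfs_path}'"
    )

ddl = external_parquet_table_ddl(
    "events",
    [("ts", "TIMESTAMP"), ("country", "STRING")],
    "hdfs://namenode/warehouse/events",
)
```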

ChethanChandra commented 7 years ago

+1 for Elasticsearch support.

santhavathi commented 7 years ago

@giaosudau, what is the SQLAlchemy URI I should give in Superset to connect to SparkSQL? I used the one below and it is not working (172.31.12.201 is where the 1.6.2 Spark master runs): hive://172.31.12.201:7077/test_database

shkr commented 7 years ago

@santhavathi when you open the Spark UI dashboard, there is an IP printed at the top, which is the hostname of the head of the cluster. You have to use that as your hostname in the hive URL.

Example: hive://<spark-cluster-master>/

santhavathi commented 7 years ago

@shkr, thanks so much for the reply. I had to start the Hive server (Spark Thrift Server) on my Spark cluster. Also, hive:// gives the error below: ERROR: Connection failed!

The error message returned was: Could not locate column in row for column 'tab_name'

I used impala:// and it works now.

cduverne commented 7 years ago

Hello guys, I see in the documentation that SparkSQL is supported: http://airbnb.io/superset/installation.html#database-dependencies.

What does this concretely mean? Which DBs can we query then?

Thanks a lot in advance.

kaiosama commented 7 years ago

@shkr following your latest comment, I tried the URI hive://172.17.0.2, where 172.17.0.2 is what I got from the Spark UI.

It lets me add it as a database; so far so good. However, when I query a table in this database, the job tracker shows a MapReduce job. I would expect it to be a Spark job; is that true in your case? I was able to connect to local Hive using hive://localhost:10000, and so far these two behave the same to me.

santhavathi commented 7 years ago

@kaiosama, when you connect to hive://172.17.0.2, what port are you using, and are you connecting directly to the Spark master without a Hive server running?

kaiosama commented 7 years ago

@santhavathi that is the full URI I used, without a port number. I tried some port numbers from the Spark UI page, but none of them worked.

It was with a running Hive server. Maybe I am missing something here, but it seems to me that Spark SQL is meant to be used against Hive, i.e. you always need a running Hive server? Or can the Spark SQL connector be used against other sources? As @cduverne mentioned, it's not very clear to me. And I have not received any replies about how to get "jdbc+hive" to work as described in the documentation.

oblamine commented 7 years ago

+1 for Hbase support :)

mistercrunch commented 7 years ago

At Airbnb we can do HBase through Presto with the HBase Presto connector.

oblamine commented 7 years ago

Would you please give me a link so I can follow the install steps?

balchandra commented 7 years ago

Hi, can someone please list the steps to connect Elasticsearch to Superset? It would be a great help.

mistercrunch commented 7 years ago

@balchandra it would involve using this: https://github.com/loverajoel/sqlalchemy-elasticquery

shkr commented 7 years ago

@kaiosama The hostname directs SQLAlchemy to use SQL at the given port. Hard to say whether a MapReduce job is the normal behavior to expect without knowing the details of your Hive, MapReduce, and Spark setup.

balchandra commented 7 years ago

@mistercrunch... I tried the same: connecting Superset with sqlalchemy-elasticquery. I was able to connect when both Superset and Elasticsearch were installed on the same server, but I was not able to view tables/indices once connected. Can you tell me how exactly it is supposed to be used? It would help me to a great extent. Thanks in advance.

mistercrunch commented 7 years ago

Looks like sqlalchemy-elasticquery isn't what I thought it was. Depending on how ANSI compliant Elasticsearch's SQL is, it may be possible to create your own SQLAlchemy dialect. If not, someone would have to create a new connector for it. Luckily I recently refactored and formalized the connector abstraction.
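The "new connector" option could look roughly like this. This is purely illustrative; the class and method names are assumptions, not Superset's actual connector API.

```python
# Rough shape of a from-scratch connector: a class that turns Superset-style
# query parameters into rows. A real Elasticsearch connector would translate
# the parameters into an aggregation request and reshape the response into
# rows; this stub only echoes the requested columns back.
class BaseConnector:  # stand-in for Superset's real connector base class
    def query(self, query_obj):
        raise NotImplementedError

class ElasticsearchConnector(BaseConnector):
    def query(self, query_obj):
        columns = list(query_obj["groupby"]) + list(query_obj["metrics"])
        return {"columns": columns, "rows": []}

result = ElasticsearchConnector().query(
    {"groupby": ["country"], "metrics": ["count"]}
)
```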

xycloud commented 7 years ago

+1 for elasticsearch

zbidi commented 7 years ago

+1 for elasticsearch

hongqp commented 7 years ago

+1 for Hive and Elasticsearch

mistercrunch commented 4 years ago

Good news about ElasticSearch here! https://github.com/apache/incubator-superset/pull/8441

srinify commented 3 years ago

Closing since Superset now works with Elasticsearch!

https://superset.apache.org/docs/databases/elasticsearch