Open cpcloud opened 9 years ago
FWIW, Spark can use the JDBC connector to do computation. It looks like this:
sc = SparkContext(...)
sql = HiveContext(...)
df = sql.load("jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")
I don't think this would be that tricky. The function would look something like this:
@append.register(SQLContext, sa.Table)
def sql_to_sparksql(ctx, tb, **kwargs):
url = connection_string_from_engine(tb.bind) # <- not implemented, but i think easy
dbtable = [tb.name]
if tb.schema is not None:
dbtable.insert(0, tb.schema)
return ctx.load('jdbc', url=url, dbtable='.'.join(dbtable))
here's the spark docs on using jdbc
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#jdbc-to-other-databases
If anyone is interested in implementing this, you'll have to sudo apt-get install libpostgresql-jdbc-java
for testing on Travis-CI and then set SPARK_CLASSPATH
to one of the following
/usr/share/java/postgresql-jdbc3-9.1.jar
/usr/share/java/postgresql-jdbc4-9.1.jar
because according to the travis docs the default install is postgres 9.1
By default, the build environment will have version 9.1 running already.
The postgres connection string syntax can be found here:
https://jdbc.postgresql.org/documentation/91/connect.html
one really annoying thing is that it looks like jdbc connection strings aren't standard and can be different for different vendors
here's oracle's:
http://docs.oracle.com/cd/B28359_01/java.111/b31224/urls.htm#BEIJFHHB
migration from blaze
https://github.com/ContinuumIO/blaze/issues/582
reposting for ease of use:
from @chdoig
As Spark is becoming a popular backend, it's a common use case for people to want to transfer their datasets in DBs to a Spark/HDFS cluster.
It would be nice to have an easy interface for end-users to transfer their tables in DBs to a Cluster.
into(Spark/HDFS, SQL DBs)
A lot of people are talking now about tachyon, maybe worth taking a look: http://tachyon-project.org/ http://ampcamp.berkeley.edu/big-data-mini-course/tachyon.html
This might be related with @quasiben work on SparkSQL. Maybe a barrier for people to star using SparkSQL is how they should make that transfer since:
But I'm not able to find how you make that connection from existing SQL DBs: http://spark.apache.org/docs/latest/sql-programming-guide.html
cc: @mrocklin