blaze / odo

Data Migration for the Blaze Project
http://odo.readthedocs.org/
BSD 3-Clause "New" or "Revised" License

into(Spark/HDFS, SQL DBs) #31

Open cpcloud opened 9 years ago

cpcloud commented 9 years ago

migration from blaze

https://github.com/ContinuumIO/blaze/issues/582

reposting for ease of use:

from @chdoig

As Spark is becoming a popular backend, it's a common use case for people to want to transfer their datasets from SQL databases to a Spark/HDFS cluster.

It would be nice to have an easy interface for end users to transfer their database tables to a cluster.

into(Spark/HDFS, SQL DBs)

A lot of people are talking about Tachyon now; it may be worth a look: http://tachyon-project.org/ http://ampcamp.berkeley.edu/big-data-mini-course/tachyon.html

This might be related to @quasiben's work on SparkSQL. A barrier for people starting to use SparkSQL may be figuring out how to make that transfer, since:

A SchemaRDD can be created from an existing RDD, Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.

But I can't find how to make that connection from existing SQL DBs: http://spark.apache.org/docs/latest/sql-programming-guide.html

cc: @mrocklin
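
For reference, here's a minimal sketch of the creation paths that are documented, using the PySpark 1.3 API (the app name, file paths, and data are made up for illustration):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext('local[*]', 'schemardd-example')
sqlctx = SQLContext(sc)

# From an existing RDD of Rows (schema inferred):
people = sqlctx.createDataFrame(sc.parallelize([Row(name='Alice', age=1)]))

# From a JSON dataset or a Parquet file:
from_json = sqlctx.jsonFile('examples/people.json')
from_parquet = sqlctx.parquetFile('examples/people.parquet')

# ...but there's no documented path from an existing SQL DB.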

cpcloud commented 9 years ago

FWIW, Spark can use the JDBC connector to do computation. It looks like this:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(...)
sql = HiveContext(sc)
df = sql.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")
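
The result is a regular DataFrame, so downstream operations (e.g. df.count()) run in Spark against rows fetched over JDBC.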

I don't think this would be that tricky. The function would look something like this:

import sqlalchemy as sa
from odo import append
from pyspark.sql import SQLContext

@append.register(SQLContext, sa.Table)
def sql_to_sparksql(ctx, tb, **kwargs):
    # Build a JDBC URL from the table's bound engine.
    url = connection_string_from_engine(tb.bind)  # <- not implemented, but I think easy
    # Qualify the table name with its schema, if any.
    dbtable = [tb.name]
    if tb.schema is not None:
        dbtable.insert(0, tb.schema)
    return ctx.load(source='jdbc', url=url, dbtable='.'.join(dbtable))

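For what it's worth, a rough sketch of what connection_string_from_engine could look like, assuming a Postgres backend (JDBC URL formats differ by vendor; see below):

def connection_string_from_engine(engine):
    # Illustrative only: translate a SQLAlchemy engine URL into a
    # Postgres JDBC URL of the form jdbc:postgresql://host:port/db.
    u = engine.url
    jdbc = 'jdbc:postgresql://{0}:{1}/{2}'.format(
        u.host or 'localhost', u.port or 5432, u.database)
    params = []
    if u.username:
        params.append('user={0}'.format(u.username))
    if u.password:
        params.append('password={0}'.format(u.password))
    if params:
        jdbc += '?' + '&'.join(params)
    return jdbc
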
cpcloud commented 9 years ago

Here are the Spark docs on using JDBC:

https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#jdbc-to-other-databases

cpcloud commented 9 years ago

If anyone is interested in implementing this, you'll have to sudo apt-get install libpostgresql-jdbc-java for testing on Travis CI and then set SPARK_CLASSPATH to one of the following:

/usr/share/java/postgresql-jdbc3-9.1.jar
/usr/share/java/postgresql-jdbc4-9.1.jar

because according to the Travis docs, the default install is Postgres 9.1:

By default, the build environment will have version 9.1 running already.
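
A minimal sketch of that test setup, with a hypothetical app name (SPARK_CLASSPATH has to be in the environment before the SparkContext launches the JVM):

import os

# Point Spark at the Postgres JDBC driver before any SparkContext is created.
os.environ['SPARK_CLASSPATH'] = '/usr/share/java/postgresql-jdbc4-9.1.jar'

from pyspark import SparkContext
sc = SparkContext('local[*]', 'odo-jdbc-tests')
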
cpcloud commented 9 years ago

The Postgres JDBC connection string syntax can be found here:

https://jdbc.postgresql.org/documentation/91/connect.html

One really annoying thing: it looks like JDBC connection strings aren't standardized and differ from vendor to vendor.

Here's Oracle's:

http://docs.oracle.com/cd/B28359_01/java.111/b31224/urls.htm#BEIJFHHB
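
For illustration, here's roughly how the two vendors' URL formats compare (host, port, and database names below are made up):

# Postgres (per the pgJDBC docs):
postgres_url = 'jdbc:postgresql://dbserver:5432/mydb'

# Oracle thin driver (per the Oracle docs):
oracle_url = 'jdbc:oracle:thin:@//dbserver:1521/myservice'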
