adobe-research / spark-parquet-thrift-example

Example Spark project using Parquet as a columnar store with Thrift objects.
Apache License 2.0

Support for Thrift UNION definition #1

Closed comcmipi closed 9 years ago

comcmipi commented 9 years ago


I was not able to compile a Thrift schema containing a "union" definition. Are unions supported in the current version?

Thank you very much, Best, Michal

bamos commented 9 years ago

Hi @comcmipi, the sample Thrift schema provided in this repo doesn't use a union definition. Are you trying to get this to work with a custom Thrift definition? You can debug your Thrift schema by generating Java output with thrift --gen java <your schema>.thrift.

Regards, Brandon.

comcmipi commented 9 years ago

Hi Brandon,

thank you for your reply.

Yes, I'm using a custom Thrift definition, something that can be simplified to:

union CoilID { 1: i64 register_id; }

I was wrong to say the Thrift schema does not compile; it does. The problem arises when I try to compile a Spark Scala program with a simple assignment, e.g.

val sampleCoilID = new CoilID(123123123L)

It returns the error:

[error]  found   : Long(123123123L)
[error]  required: com.adobe.spark_parquet_thrift.CoilID
[error]     val sampleCoilID = new CoilID(123123123L)
[error]                                   ^
[error] one error found
[error] (compile:compile) Compilation failed

When I change "union" to "struct", i.e.:

struct CoilID { 1: required i64 register_id; }

everything works fine.

Thank you for your help, Best, Michal


Mgr. Michal Pitoňák, PhD. Department of Physical and Theoretical Chemistry, Faculty of Natural Sciences of Comenius University Bratislava, Slovakia

Mlynská Dolina 842 15, Bratislava 4 Slovakia

Office: CH1-2-328 tel: +421 908 706 628

bamos commented 9 years ago

Hi Michal,

I think your issue comes from Thrift generating different Java constructors for unions than for structs. I changed the SampleThriftObject.thrift in this repo to:

namespace java com.adobe.spark_parquet_thrift

union SampleThriftObject {
  10: string col_a;
  20: string col_b;
  30: string col_c;
}

With a struct, these objects are initialized with

    val sampleData = Range(1,10){ v: Int =>
      new SampleThriftObject("a"+v)
    }

However, this now causes the same error you're seeing:

┌[bamos☮derecho]-(~/repos/spark-parquet-thrift-example)-[git://master ✗]-
└> sbt assembly
[info] Loading project definition from /home/bamos/repos/spark-parquet-thrift-example/project
[info] Set current project to SparkParquetThrift (in build file:/home/bamos/repos/spark-parquet-thrift-example/)
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn]  * org.apache.thrift:libthrift:0.7.0 -> 0.9.1
[warn] Run 'evicted' to see detailed eviction warnings
[info] Compiling 1 Scala source to /home/bamos/repos/spark-parquet-thrift-example/target/scala-2.10/classes...
[error] /home/bamos/repos/spark-parquet-thrift-example/src/main/scala/SparkParquetThriftApp.scala:62: type mismatch;
[error]  found   : String
[error]  required: com.adobe.spark_parquet_thrift.SampleThriftObject
[error]       new SampleThriftObject("a"+v)
[error]                                 ^
[error] one error found
[error] (compile:compile) Compilation failed
[error] Total time: 4 s, completed Jan 21, 2015 7:12:47 PM

The source of this can be found in the sbt-thrift generated Thrift object, located at target/scala-2.10/src_managed/main/gen-java/com/adobe/spark_parquet_thrift/

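The only constructors available are a no-argument constructor, a copy constructor that takes another SampleThriftObject, and one that takes a _Fields tag plus a value. None takes strings directly, as the struct's constructor did.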

  public SampleThriftObject() {
    super();
  }

  public SampleThriftObject(_Fields setField, Object value) {
    super(setField, value);
  }

  public SampleThriftObject(SampleThriftObject other) {
    super(other);
  }
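
To illustrate the tag-plus-value constructor listed above, here is a minimal sketch using a hand-written stand-in class. `SampleUnionSketch`, its `Fields` enum, and `ConstructorDemo` are hypothetical names made up for this example; they are not the code that sbt-thrift generates. The point is only that, with three fields of the same type, a tag is what tells the object which field a value belongs to.

```java
import java.util.Objects;

// Hand-written stand-in for a generated union class; NOT the real Thrift output.
final class SampleUnionSketch {
    // One tag per union field, playing the role of the generated _Fields enum.
    enum Fields { COL_A, COL_B, COL_C }

    private final Fields setField;
    private final Object value;

    // Tag-plus-value constructor: the tag says which field the value belongs to.
    SampleUnionSketch(Fields setField, Object value) {
        this.setField = Objects.requireNonNull(setField);
        this.value = Objects.requireNonNull(value);
    }

    Fields getSetField() { return setField; }
    Object getFieldValue() { return value; }
}

class ConstructorDemo {
    public static void main(String[] args) {
        // Both values are the same string; only the tag distinguishes the fields.
        SampleUnionSketch a = new SampleUnionSketch(SampleUnionSketch.Fields.COL_A, "a1");
        SampleUnionSketch b = new SampleUnionSketch(SampleUnionSketch.Fields.COL_B, "a1");
        System.out.println(a.getSetField() + " = " + a.getFieldValue());
        System.out.println(b.getSetField() + " = " + b.getFieldValue());
    }
}
```

Going by the constructor signature shown in the excerpt, the equivalent call against the real generated class would presumably look like new SampleThriftObject(SampleThriftObject._Fields.COL_A, "a" + v), though that is worth checking against the generated file.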

Further down the definition, there are functions for setting the values within the union:

  public void setCol_a(String value) {
    if (value == null) throw new NullPointerException();
    setField_ = _Fields.COL_A;
    value_ = value;
  }
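
The setter above stores a single tag and a single value, which suggests that a union holds one field at a time and that setting another field replaces the previous one. The sketch below mimics that behaviour with a hand-written stand-in; `UnionSetterSketch`, `setColA`, `setColB`, and `SetterDemo` are hypothetical names for illustration, not generated Thrift code.

```java
// Hand-written stand-in for a generated union class; NOT the real Thrift output.
final class UnionSetterSketch {
    enum Fields { COL_A, COL_B }

    private Fields setField;  // which field is currently set (null while empty)
    private Object value;     // the value of that single field

    // Same shape as the setter in the excerpt: null check, then store tag and value.
    void setColA(String v) {
        if (v == null) throw new NullPointerException();
        setField = Fields.COL_A;
        value = v;
    }

    void setColB(String v) {
        if (v == null) throw new NullPointerException();
        setField = Fields.COL_B;
        value = v;
    }

    Fields getSetField() { return setField; }
    Object getFieldValue() { return value; }
}

class SetterDemo {
    public static void main(String[] args) {
        UnionSetterSketch u = new UnionSetterSketch();
        u.setColA("a1");
        u.setColB("b1");  // overwrites the tag and value stored by setColA
        System.out.println(u.getSetField() + " = " + u.getFieldValue());
    }
}
```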

So, using this information, I'm able to successfully compile:

    val sampleData = Range(1,10){ v: Int =>
      new SampleThriftObject().setCol_a("a" + v)
    }

┌[bamos☮derecho]-(~/repos/spark-parquet-thrift-example)-[git://master ✗]-
└> sbt assembly
[info] Loading project definition from /home/bamos/repos/spark-parquet-thrift-example/project
[info] Set current project to SparkParquetThrift (in build file:/home/bamos/repos/spark-parquet-thrift-example/)
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn]  * org.apache.thrift:libthrift:0.7.0 -> 0.9.1
[warn] Run 'evicted' to see detailed eviction warnings
[info] Including from cache: slf4j-api-1.7.2.jar
[info] Including from cache: parquet-hadoop-1.5.0.jar
[info] Including from cache: commons-lang-2.4.jar
[info] Including from cache: parquet-column-1.5.0.jar
[info] Including from cache: parquet-format-2.1.0.jar
[info] Including from cache: parquet-common-1.5.0.jar
[info] Including from cache: httpcore-4.2.4.jar
[info] Including from cache: guava-11.0.1.jar
[info] Including from cache: parquet-encoding-1.5.0.jar
[info] Including from cache: akka-slf4j_2.10-2.2.3.jar
[info] Including from cache: libthrift-0.9.1.jar
[info] Including from cache: commons-logging-1.1.1.jar
[info] Including from cache: parquet-generator-1.5.0.jar
[info] Including from cache: jackson-core-asl-1.9.11.jar
[info] Including from cache: commons-codec-1.6.jar
[info] Including from cache: commons-lang3-3.1.jar
[info] Including from cache: json-simple-1.1.jar
[info] Including from cache: hadoop-lzo-0.4.16.jar
[info] Including from cache: jsr305-1.3.9.jar
[info] Including from cache: httpclient-4.2.5.jar
[info] Including from cache: parquet-jackson-1.5.0.jar
[info] Including from cache: protobuf-java-2.4.1.jar
[info] Including from cache: parquet-thrift-1.5.0.jar
[info] Including from cache: elephant-bird-pig-4.4.jar
[info] Including from cache: config-1.0.2.jar
[info] Including from cache: elephant-bird-core-4.4.jar
[info] Including from cache: parquet-pig-1.5.0.jar
[info] Including from cache: elephant-bird-hadoop-compat-4.4.jar
[info] Including from cache: jackson-mapper-asl-1.9.11.jar
[info] Including from cache: akka-actor_2.10-2.2.3.jar
[info] Including from cache: snappy-java-1.0.5.jar
[info] Including from cache: scala-library-2.10.3.jar
[info] Checking every *.class/*.jar file's SHA-1.
[info] Merging files...
[warn] Merging 'META-INF/MANIFEST.MF' with strategy 'discard'
[info] Strategy 'deduplicate' was applied to 3 files (Run the task at debug level to see details)
[warn] Strategy 'discard' was applied to a file
[info] Assembly up to date: /home/bamos/repos/spark-parquet-thrift-example/target/scala-2.10/SparkParquetThrift.jar
[success] Total time: 2 s, completed Jan 21, 2015 7:17:01 PM
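
One caveat worth noting: the setter in the excerpt is declared void, so an expression that ends in the setter call evaluates to the setter's result (Unit in Scala) rather than to the object. If the objects themselves are needed later, one option is to create the object, call the setter, and then return the object. The sketch below shows that pattern with a hand-written stand-in; `ColUnionSketch`, `BuildDemo`, `withColA`, and `sampleData` are hypothetical names, not part of the generated code or of this repo.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written stand-in with a void setter, like the one in the excerpt; NOT real Thrift output.
final class ColUnionSketch {
    private String colA;

    void setColA(String v) {
        if (v == null) throw new NullPointerException();
        colA = v;
    }

    String getColA() { return colA; }
}

class BuildDemo {
    // Create, set, then return the object itself rather than the setter's void result.
    static ColUnionSketch withColA(String v) {
        ColUnionSketch u = new ColUnionSketch();
        u.setColA(v);
        return u;
    }

    // Builds one object per value in 1..9, the same values Range(1,10) covers in Scala.
    static List<ColUnionSketch> sampleData() {
        List<ColUnionSketch> out = new ArrayList<>();
        for (int v = 1; v < 10; v++) {
            out.add(withColA("a" + v));
        }
        return out;
    }

    public static void main(String[] args) {
        List<ColUnionSketch> data = sampleData();
        System.out.println(data.size() + " objects, first col_a = " + data.get(0).getColA());
    }
}
```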

Hope this helps!

Regards, Brandon.

comcmipi commented 9 years ago

Thank you very much Brandon,

worked perfectly!

All the best, Michal
