adobe-research / spark-parquet-thrift-example

Example Spark project using Parquet as a columnar store with Thrift objects.
Apache License 2.0

Support for Thrift UNION definition #1

Closed comcmipi closed 9 years ago

comcmipi commented 9 years ago

Hi,

I was not able to compile a Thrift schema containing a "union" definition. Are unions supported in the current version?

Thank you very much, Best, Michal

bamos commented 9 years ago

Hi @comcmipi, the sample Thrift schema provided in this repo doesn't use a union definition. Are you trying to get this to work with a custom Thrift definition? You can debug your Thrift schema by generating Java output with thrift --gen java <your schema>.thrift.

Regards, Brandon.

comcmipi commented 9 years ago

Hi Brandon,

thank you for your reply.

Yes, I'm using a custom Thrift definition, which can be simplified to:

union CoilID { 1: i64 register_id; }

I was wrong to say that the Thrift schema does not compile; it does. The problem arises when I try to compile a Spark Scala program containing a simple assignment, e.g.

val sampleCoilID = new CoilID(123123123L)

It returns the error:

[error]  found   : Long(123123123L)
[error]  required: com.adobe.spark_parquet_thrift.CoilID
[error] val sampleCoilID = new CoilID(123123123L)
[error]                               ^
[error] one error found
[error] (compile:compile) Compilation failed

When I change "union" to "struct", i.e.:

struct CoilID { 1: required i64 register_id; }

everything works fine.

Thank you for your help, Best, Michal

bamos commented 9 years ago

Hi Michal,

I think your issue comes from the Java constructors that Thrift generates for unions being different from the ones it generates for structs. To reproduce it, I changed the SampleThriftObject.thrift in this repo to:

namespace java com.adobe.spark_parquet_thrift

union SampleThriftObject {
  10: string col_a;
  20: string col_b;
  30: string col_c;
}

With a struct, these objects are initialized with

    val sampleData = Range(1,10).toSeq.map{ v: Int =>
      new SampleThriftObject("a"+v)
    }

However, this now causes the same error you're seeing:

┌[bamos☮derecho]-(~/repos/spark-parquet-thrift-example)-[git://master ✗]-
└> sbt assembly
[info] Loading project definition from /home/bamos/repos/spark-parquet-thrift-example/project
[info] Set current project to SparkParquetThrift (in build file:/home/bamos/repos/spark-parquet-thrift-example/)
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn]  * org.apache.thrift:libthrift:0.7.0 -> 0.9.1
[warn] Run 'evicted' to see detailed eviction warnings
[info] Compiling 1 Scala source to /home/bamos/repos/spark-parquet-thrift-example/target/scala-2.10/classes...
[error] /home/bamos/repos/spark-parquet-thrift-example/src/main/scala/SparkParquetThriftApp.scala:62: type mismatch;
[error]  found   : String
[error]  required: com.adobe.spark_parquet_thrift.SampleThriftObject
[error]       new SampleThriftObject("a"+v)
[error]                                 ^
[error] one error found
[error] (compile:compile) Compilation failed
[error] Total time: 4 s, completed Jan 21, 2015 7:12:47 PM

The cause can be seen in the Java class generated by sbt-thrift, located at target/scala-2.10/src_managed/main/gen-java/com/adobe/spark_parquet_thrift/SampleThriftObject.java.

The only constructors available are the no-argument constructor, a copy constructor, and one that takes a _Fields value together with the value to store. None takes a string directly, as the struct's constructor did.

  public SampleThriftObject() {
    super();
  }

  public SampleThriftObject(_Fields setField, Object value) {
    super(setField, value);
  }

  public SampleThriftObject(SampleThriftObject other) {
    super(other);
  }
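
Applied to the CoilID union from earlier in the thread, the two-argument constructor is the one that accepts a raw value, provided the field is named explicitly. Below is a minimal sketch of that call shape. The CoilID class here is a hand-written stand-in, not the real generated code: the REGISTER_ID constant name is assumed from the COL_A pattern in the generated class, and coilIdOf is a hypothetical helper.

```java
public class CoilIdSketch {
    // Hypothetical stand-in for the class Thrift would generate from
    // `union CoilID { 1: i64 register_id; }`. It mirrors only the constructor
    // shapes quoted above; the real class extends Thrift's TUnion.
    public static final class CoilID {
        public enum _Fields { REGISTER_ID } // assumed constant name

        private _Fields setField_;
        private Object value_;

        public CoilID() {}

        public CoilID(_Fields setField, Object value) {
            if (value == null) throw new NullPointerException();
            this.setField_ = setField;
            this.value_ = value;
        }

        public _Fields getSetField() { return setField_; }
        public Object getFieldValue() { return value_; }
    }

    // Hypothetical helper: name the union field explicitly instead of
    // passing the bare long to a one-argument constructor.
    public static CoilID coilIdOf(long registerId) {
        return new CoilID(CoilID._Fields.REGISTER_ID, registerId);
    }

    public static void main(String[] args) {
        CoilID id = coilIdOf(123123123L);
        System.out.println(id.getSetField() + " = " + id.getFieldValue());
    }
}
```

In the Scala program, the equivalent call would presumably be new CoilID(CoilID._Fields.REGISTER_ID, 123123123L), although that has not been tested against the real generated class.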

Further down in the generated class, there are methods for setting the value held by the union:

  public void setCol_a(String value) {
    if (value == null) throw new NullPointerException();
    setField_ = _Fields.COL_A;
    value_ = value;
  }

So, using this information, I'm able to successfully compile:

    val sampleData = Range(1,10).toSeq.map{ v: Int =>
      val obj = new SampleThriftObject()
      obj.setCol_a("a" + v) // the generated setter returns void, so return obj explicitly
      obj
    }

┌[bamos☮derecho]-(~/repos/spark-parquet-thrift-example)-[git://master ✗]-
└> sbt assembly
[info] Loading project definition from /home/bamos/repos/spark-parquet-thrift-example/project
[info] Set current project to SparkParquetThrift (in build file:/home/bamos/repos/spark-parquet-thrift-example/)
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn]  * org.apache.thrift:libthrift:0.7.0 -> 0.9.1
[warn] Run 'evicted' to see detailed eviction warnings
[info] Including from cache: slf4j-api-1.7.2.jar
[info] Including from cache: parquet-hadoop-1.5.0.jar
[info] Including from cache: commons-lang-2.4.jar
[info] Including from cache: parquet-column-1.5.0.jar
[info] Including from cache: parquet-format-2.1.0.jar
[info] Including from cache: parquet-common-1.5.0.jar
[info] Including from cache: httpcore-4.2.4.jar
[info] Including from cache: guava-11.0.1.jar
[info] Including from cache: parquet-encoding-1.5.0.jar
[info] Including from cache: akka-slf4j_2.10-2.2.3.jar
[info] Including from cache: libthrift-0.9.1.jar
[info] Including from cache: commons-logging-1.1.1.jar
[info] Including from cache: parquet-generator-1.5.0.jar
[info] Including from cache: jackson-core-asl-1.9.11.jar
[info] Including from cache: commons-codec-1.6.jar
[info] Including from cache: commons-lang3-3.1.jar
[info] Including from cache: json-simple-1.1.jar
[info] Including from cache: hadoop-lzo-0.4.16.jar
[info] Including from cache: jsr305-1.3.9.jar
[info] Including from cache: httpclient-4.2.5.jar
[info] Including from cache: parquet-jackson-1.5.0.jar
[info] Including from cache: protobuf-java-2.4.1.jar
[info] Including from cache: parquet-thrift-1.5.0.jar
[info] Including from cache: elephant-bird-pig-4.4.jar
[info] Including from cache: config-1.0.2.jar
[info] Including from cache: elephant-bird-core-4.4.jar
[info] Including from cache: parquet-pig-1.5.0.jar
[info] Including from cache: elephant-bird-hadoop-compat-4.4.jar
[info] Including from cache: jackson-mapper-asl-1.9.11.jar
[info] Including from cache: akka-actor_2.10-2.2.3.jar
[info] Including from cache: snappy-java-1.0.5.jar
[info] Including from cache: scala-library-2.10.3.jar
[info] Checking every *.class/*.jar file's SHA-1.
[info] Merging files...
[warn] Merging 'META-INF/MANIFEST.MF' with strategy 'discard'
[info] Strategy 'deduplicate' was applied to 3 files (Run the task at debug level to see details)
[warn] Strategy 'discard' was applied to a file
[info] Assembly up to date: /home/bamos/repos/spark-parquet-thrift-example/target/scala-2.10/SparkParquetThrift.jar
[success] Total time: 2 s, completed Jan 21, 2015 7:17:01 PM
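
One detail to keep in mind with these setters: setCol_a is declared void, so in Scala a block whose last expression is the setter call evaluates to Unit rather than to the object. The object has to be held in a variable and returned or collected separately. Here is a minimal Java sketch of the same pattern. SampleThriftObject is a hand-written stand-in that copies only the setter shown above, and buildSamples is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;

public class UnionSetterSketch {
    // Hypothetical stand-in for the generated union class; the real one
    // extends Thrift's TUnion and has setters for col_b and col_c as well.
    public static final class SampleThriftObject {
        public enum _Fields { COL_A, COL_B, COL_C }

        private _Fields setField_;
        private Object value_;

        public SampleThriftObject() {}

        // Same shape as the generated setter: mutates the union, returns nothing.
        public void setCol_a(String value) {
            if (value == null) throw new NullPointerException();
            setField_ = _Fields.COL_A;
            value_ = value;
        }

        public _Fields getSetField() { return setField_; }
        public Object getFieldValue() { return value_; }
    }

    // Mirrors Range(1,10): values 1 through 9, one union object per value.
    public static List<SampleThriftObject> buildSamples() {
        List<SampleThriftObject> samples = new ArrayList<>();
        for (int v = 1; v < 10; v++) {
            SampleThriftObject obj = new SampleThriftObject();
            obj.setCol_a("a" + v); // void call: its result cannot be collected
            samples.add(obj);      // collect the object itself instead
        }
        return samples;
    }

    public static void main(String[] args) {
        for (SampleThriftObject obj : buildSamples()) {
            System.out.println(obj.getSetField() + " = " + obj.getFieldValue());
        }
    }
}
```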

Hope this helps!

Regards, Brandon.

comcmipi commented 9 years ago

Thank you very much Brandon,

worked perfectly!

All the best, Michal
