LucaCanali / Miscellaneous

Includes notes on using Apache Spark in general, notes on using Spark for Physics, how to run TPCDS on PySpark, how to create histograms with Spark, tools for performance testing CPUs, Jupyter notebooks examples for Spark, examples for Oracle and other DB systems.
Apache License 2.0

add jars to hbase server side #5

Open chenbodeng719 opened 1 year ago

chenbodeng719 commented 1 year ago

I added jars to the HBase server side following https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_HBase_Connector.md, but it does not work for me. I get the error below. Please help.

java.lang.NoSuchMethodError: org.apache.hadoop.hbase.spark.protobuf.generated.SparkFilterProtos$SQLPredicatePushDownFilter$Builder.addValueFromQueryArray(Lorg/apache/hbase/thirdparty/com/google/protobuf/ByteString;)Lorg/apache/hadoop/hbase/spark/protobuf/generated/SparkFilterProtos$SQLPredicatePushDownFilter$Builder;
    at org.apache.hadoop.hbase.spark.SparkSQLPushDownFilter.toByteArray(SparkSQLPushDownFilter.java:257)
    at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$.$anonfun$toSerializedTypedFilter$1(HBaseTableScanRDD.scala:273)
    at scala.Option.map(Option.scala:230)
    at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$.toSerializedTypedFilter(HBaseTableScanRDD.scala:273)
    at org.apache.hadoop.hbase.spark.datasources.HBaseTableScanRDD.$anonfun$getPartitions$2(HBaseTableScanRDD.scala:85)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    at org.apache.hadoop.hbase.spark.datasources.HBaseTableScanRDD.getPartitions(HBaseTableScanRDD.scala:77)
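
For reference, the failing code path is the one that serializes a pushed-down column filter, so a read with a filter on a mapped column is what triggers it. Below is only a sketch based on the read pattern in the linked Spark_HBase_Connector.md; the table name and column mapping are placeholders:

    // Sketch only: "testtable" and the column mapping are placeholders.
    val df = spark.read.format("org.apache.hadoop.hbase.spark")
      .option("hbase.columns.mapping", "rowKey STRING :key, col1 STRING cf:col1")
      .option("hbase.table", "testtable")
      .load()
    // A filter on a mapped column is turned into a SparkSQLPushDownFilter
    // and serialized for the region servers (the toByteArray call above).
    df.filter("col1 = 'someValue'").show()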
LucaCanali commented 1 year ago

A few hints for troubleshooting:

chenbodeng719 commented 1 year ago

@LucaCanali

Did I miss something?

LucaCanali commented 1 year ago

On the client side:

  • which version of Spark do you use?
  • do you run it with --jars $JAR1,$JAR2 --packages org.apache.hbase:hbase-shaded-mapreduce:2.4.9 ?

chenbodeng719 commented 1 year ago

On the client side:

  • which version of Spark do you use?
  • do you run it with --jars $JAR1,$JAR2 --packages org.apache.hbase:hbase-shaded-mapreduce:2.4.9 ?

If I set hbase.spark.pushdown.columnfilter to false, it works. If true, it does not.
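The same workaround can be expressed as a per-read option; a sketch, assuming the connector honors hbase.spark.pushdown.columnfilter in the reader options (placeholders as in the earlier sketch):

    val df = spark.read.format("org.apache.hadoop.hbase.spark")
      .option("hbase.columns.mapping", "rowKey STRING :key, col1 STRING cf:col1")
      .option("hbase.table", "testtable")
      // Disable server-side filter pushdown: filters are then evaluated
      // by Spark after the scan, bypassing SparkSQLPushDownFilter.
      .option("hbase.spark.pushdown.columnfilter", "false")
      .load()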

LucaCanali commented 1 year ago

Does it work from the spark-shell?

chenbodeng719 commented 1 year ago

Does it work from the spark-shell?

Same error

chenbodeng719 commented 1 year ago

My error is different from the error in the md file ("java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/spark/datasources/JavaBytesEncoder"). Maybe it's a server config error?

LucaCanali commented 1 year ago

Can you try using HBase 2.3.x ?

chenbodeng719 commented 1 year ago

Can you try using HBase 2.3.x ?

There is no HBase 2.3.x in the AWS EMR releases.

LucaCanali commented 1 year ago

Unfortunately I cannot test against HBase (server) 2.4 yet. I have just compiled the connector jars using Spark 3.3.1 and HBase 2.4.15 and linked the URLs at https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_HBase_Connector.md, not sure if it helps though.

chenbodeng719 commented 1 year ago

Considering that our queries are more complex than range queries and may not trigger the pushdown filter, we will ignore this for now. Thanks for your patience. We'll test it later. @LucaCanali

amilosevic-grid commented 1 year ago

hi @LucaCanali, have you managed to test against HBase 2.4? I am having this issue pop up:

    def catalog = s"""{
        |"table":{"namespace":"dev", "name":"amilosevic"},
        |"rowkey":"key",
        |"columns":{
        |"col0":{"cf":"rowkey", "col":"key", "type":"binary"},
        |"col1":{"cf":"a", "col":"col1", "type":"string"}
        |}
        |}""".stripMargin

    scala> spark.sqlContext.read.option("catalog",catalog).format("org.apache.hadoop.hbase.spark").load()
    java.lang.NullPointerException
        at org.apache.hadoop.hbase.spark.HBaseRelation.<init>(DefaultSource.scala:138)
        at org.apache.hadoop.hbase.spark.DefaultSource.createRelation(DefaultSource.scala:69)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
        ... 47 elided

on a local Spark 3.3.0 + HBase 2.4.14, using your connector jars for Spark 3.3.1 (placed in hbase/lib and also passed to spark-submit)
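
For comparison, the reads in the linked notes do not use a catalog JSON; they pass a hbase.columns.mapping option and set hbase.spark.use.hbasecontext to false. A sketch of that pattern (option names as used in the notes; table and column names are placeholders mirroring the catalog above) could help narrow down whether the NullPointerException is specific to the catalog code path:

    // Sketch following the mapping-based pattern in Spark_HBase_Connector.md;
    // the table and column names are placeholders.
    val df = spark.read.format("org.apache.hadoop.hbase.spark")
      .option("hbase.columns.mapping", "rowKey STRING :key, col1 STRING a:col1")
      .option("hbase.table", "dev:amilosevic")
      .option("hbase.spark.use.hbasecontext", false)
      .load()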

amilosevic-grid commented 1 year ago

after some inspection, the predicate pushdown seems to be failing in this class: https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/shaded/protobuf/ProtobufUtil.java

    Caused by: java.lang.ClassCastException: com.google.protobuf.LiteralByteString cannot be cast to org.apache.hbase.thirdparty.com.google.protobuf.ByteString
        at org.apache.hadoop.hbase.spark.SparkSQLPushDownFilter.parseFrom(SparkSQLPushDownFilter.java:208)
        ... 12 more