ClickHouse / spark-clickhouse-connector

Spark ClickHouse Connector built on the DataSourceV2 API
https://clickhouse.com/docs/en/integrations/apache-spark
Apache License 2.0

Failed to query fields of type Decimal and data is 0 #265

Closed eye-gu closed 1 year ago

eye-gu commented 1 year ago

spark sql: select number_col1 from clickhouse.zhouwq.user_test;


Log:

spark-sql> select number_col1 from clickhouse.zhouwq.user_test;
23/08/10 18:11:38 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.NullPointerException
  at java.math.BigDecimal.<init>(BigDecimal.java:831)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:290)
  at xenon.clickhouse.read.format.ClickHouseJsonReader.decodeValue(ClickHouseJsonReader.scala:74)
  at xenon.clickhouse.read.format.ClickHouseJsonReader.decode(ClickHouseJsonReader.scala:48)
  at xenon.clickhouse.read.format.ClickHouseJsonReader.decode(ClickHouseJsonReader.scala:33)
  at xenon.clickhouse.read.ClickHouseReader.get(ClickHouseReader.scala:89)
  at xenon.clickhouse.read.ClickHouseReader.get(ClickHouseReader.scala:29)
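
For context, the NullPointerException comes from handing BigDecimal a null string: Jackson's textValue returns null for non-textual nodes, while asText stringifies any node. A minimal sketch (not the connector's code) of that behavior, assuming ClickHouse serializes a Decimal value of 0 as a plain JSON number:

import com.fasterxml.jackson.databind.ObjectMapper

val mapper = new ObjectMapper()
// Simulate a result row where the Decimal column arrives as a JSON number.
val node = mapper.readTree("""{"number_col1": 0}""").get("number_col1")

node.isTextual  // false -- it is a numeric node
node.textValue  // null  -- textValue is only defined for textual nodes
node.asText     // "0"   -- asText stringifies any node

// Mirrors the failing call in ClickHouseJsonReader.decodeValue:
BigDecimal(node.textValue)  // throws NullPointerException inside java.math.BigDecimal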

eye-gu commented 1 year ago


Maybe using the asText method would be better.

pan3793 commented 1 year ago

I think asText does not help: if ClickHouse eliminates 0 values from the result, we should set them to a default value explicitly.

scala> BigDecimal("")
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:599)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 47 elided
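
A rough sketch of the "set it to a default value explicitly" idea, assuming the problematic case is a missing, null, or empty JSON value (names here are illustrative, not the connector's code):

import com.fasterxml.jackson.databind.JsonNode
import org.apache.spark.sql.types.{Decimal, DecimalType}

def decodeDecimalOrDefault(node: JsonNode, d: DecimalType): Decimal = {
  val text = if (node == null || node.isNull) "" else node.asText
  if (text.isEmpty)
    Decimal(BigDecimal(0), d.precision, d.scale)  // explicit default instead of NPE / NumberFormatException
  else
    Decimal(BigDecimal(text), d.precision, d.scale)
}
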
pan3793 commented 1 year ago

BTW, thanks for reporting this issue, PR is welcome~

eye-gu commented 1 year ago

I'm not sure what caused it. In my local environment, it works after changing the call to asText.

pan3793 commented 1 year ago

Could you try it this way?

      case d: DecimalType if jsonNode.isBigDecimal =>
        Decimal(jsonNode.decimalValue, d.precision, d.scale)
      case d: DecimalType if jsonNode.isFloat | jsonNode.isDouble =>
        Decimal(BigDecimal(jsonNode.doubleValue, new MathContext(d.precision)), d.precision, d.scale)
+     case d: DecimalType if jsonNode.isInt =>
+       Decimal(BigDecimal(jsonNode.intValue, new MathContext(d.precision)), d.precision, d.scale)
      case d: DecimalType =>
        Decimal(BigDecimal(jsonNode.textValue, new MathContext(d.precision)), d.precision, d.scale)
pan3793 commented 1 year ago

Seems there are other unhandled cases here, like isLong, isBigInteger ...
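
For illustration, a hedged sketch of what covering those node types could look like (this is not the connector's actual patch, and the real fix may differ):

import java.math.MathContext
import com.fasterxml.jackson.databind.JsonNode
import org.apache.spark.sql.types.{Decimal, DecimalType}

def decodeDecimalValue(jsonNode: JsonNode, d: DecimalType): Decimal = jsonNode match {
  case n if n.isBigDecimal =>
    Decimal(n.decimalValue, d.precision, d.scale)
  case n if n.isFloat || n.isDouble =>
    Decimal(BigDecimal(n.doubleValue, new MathContext(d.precision)), d.precision, d.scale)
  case n if n.isInt || n.isLong =>
    Decimal(BigDecimal(n.longValue, new MathContext(d.precision)), d.precision, d.scale)
  case n if n.isBigInteger =>
    Decimal(new java.math.BigDecimal(n.bigIntegerValue), d.precision, d.scale)
  case n =>
    Decimal(BigDecimal(n.textValue, new MathContext(d.precision)), d.precision, d.scale)
}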

eye-gu commented 1 year ago

isInt works

pan3793 commented 1 year ago

So that's the root cause. Would you like to send a PR to fix it, along with the other potential cases?

eye-gu commented 1 year ago

PTAL

eye-gu commented 1 year ago

I found that the binary format does not have this issue, nor the one in #264. Is it more recommended to use the binary format?
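
For anyone wanting to try the binary read path, a sketch of switching it on follows. The catalog class matches the 0.7.x package seen in the stack trace above, but the option name spark.clickhouse.read.format, the catalog config keys, and the host/port values are assumptions to verify against the connector docs for your version:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("clickhouse-binary-read")
  .config("spark.sql.catalog.clickhouse", "xenon.clickhouse.ClickHouseCatalog")
  .config("spark.sql.catalog.clickhouse.host", "127.0.0.1")   // placeholder
  .config("spark.sql.catalog.clickhouse.protocol", "http")
  .config("spark.sql.catalog.clickhouse.http_port", "8123")
  .config("spark.sql.catalog.clickhouse.database", "zhouwq")
  .config("spark.clickhouse.read.format", "binary")           // assumed option; default is the JSON reader
  .getOrCreate()

spark.sql("select number_col1 from clickhouse.zhouwq.user_test").show()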

paf91 commented 10 months ago

The issue is still present, any updates? @eye-gu why is this issue closed? It still reproduces in 0.7.2 with the 0.4.5 JDBC driver and Spark 3.4.2.

paf91 commented 10 months ago

(screenshot)

pan3793 commented 10 months ago

@paf91 I need to find some time to publish a new version containing this patch, maybe in a few days. You can try building the master branch until the release is out.

pan3793 commented 9 months ago

@paf91 FYI, 0.7.3 is available now, it includes this patch.

paf91 commented 9 months ago

@pan3793 could you please tell me, if you know, what the best practice is for saving a resulting DataFrame into ClickHouse? I couldn't find it in the docs https://housepower.github.io/spark-clickhouse-connector/quick_start/02_play_with_spark_sql/ and the best I could do is use clickhouse-jdbc like df.write.format("jdbc").options(options_map).mode("Overwrite").save
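
For reference, a sketch of the JDBC-based write described above, assuming the com.clickhouse.jdbc.ClickHouseDriver class from the clickhouse-jdbc 0.4.x driver; the URL, table, and credentials are placeholders, and df stands for whatever DataFrame is being saved:

val optionsMap = Map(
  "url"      -> "jdbc:clickhouse://127.0.0.1:8123/zhouwq",   // placeholder endpoint
  "driver"   -> "com.clickhouse.jdbc.ClickHouseDriver",
  "dbtable"  -> "user_test",                                 // placeholder table
  "user"     -> "default",
  "password" -> ""
)

df.write.format("jdbc").options(optionsMap).mode("overwrite").save()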

pan3793 commented 9 months ago

@paf91 Well, the "best practices" depend on a lot of stuff. The built-in JDBC data source is maintained by the Apache Spark community as a generic solution for interacting with RDBMS, just keep using it if it works well for your cases.

Instead of providing best practices, I'd like to list some points that I think are worth careful consideration.

Performance in a distributed system is a big topic; I wrote an article explaining how this connector improves query performance.

Convenience. For example, a data engineer may want to use CREATE/REPLACE TABLE ... AS SELECT ... or the equivalent df.writeTo(table).create() and df.writeTo(table).createOrReplace() to create a ClickHouse table automatically and save data from Spark into it; this is impossible without SPARK-43390.
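
A sketch of that DataFrameWriterV2 path (table names are placeholders; the engine and order_by table properties are assumptions to verify against the connector docs; spark is an active session with the ClickHouse catalog configured):

val df = spark.sql("select * from some_source_table")

// Append into an existing ClickHouse table registered through the catalog:
df.writeTo("clickhouse.zhouwq.user_test_copy").append()

// CTAS-style: create the target table from the DataFrame's schema.
df.writeTo("clickhouse.zhouwq.user_test_new")
  .tableProperty("engine", "MergeTree()")   // assumed property name
  .tableProperty("order_by", "id")          // assumed property name
  .create()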

Transactions. For a long time, transactions have been loose or even absent in the big data world. In a distributed system, the failure of a single node is normal, and without the guarantee of write transactions, the resulting retries may lead to data duplication. I shared some thoughts on this topic previously in https://github.com/housepower/spark-clickhouse-connector/issues/145