Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
https://qbeast.io/qbeast-our-tech/
Apache License 2.0

Unable to supersede IdentityToZeroTransformation and NullToZeroTransformation #224

Open Jiaweihu08 opened 11 months ago

Jiaweihu08 commented 11 months ago

What went wrong?

IdentityToZeroTransformation and NullToZeroTransformation both handle special cases in which LinearTransformer is used to map a numeric column whose values are either all identical or all null. Ideally, both should be superseded by LinearTransformation instances when "regular" data is appended. Currently, this does not happen.
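
To illustrate why such special cases are needed at all, here is a minimal sketch. It assumes that a linear mapping normalizes a value as (v - min) / (max - min); that formula and every name below (DegenerateRangeSketch, normalize, normalizeOrZero) are hypothetical and are not taken from the qbeast-spark code base.

```scala
// Hypothetical sketch, NOT the qbeast-spark implementation.
// Assumption: a linear mapping normalizes v into [0, 1] as (v - min) / (max - min).
object DegenerateRangeSketch {

  def normalize(v: Double, min: Double, max: Double): Double =
    (v - min) / (max - min)

  // Regular column: min < max, so the mapping is well defined.
  val regular: Double = normalize(5.0, 0.0, 10.0)

  // All values identical: min == max, so the denominator is zero
  // and floating-point division 0.0 / 0.0 yields NaN.
  val degenerate: Double = normalize(1.0, 1.0, 1.0)

  // Fallback in the spirit of the "ToZero" transformations:
  // map every value to 0.0 when the range is empty.
  def normalizeOrZero(v: Double, min: Double, max: Double): Double =
    if (min == max) 0.0 else normalize(v, min, max)
}
```

Under that assumption, a column with a single distinct value (or only nulls) gives no usable range, which is presumably why a dedicated fallback transformation is created until regular data arrives.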

How to reproduce?

For example, with IdentityToZeroTransformation (NullToZeroTransformation behaves similarly):

import org.apache.spark.sql.delta.DeltaLog
import io.qbeast.spark.delta.DeltaQbeastSnapshot
import io.qbeast.core.transform.IdentityToZeroTransformation
import spark.implicits._

case class IdentityCls(col1: String, col2: Int, col3: Double)

val idTestPath = "/tmp/test1/"
val identityData = (1 to 1000).map(_ => IdentityCls("1", 1, 1d)).toDS()
(identityData
    .write
    .mode("overwrite")
    .option("columnsToIndex", "col2")
    .option("cubeSize", "10000")
    .format("qbeast")
    .save(idTestPath)
)

(DeltaQbeastSnapshot(DeltaLog.forTable(spark, idTestPath)
  .update())
  .loadLatestRevision
  .transformations
  .head
  .isInstanceOf[IdentityToZeroTransformation]
) // true

// Appending regular data fails with:
// scala.MatchError at io.qbeast.core.transform.IdentityToZeroTransformation.transform(Transformation.scala:56)
((1 to 1000)
  .map(i => IdentityCls(s"$i", i, i.toDouble))
  .toDS()
  .write
  .mode("append")
  .format("qbeast")
  .save(idTestPath)
)

2. Branch and commit id:

main, f066acf

3. Spark version:

3.4.1

4. Hadoop version:

3.3.4

5. How are you running Spark?

Locally

osopardo1 commented 11 months ago

My initial thoughts on this:

  1. IdentityTransformation should NOT be superseded by another IdentityTransformation. (By definition, the value of Identity A is not covered by Identity B unless values a and b are the same.)
  2. IdentityTransformation should NOT be superseded by a NullToZeroTransformation, for the same reason as in case 1.
  3. IdentityTransformation might be superseded by a LinearTransformation if its min and max cover the identity value.

Now, in cases 1 and 2, we might need to trigger a different type of transformation, such as a LinearTransformation whose range includes the values from both A and B.
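
The three cases could be sketched roughly as follows. This is only an illustrative model: the types (TransformationSketch, IdentitySketch, NullSketch, LinearSketch) and the merge function are hypothetical stand-ins, not the classes in io.qbeast.core.transform. Two behaviours are assumptions beyond what is stated above: appending all-null data to an identity leaves it unchanged, and a linear range that does not cover the identity value is widened to include it.

```scala
// Hypothetical model of the proposed rules; these types are illustrative
// stand-ins, NOT the classes in io.qbeast.core.transform.
sealed trait TransformationSketch
final case class IdentitySketch(value: Double) extends TransformationSketch
case object NullSketch extends TransformationSketch
final case class LinearSketch(min: Double, max: Double) extends TransformationSketch

object SupersedeSketch {

  // Returns Some(replacement) when `current` has to be replaced in order to
  // cover `incoming`, and None when `current` can be kept as it is.
  def merge(
      current: TransformationSketch,
      incoming: TransformationSketch): Option[TransformationSketch] =
    (current, incoming) match {
      // Case 1: two identities with different values. Neither covers the
      // other, so a linear range spanning both values is triggered.
      case (IdentitySketch(a), IdentitySketch(b)) if a != b =>
        Some(LinearSketch(math.min(a, b), math.max(a, b)))
      // Case 1, same value: nothing to change.
      case (IdentitySketch(_), IdentitySketch(_)) => None
      // Case 2 (assumption): all-null data adds no value to cover.
      case (IdentitySketch(_), NullSketch) => None
      // Case 3: a linear range that covers the identity value supersedes it.
      case (IdentitySketch(a), l @ LinearSketch(min, max)) if min <= a && a <= max =>
        Some(l)
      // Case 3 otherwise (assumption): widen the range to include the value.
      case (IdentitySketch(a), LinearSketch(min, max)) =>
        Some(LinearSketch(math.min(a, min), math.max(a, max)))
      // Other combinations are outside the scope of this sketch.
      case _ => None
    }
}
```

For the reproduction above, the existing identity value is 1 and the appended col2 values span 1 to 1000, so under this model the result would be a linear range that covers both.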

Does it make sense?