Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
https://qbeast.io/qbeast-our-tech/
Apache License 2.0

Make files without Metadata readable with Qbeast #121

Closed: osopardo1 closed this issue 1 year ago

osopardo1 commented 2 years ago

To be more compatible with the underlying Table Formats and to set up an easier conversion to Qbeast, we should be able to process files that do not carry any Qbeast Metadata.

For example:

This is a File with Qbeast Metadata:

{
  "add": {
    "path": "d54ba0cd-c315-4388-9bce-fe573f5d0a64.parquet",
    ...
    "tags": {
      "state": "FLOODED",
      "cube": "gw",
      "revision": "1",
      "elementCount": "10836",
      "minWeight": "-1253864150",
      "maxWeight": "1254740128"
    }
  }
}

And this is a file without Qbeast Metadata:

{
  "add": {
    "path": "d54ba0cd-c315-4388-9bce-fe573f5d0a64.parquet",
    ...
    "tags": ""
  }
}

One possible solution is the following:

When reading the Delta Log and encountering a file without tags, we attach the following synthetic metadata:

val rootTags = Map(
                "maxWeight" -> Weight.MaxValue.value.toString,
                "minWeight" -> Weight.MinValue.value.toString,
                "cube" -> "",
                "state" -> State.FLOODED,
                "revision" -> lastRevisionID.toString,
                "elementCount" -> "0")

This means we put all the unknown files into the root cube of the last revision, with a weight range of [MinValue, MaxValue] (i.e., the full sampling range [0.0, 1.0]).
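As an illustration only, here is a minimal sketch of how such synthetic tags could be attached while reading the Delta Log. AddFileEntry and withSyntheticTags are hypothetical names, and a plain Int stands in for qbeast's Weight, so this is not the actual qbeast-spark or Delta Lake API:

// Simplified stand-in for a Delta log "add" action; the real AddFile type has more fields.
case class AddFileEntry(path: String, tags: Map[String, String])

// Hypothetical helper: attach synthetic root-cube tags to files without Qbeast metadata.
def withSyntheticTags(file: AddFileEntry, lastRevisionID: Long): AddFileEntry =
  if (file.tags == null || file.tags.isEmpty) {
    val rootTags = Map(
      "maxWeight" -> Int.MaxValue.toString,  // stands in for Weight.MaxValue.value
      "minWeight" -> Int.MinValue.toString,  // stands in for Weight.MinValue.value
      "cube" -> "",                          // the root cube
      "state" -> "FLOODED",
      "revision" -> lastRevisionID.toString,
      "elementCount" -> "0")
    file.copy(tags = rootTags)
  } else file

Since the synthetic weight range covers the whole [0.0, 1.0] interval, any sample would have to consider such files until they are properly indexed.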

Questions/design decisions:

alexeiakimov commented 2 years ago

Regarding elementCount:

  1. If the DeltaTable file has Stats, then the value can be obtained from Stats.num_records (see the sketch after this list).
  2. The number of elements in the file is used by the sharing protocol to limit the number of records a client can download. In more detail, the sharing server keeps adding file links to the query result while the sum of elementCount is less than the specified limit.
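As a sketch of the first point: assuming the per-file stats are available as a JSON string and expose a record-count field (Delta Lake writes it as numRecords; the exact name is an assumption here), the element count could be recovered without touching the data:

import com.fasterxml.jackson.databind.ObjectMapper

// Extract the record count from a per-file stats JSON string, if present.
// The "numRecords" field name is an assumption based on Delta Lake's column stats.
def elementCountFromStats(statsJson: String): Option[Long] =
  Option(statsJson).filter(_.nonEmpty).flatMap { json =>
    Option(new ObjectMapper().readTree(json).get("numRecords")).map(_.asLong())
  }

// e.g. elementCountFromStats("""{"numRecords":10836}""") == Some(10836L)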
alexeiakimov commented 2 years ago

Maybe I am wrong, but the last revision can have (min, max) ranges of the values (later used by the linear transformation) which are smaller than the corresponding value ranges of the records in the file. As I remember, if during indexing the given data does not fit the latest revision space, then a new revision is created.
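To make the concern concrete, here is a minimal sketch (with hypothetical names, not the qbeast-spark classes) of a per-column linear transformation defined by a revision's (min, max) range; a record outside that range maps outside [0.0, 1.0], which is the situation that would normally force a new revision:

// Hypothetical per-column range stored by a revision and used for the linear transformation.
case class LinearRange(min: Double, max: Double) {
  // Values inside [min, max] map into [0.0, 1.0].
  def transform(value: Double): Double = (value - min) / (max - min)

  // A record outside the range does not fit the revision space.
  def fits(value: Double): Boolean = value >= min && value <= max
}

val range = LinearRange(min = 0.0, max = 100.0)
range.transform(150.0)  // 1.5, i.e. outside [0.0, 1.0]
range.fits(150.0)       // false: such data would require a new, wider revision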

alexeiakimov commented 2 years ago

Let me formulate the last item in a different way: can we treat files without Qbeast metadata as indexed? Possibly they are indexed badly, but if they do not violate any invariant, then it is safe to add them to the index as if they were indexed.

osopardo1 commented 2 years ago
  1. On element count, unfortunately, we cannot assume that the DeltaTable has Stats, although they are a useful shortcut for the cases where they exist. If no Stats.num_records is written, we could compute a count() for the file, which has a performance cost. Another option is to check whether the Parquet files themselves carry metadata we could read to avoid that computation (see the sketch after this list).
  2. We want to be able to Convert to Qbeast without the overhead of indexing, and also let the user run other Lakehouse operations of the underlying format without losing information. Yes, the goal of this issue is to treat those files as indexed (badly, as you said) and to slowly index them correctly as the index grows. There can be two cases:
    1. A Revision already exists. In this case, the user has performed an operation in Delta that affected the DeltaLog, and now the table cannot be read correctly. If we put those files in the last revision, we need to ensure they fall within its [min, max] range, but doing that computation at read time is too expensive (if we don't have any metadata like Stats). That's why putting them in the last revision without knowledge of the space could violate the constraint.
    2. A Revision does not exist. This is the case in which we convert the table from scratch to Qbeast. Here we have more freedom to write the DeltaLog with extra metadata such as min-max and element count, but that process belongs more to issue #102.
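For the element-count part of point 1, a hedged sketch of the two fallbacks mentioned above: Parquet files do record the row count in their footer, so it can be read without scanning the data, and a full count() remains the expensive last resort. Paths and the SparkSession are assumed to be provided by the caller:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.spark.sql.SparkSession

// Cheap option: read the row count from the Parquet footer metadata (no data scan).
def rowCountFromFooter(path: String, conf: Configuration): Long = {
  val reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path), conf))
  try reader.getRecordCount finally reader.close()
}

// Expensive fallback: scan the file and count the records with Spark.
def rowCountByScan(spark: SparkSession, path: String): Long =
  spark.read.parquet(path).count()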
osopardo1 commented 1 year ago

UPDATE

From the latest conversations, we agreed that this issue is a dependency of #102.

osopardo1 commented 1 year ago

Fixed in #152.