Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
https://qbeast.io/qbeast-our-tech/
Apache License 2.0

Pull request #455: Next in line Rollup (addresses issue #454)

Open cugni opened 2 weeks ago

cugni commented 2 weeks ago

Description

As described in issue #454, I'm having issues indexing high-dimensional spaces because the rollup puts all the data into the father cube, which can result in very large files. This PR addresses the problem by rolling up between siblings before rolling up to the father. This results in better file sizes, but, as pointed out by @Jiaweihu08, it could make the algorithm more computation-intensive.

Type of change

This PR changes the Rollup algorithm so that it first tries to pack sibling cubes together and only then packs the remainder with the father.
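
For context, here is a minimal, self-contained sketch of the sibling-first idea in Scala. The cube encoding (one digit per level), the sizes, and the helper names are toy stand-ins for illustration only, not the actual qbeast-spark CubeId or rollup code:

import scala.collection.mutable

object NextInLineRollupSketch {

  type Cube = String // toy cube id: "" is the root, one child digit per level

  def parent(c: Cube): Option[Cube] =
    if (c.isEmpty) None else Some(c.dropRight(1))

  // Next sibling in child-index order, if any (toy stand-in for the z-order).
  def nextSibling(c: Cube, fanout: Int): Option[Cube] = {
    if (c.isEmpty) None
    else {
      val next = c.last.asDigit + 1
      if (next < fanout) Some(c.dropRight(1) + next.toString) else None
    }
  }

  /** For every input cube, returns the cube its records end up in. */
  def rollup(sizes: Map[Cube, Long], target: Long, fanout: Int): Map[Cube, Cube] = {
    val groups = mutable.Map(sizes.toSeq: _*)   // remaining size per surviving cube
    val mapping = mutable.Map.empty[Cube, Cube] // merged cube -> destination
    // Deepest cubes first; within a level, follow the sibling order.
    val ordered = sizes.keys.toSeq.sortBy(c => (-c.length, c))
    for (cube <- ordered if groups.contains(cube)) {
      val size = groups(cube)
      if (size < target) {
        // Try the next sibling first, and only fall back to the father.
        val dest = nextSibling(cube, fanout).filter(groups.contains).orElse(parent(cube))
        dest.foreach { d =>
          groups(d) = groups.getOrElse(d, 0L) + size
          groups.remove(cube) // a merged group must not be considered again
          mapping(cube) = d
        }
      }
    }
    // Follow merge chains so every original cube maps to its final destination.
    def resolve(c: Cube): Cube = mapping.get(c).map(resolve).getOrElse(c)
    sizes.keys.map(c => c -> resolve(c)).toMap
  }
}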

Checklist:

osopardo1 commented 2 weeks ago

I like the idea of making the code more efficient, but this line concerns me:

but as pointed out by @Jiaweihu08, it could make the algorithm more computation-intensive.

It makes it seem very use-case dependent, and perhaps we don't yet know its full implications. I would suggest making it configurable and disabling it by default until we are confident enough.

We can do something like:

df.write.format("qbeast").option("rollUpInLine", "true")
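
For instance, something along these lines (columnsToIndex is the existing required qbeast-spark option; the column name, the path, and the rollUpInLine flag itself are illustrative, since the flag is only a proposal):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rollup-in-line-demo").getOrCreate()
val df = spark.range(100000).toDF("id")

df.write
  .format("qbeast")
  .option("columnsToIndex", "id")  // existing qbeast-spark option
  .option("rollUpInLine", "true")  // proposed flag, disabled by default
  .save("/tmp/qbeast/rollup-in-line-demo")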

osopardo1 commented 2 weeks ago

PS: you can set the main branch as the destination now 😄

fpj commented 2 weeks ago

I also read this sentence in the original issue description, which caught my attention:

I suspect this is caused by the roll-up algorithms, which simply push data into the father cube, and, in the case of 8 dimensions, it means a rollup could put together 2^8 + 1 = 257 cubes.

Maybe it is just an unfortunate choice of words, but "suspect" implies that we are not sure. How much confidence do we have that this actually solves the observed problem?

cugni commented 2 weeks ago

@Jiaweihu08's concerns are valid, but I suspect it is not a real problem. The overhead would be minimal in a low-dimensional space, while in a high-dimensional one the problem of having huge files outweighs any additional computation cost.

In any case, let's run this against the main version and measure it. If it is considerably slower, we should make it faster rather than optional. With a more invasive code change, using a Trie instead of a HashMap, we should be able to make the new algorithm no more computationally intensive than the old one. Either way, this should not be a configuration parameter, because the indexing results are always at least as good as the old version's (in the worst case they are equal).
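
To make the Trie idea concrete, here is a rough sketch with illustrative types, not the qbeast-spark CubeId implementation. With a HashMap keyed by cube id, finding which siblings exist means probing up to 2^d candidate ids per cube (256 for 8 dimensions); a trie over the cube path keeps all existing children of a parent under one node, so we only touch the siblings that actually exist:

import scala.collection.mutable

final class CubeTrie {

  private final class Node {
    val children = mutable.Map.empty[Int, Node] // child index -> node
    var size: Option[Long] = None               // defined iff the cube holds data
  }

  private val root = new Node

  def insert(path: Seq[Int], size: Long): Unit = {
    var node = root
    path.foreach(i => node = node.children.getOrElseUpdate(i, new Node))
    node.size = Some(size)
  }

  // Existing siblings of `path` and their sizes: walk down to the parent node
  // once, then enumerate only the children that actually exist, instead of
  // probing every possible sibling id.
  def existingSiblings(path: Seq[Int]): Seq[(Seq[Int], Long)] = {
    if (path.isEmpty) Seq.empty
    else {
      val parentPath = path.dropRight(1)
      val parentNode = parentPath.foldLeft(Option(root)) { (node, idx) =>
        node.flatMap(_.children.get(idx))
      }
      parentNode.toSeq.flatMap { node =>
        node.children.toSeq.collect {
          case (idx, child) if idx != path.last && child.size.isDefined =>
            (parentPath :+ idx, child.size.get)
        }
      }
    }
  }
}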

@fpj This is the list of files, ordered by size, produced with the new version: the largest file is 15 GB, while with the old version the biggest file was 220 GB, the second 106 GB, etc. (in this dataset some files are bigger because some rows are large, around 256 MB).

[qbst363636@glogin2 0.8.0_8dim_10_bites_10kcs_2_8d_filtered]$ ls -lSh
total 463G
-rw-r--r-- 1 qbst363636 qbst01   15G Oct 27 19:28 4e416182-1bf1-4dd3-8338-2a298efe1188.parquet
-rw-r--r-- 1 qbst363636 qbst01  9.2G Oct 27 19:28 803819f6-4869-4802-843f-442ef5565104.parquet
-rw-r--r-- 1 qbst363636 qbst01  5.7G Oct 27 19:27 c74c8164-78a3-44e9-aad6-56a38a9bfe1a.parquet
-rw-r--r-- 1 qbst363636 qbst01  4.3G Oct 27 19:27 fd2f0703-b3f4-4ff3-bcfe-f2581edf1c3c.parquet
-rw-r--r-- 1 qbst363636 qbst01  4.0G Oct 27 19:27 d9d3c235-a6e4-4e56-8c22-fccd7b81eee1.parquet
-rw-r--r-- 1 qbst363636 qbst01  4.0G Oct 27 19:27 58774def-88f8-46b7-904e-cd5f842e52ed.parquet
-rw-r--r-- 1 qbst363636 qbst01  3.8G Oct 27 19:27 74a985e7-7170-4636-a351-30fc75443d0c.parquet
-rw-r--r-- 1 qbst363636 qbst01  3.6G Oct 27 19:27 e49aa2c4-e702-4df3-8800-d70c5f159d7d.parquet
-rw-r--r-- 1 qbst363636 qbst01  3.3G Oct 27 19:27 34f1ee42-9f33-481a-aa46-fe1d8af2473e.parquet

When I have time, I'll run a more detailed analysis. You can also use the script posted in #454 to compare the two versions. You should look at the number of cubes per file and the actual file sizes (which are not reported in the IndexMetrics right now).

Jiaweihu08 commented 2 days ago

Concern number 1:

The priority queue is ordered by depth, while nextSibling probably follows the z-order.

For the algorithm to work correctly, we must ensure the cubes are checked in the same order, i.e., that of the z-order.

Take a cube, say A, and its child cubes A1, A2, A3, and A4, with A1.nextSibling = A2, A2.nextSibling = A3, etc.

If the first four queue.dequeue calls give us A4, A3, A2, and A1, then this algorithm will produce the same result as directly putting everything into their parent.

Upon closer inspection, we see that a cube's group is not removed from groups. So, A3 can still consider A4's group after it has been merged with A, which is erroneous.
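
To make the failure mode concrete, here is a toy sequence over a plain mutable map (the names are illustrative, not the PR's code):

import scala.collection.mutable

// Remaining sizes of cubes still waiting to be placed.
val groups = mutable.Map("A" -> 10L, "A3" -> 5L, "A4" -> 5L)

// Step 1: A4 is rolled up into its parent A, but its entry is not removed.
groups("A") += groups("A4")   // A now also holds A4's records

// Step 2: A3 looks for its next sibling and still finds A4 in the map, so it
// would roll into A4 even though A4's records already live in A.
groups.get("A4")              // Some(5): a stale entry that leads A3 astray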

Concern number 2:

When indexing a large number of columns, the number of sibling cubes to consider can be high. This is true even when the sibling cubes don't exist in groups.

cugni commented 2 days ago

Thanks @Jiaweihu08! For concern number 1, I guess it would suffice to sort the cubes by depth and alphabetical order. Is that right?

I don't understand the comment: "Upon closer inspection, we see that a cube's group is not removed from groups. So, A3 can still consider A4's group after it has been merged with A, which is erroneous." This would happen only if they are not sorted, wouldn't it?

On concern number 2: yes, with many dimensions that could be an issue, but we should measure it. As I commented, it can easily be fixed by using a Trie instead of a HashMap, or maybe simply by iterating over the queue.

I'll try to fix this without too many changes.

Jiaweihu08 commented 1 day ago

Upon closer inspection, we see that a cube's group is not removed from groups. So, A3 can still consider A4's group after it has been merged with A, which is erroneous.

As the code is written, A4 remains in the groups after merging with A. If you then try to find a rollup cube for A3, it will look for A4, whose values are already in A, but A3 is unaware of it.

Jiaweihu08 commented 1 day ago

Thanks @Jiaweihu08! For concern number 1, I guess it would suffice to sort the cubes by depth and alphabetical order. Is that right?

CubeIds can be sorted; just create the priority queue according to CubeIds, in reverse order. But we need to check whether the order in which cubeIds are popped from the queue is actually in line with nextSibling.
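
As a minimal illustration of the PriorityQueue mechanics (using a stand-in Cube case class instead of the real CubeId, and without claiming this particular order is the one that matches nextSibling):

import scala.collection.mutable

case class Cube(depth: Int, path: String)

// Order by depth, then path. Scala's PriorityQueue is a max-heap, so passing
// the reversed ordering makes dequeue() return the smallest cube first.
val byDepthThenPath: Ordering[Cube] = Ordering.by((c: Cube) => (c.depth, c.path))
val queue = mutable.PriorityQueue(
  Cube(2, "01"), Cube(1, "0"), Cube(2, "00")
)(byDepthThenPath.reverse)

println(queue.dequeue()) // Cube(1,0)
println(queue.dequeue()) // Cube(2,00)
println(queue.dequeue()) // Cube(2,01)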