COOL-cohort / COOL

the source code of the COOL system
https://www.comp.nus.edu.sg/~dbsystem/cool/
Apache License 2.0

Add cube meta #75

Closed hugy718 closed 2 years ago

hugy718 commented 2 years ago

This PR aims to address Issue #67 .

The overall plan is to add a cubemeta file that describes the value range of each field of a cube. The metachunk of each cublet will hold only the information relevant to that cublet. (This also implies that the same string value of the same field can have different ids in different cublets. This is similar to the local id in a datachunk, which improves compression; now we also have a cublet-local id.)
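To illustrate the cublet-local id idea, here is a minimal sketch. `CubletLocalDictionary` is a hypothetical class, not part of the COOL code base, and it assumes ids are handed out in first-seen order, which may differ from how the real metachunk assigns them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-cublet dictionary: ids are only meaningful inside one cublet. */
class CubletLocalDictionary {
  private final Map<String, Integer> idOf = new HashMap<>();
  private final List<String> valueOf = new ArrayList<>();

  /** Returns the cublet-local id of value, assigning the next free id on first sight. */
  int encode(String value) {
    Integer id = idOf.get(value);
    if (id == null) {
      id = valueOf.size();
      idOf.put(value, id);
      valueOf.add(value);
    }
    return id;
  }

  /** Maps a cublet-local id back to its string value. */
  String decode(int localId) {
    return valueOf.get(localId);
  }
}
```

With one dictionary per cublet, a value such as "SG" can get id 1 in a cublet that saw "CN" first and id 0 in a cublet where it is the first value seen, so ids must always be decoded against the dictionary of the cublet they came from.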

MetaChunkWS and MetaFieldWS now keep the field meta info across a load process in order to generate the cubemeta file at the end.

A new CubeMetaRS is added to read the written-out cubemeta and describe each field in a JSON string.
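As a rough sketch of the idea (not the actual MetaChunkWS, MetaFieldWS, or CubeMetaRS code), the following hypothetical `CubeRangeMeta` class folds the per-cublet range of a numeric field into a cube-wide range and renders the result as JSON. The class name, method names, and JSON layout are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical cube-level accumulator for numeric range fields. */
class CubeRangeMeta {
  // field name -> {min, max} observed over all cublets loaded so far
  private final Map<String, long[]> ranges = new LinkedHashMap<>();

  /** Folds one cublet's [min, max] for a field into the cube-wide range. */
  void mergeCublet(String field, long min, long max) {
    long[] r = ranges.get(field);
    if (r == null) {
      ranges.put(field, new long[] {min, max});
    } else {
      r[0] = Math.min(r[0], min);
      r[1] = Math.max(r[1], max);
    }
  }

  /** Renders the ranges as JSON; assumes field names need no escaping. */
  String toJson() {
    StringBuilder sb = new StringBuilder("{");
    for (Map.Entry<String, long[]> e : ranges.entrySet()) {
      if (sb.length() > 1) {
        sb.append(',');
      }
      sb.append('"').append(e.getKey()).append("\":{\"min\":")
          .append(e.getValue()[0]).append(",\"max\":")
          .append(e.getValue()[1]).append('}');
    }
    return sb.append('}').toString();
  }
}
```

Merging a "time" field with range 10 to 20 from one cublet and 5 to 15 from another yields `{"time":{"min":5,"max":20}}`.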

hugy718 commented 2 years ago

Work in progress. I will add loading of cubemeta in CoolModel and expose an API there to retrieve the meta info of cube fields.

hugy718 commented 2 years ago

Limiting the scope of the metachunk will affect how CoolTupleReader works. It assumes that the string field ids of all data chunks are consistent and stored in the metachunk of the last cublet. This requires us to process each cublet separately and, at least for the user key, to keep the global id somewhere. The former is easy, but we are likely to change the representation of a cohort to address Issue #76, after which I think we won't need to keep the global id anywhere. (By the way, supporting update already breaks CoolTupleReader anyway.)

hugy718 commented 2 years ago

@KimballCai @NLGithubWP @Zrealshadow. The logic is done. Now, under a cube version directory, there is a cubemeta file, like the old dimension file but generated during loading. After we fix the issues mentioned above, we can think about merging this. The draft is up for your preview now, while we finalize PR #72.

hugy718 commented 2 years ago

CoolTupleReader will not function anyway once we add support for append. It originally assumed that each user appears only once, with monotonically increasing global ids. So there is no need to halt this PR for that. I will address CoolTupleReader in a separate PR, in which it takes a CohortRS, adjusts the id-to-user mapping on each cublet switch, and handles disjoint user sections.
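A minimal sketch of what adjusting the id-to-user mapping on a cublet switch could look like. `CubletScanSketch` and its nested `Cublet` type are hypothetical stand-ins, not the real CoolTupleReader or CohortRS; the sketch only assumes that each cublet carries its own local id-to-user table.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CubletScanSketch {
  /** Hypothetical cublet: a local id-to-user table plus the local user id of each row. */
  static final class Cublet {
    final List<String> users; // index = cublet-local user id
    final int[] rowUserIds;

    Cublet(List<String> users, int[] rowUserIds) {
      this.users = users;
      this.rowUserIds = rowUserIds;
    }
  }

  /** Counts rows per user key, swapping the id-to-user mapping at each cublet boundary. */
  static Map<String, Integer> rowsPerUser(List<Cublet> cublets) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (Cublet cublet : cublets) {
      List<String> idToUser = cublet.users; // mapping is valid for this cublet only
      for (int localId : cublet.rowUserIds) {
        // Keying on the user string merges disjoint sections of the same user.
        counts.merge(idToUser.get(localId), 1, Integer::sum);
      }
    }
    return counts;
  }
}
```

Because results are keyed by the decoded user key rather than by id, a user who appears in several cublets under different local ids is still treated as one user, without any global id.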

hugy718 commented 2 years ago

I have made it compatible with the addition of invariant fields.

KimballCai commented 2 years ago

As for the API to get all attributes of columns from a certain cube, please use the generateJson function from cool-core/src/main/java/com/nus/cool/core/io/readstore/CubeMetaRS.java.