Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
https://qbeast.io/qbeast-our-tech/
Apache License 2.0

Issue 193: Update Index Metrics #330

Closed: Jiaweihu08 closed 2 weeks ago

Jiaweihu08 commented 1 month ago

Description

Fixes #193. This PR adds new IndexMetrics statistics to account for multi-block files.

Changes:

  1. Add transformer type for the indexing columns
  2. Add revision Id
  3. Add revision bytes
  4. Add block and file counts
  5. Add quartiles for several metrics: element counts per cube and block, file bytes
  6. Add quartiles for level-wise statistics
    
    OTree Index Metrics:
    revisionId: 1
    elementCount: 309008
    dimensionCount: 2
    desiredCubeSize: 3000
    indexingColumns: price:linear,user_id:linear
    height: 8 (4)
    avgFanout: 3.94 (4.0)
    cubeCount: 230
    blockCount: 848
    fileCount: 61
    bytes: 11605372

    Multi-block files stats:
    cubeElementCountStats: (count: 230, avg: 1343, std: 1186, quartiles: (1,292,858,2966,3215))
    blockElementCountStats: (count: 848, avg: 364, std: 829, quartiles: (1,6,21,42,3120))
    fileBytesStats: (count: 61, avg: 190252, std: 57583, quartiles: (113168,139261,182180,215136,332851))
    blockCountPerCubeStats: (count: 230, avg: 3, std: 1, quartiles: (1,4,4,4,4))
    blockCountPerFileStats: (count: 61, avg: 13, std: 43, quartiles: (1,3,5,5,207))
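The averages in the multi-block files stats can be cross-checked against the totals in the summary. The sketch below does that arithmetic; it assumes each reported `avg` is the exact mean truncated to an integer, which is an inference from these numbers rather than something the PR states. The variable names are illustrative only.

```python
# Totals copied from the "OTree Index Metrics" summary.
element_count = 309008
cube_count = 230
block_count = 848
file_count = 61
total_bytes = 11605372

# Assumption: each reported avg is the exact mean truncated to an integer.
derived = {
    "cubeElementCountStats": element_count // cube_count,
    "blockElementCountStats": element_count // block_count,
    "fileBytesStats": total_bytes // file_count,
    "blockCountPerCubeStats": block_count // cube_count,
    "blockCountPerFileStats": block_count // file_count,
}

# avg values copied from the "Multi-block files stats" lines.
reported = {
    "cubeElementCountStats": 1343,
    "blockElementCountStats": 364,
    "fileBytesStats": 190252,
    "blockCountPerCubeStats": 3,
    "blockCountPerFileStats": 13,
}

for name, value in derived.items():
    status = "ok" if value == reported[name] else "MISMATCH"
    print(f"{name}: derived avg={value}, reported avg={reported[name]} -> {status}")
```

All five derived averages match the reported ones under that truncation assumption.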

    Inner cubes depth-wise stats:
    cubeElementCountStats: (count: 58, avg: 3069, std: 58, quartiles: (2942,3026,3070,3109,3215))
    blockElementCountStats: (count: 232, avg: 767, std: 1278, quartiles: (14,26,33,2859,3120))
    depth  avgCubeElementCount  cubeCount  blockCount  cubeElementCountStd  cubeElementCountQuartiles   avgWeight
    0      3026                 1          4           0                    (3026,3026,3026,3026,3026)  0.009708486732493268
    1      3120                 2          8           72                   (3048,3048,3193,3193,3193)  0.1777036037942636
    2      3088                 3          12          91                   (3003,3003,3048,3215,3215)  0.29339823296605566
    3      3078                 7          28          45                   (3016,3046,3070,3110,3168)  0.3028351948969015
    4      3075                 12         48          80                   (2942,3027,3084,3148,3214)  0.3853194593029685
    5      3056                 21         84          45                   (2975,3016,3066,3090,3127)  0.557916611995329
    6      3071                 12         48          37                   (2999,3055,3084,3111,3115)  0.71568557500894

    Leaf cubes depth-wise stats:
    cubeElementCountStats: (count: 172, avg: 761, std: 733, quartiles: (1,163,584,1159,2966))
    blockElementCountStats: (count: 616, avg: 212, std: 498, quartiles: (1,4,10,36,2877))
    depth  avgCubeElementCount  cubeCount  blockCount  cubeElementCountStd  cubeElementCountQuartiles   avgWeight
    1      47                   2          4           46                   (1,1,94,94,94)              1515.9574468085107
    2      358                  4          12          389                  (4,36,424,970,970)          210.8753971341503
    3      1010                 5          20          994                  (314,330,677,767,2966)      5.5998339972396405
    4      836                  14         45          951                  (4,38,315,1612,2552)        106.06026824809581
    5      978                  27         103         827                  (63,299,616,1681,2666)      8.58027316980015
    6      811                  72         267         686                  (7,280,626,1253,2889)       21.17595889195726
    7      580                  48         165         590                  (3,143,515,779,2246)        76.10800345961682
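Each stats line and each depth-wise row reports a count, an average, a standard deviation, and five quantile values (min, Q1, median, Q3, max). The sketch below shows one way to compute such a summary and reproduces two of the depth-wise rows. It is not the qbeast-spark implementation: the function name `summarize` is hypothetical, and the population standard deviation, the truncation to integers, and the quantile index rule are all assumptions inferred from the sample output.

```python
import math


def summarize(values):
    """Return (count, avg, std, quartiles) in the style of the *Stats lines.

    Assumptions inferred from the sample output, not from the library source:
    - avg and std are truncated to integers
    - std is the population standard deviation (divide by n, not n - 1)
    - each quantile is an existing element at index floor(p * n), clamped
    """
    ordered = sorted(values)
    n = len(ordered)
    mean = sum(ordered) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in ordered) / n)
    quartiles = tuple(
        ordered[min(int(p * n), n - 1)] for p in (0.0, 0.25, 0.5, 0.75, 1.0)
    )
    return n, int(mean), int(std), quartiles


# Leaf cubes, depth 1: two cubes holding 1 and 94 elements.
print(summarize([1, 94]))
# -> (2, 47, 46, (1, 1, 94, 94, 94))

# Inner cubes, depth 2: three cubes holding 3003, 3048 and 3215 elements.
print(summarize([3003, 3048, 3215]))
# -> (3, 3088, 91, (3003, 3003, 3048, 3215, 3215))
```

The per-cube element counts used here are read off the quartile columns, which is only unambiguous for rows with two or three cubes.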


## Checklist:

Here is the list of things you should do before submitting this pull request:

- [ ] New feature / bug fix has been committed following the [Contribution guide](https://github.com/Qbeast-io/qbeast-spark/blob/main/CONTRIBUTING.md).
- [ ] Add logging to the code following the [Contribution guide](https://github.com/Qbeast-io/qbeast-spark/blob/main/CONTRIBUTING.md).
- [ ] Add comments to the code (make it easier for the community!).
- [x] Change the documentation.
- [x] Add tests.
- [x] Your branch is updated to the main branch (dependent changes have been merged).