Open That3Percent opened 4 years ago
I think I'll have a clearer picture of #5 once #2 is resolved (assuming no other blockers for me), so I'm glad to hear that #5 is a ways off, as I don't want to lead you down a path that isn't actually needed.
In regards to append only/separate files - I'll note that my data has about 124M records in one data set (9 records for every ball hit into play in any affiliated baseball game since 2005) and the other has about 55M records (every pitch since 2005). I don't know what the implications of this are, or if it even matters? There are a lot of natural ways to chunk the data (years, months, days, MLB vs AAA vs AA etc.) so this is something to explore once I get it to work.
I think you should be unblocked now with #2 closed.
With the way Tree-Buf is implemented now, chunking is a great idea. Hopefully, this can be automatic someday. But with your need to append data in batches, manual chunking just makes sense. Keep me up-to-date on which chunking strategy works best with your data set, since that will inform the design of auto-chunking in #4.
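A minimal sketch of what manual chunking by year could look like, assuming each record carries a date string. The `Record` struct, the `chunk_by_year` helper, and the `YYYY-MM-DD` date layout are all hypothetical and only for illustration; each resulting chunk would then be handed to the Tree-Buf writer and stored as its own file.

```rust
use std::collections::BTreeMap;

/// Hypothetical record; the real data set has many more columns.
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub game_date: String, // assumed to look like "2005-04-03"
    pub batter: u32,
}

/// Groups records by the four-character year prefix of `game_date`.
/// Records whose date is too short to hold a year go under "unknown".
pub fn chunk_by_year(records: Vec<Record>) -> BTreeMap<String, Vec<Record>> {
    let mut chunks: BTreeMap<String, Vec<Record>> = BTreeMap::new();
    for record in records {
        let year = record
            .game_date
            .get(0..4)
            .unwrap_or("unknown")
            .to_string();
        chunks.entry(year).or_default().push(record);
    }
    chunks
}
```

Because a `BTreeMap` iterates in key order, the chunks come out sorted by year, which makes it straightforward to append a new season as a new file without touching the older ones.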
Once you've got a file written, you can use an internal API to get diagnostics on the breakdown of the size of data per column:
let tree = tree_buf::internal::read_root(&tb_bytes);
dbg!(tree.unwrap());
This will help inform us of what is working well for the compression and what future improvements will have the biggest bang for the buck. My bet is that RLE compression in #7 will make a big difference.
It works! I included the assert_eq! from your Readme and that passed.
I tested it on a subset of the data (about 10% of 124 Million records). Converting to tree-buf took about 45 seconds (which is reasonable from my perspective).
Original CSV: 48 GB (so it pulled in about 4.8GB)
Tableau .hyper: 6.2 GB total (when it converted the entire CSV)
tree-buf: 2.7 GB on the sample set (compared to the roughly 4.8 GB sample)
Compression is a WIP I assume?
Edit: forgot to run the diagnostics, let me do that now
tree.unwrap() = Array {
len: 12400000,
values: Object {
fields: {
"teamId": Integer(
ArrayInteger {
bytes: Bytes(
18719224,
),
encoding: Simple16,
},
),
"fielderHeightIn": Integer(
ArrayInteger {
bytes: Bytes(
12400000,
),
encoding: Simple16,
},
),
"venueLeftCenter": Nullable {
opt: Bytes(
1550000,
),
values: Integer(
ArrayInteger {
bytes: Bytes(
3655860,
),
encoding: Simple16,
},
),
},
"batter": Integer(
ArrayInteger {
bytes: Bytes(
37200000,
),
encoding: PrefixVarInt,
},
),
"gameType": Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
0,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "r",
data: Void,
},
],
},
"venueId": Integer(
ArrayInteger {
bytes: Bytes(
22177800,
),
encoding: Simple16,
},
),
"runs": Integer(
ArrayInteger {
bytes: Bytes(
1877132,
),
encoding: Simple16,
},
),
"batterBatsDesc": Nullable {
opt: Bytes(
1550000,
),
values: Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
1771432,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "left",
data: Void,
},
ArrayEnumVariant {
ident: "right",
data: Void,
},
],
},
},
"hitDataExitVelocity": Void,
"fieldedById": Nullable {
opt: Bytes(
1550000,
),
values: Integer(
ArrayInteger {
bytes: Bytes(
35991429,
),
encoding: PrefixVarInt,
},
),
},
"fielderBirthCountry": Nullable {
opt: Bytes(
1550000,
),
values: String(
Bytes(
81080372,
),
),
},
"venueRightCenter": Nullable {
opt: Bytes(
1550000,
),
values: Integer(
ArrayInteger {
bytes: Bytes(
3655860,
),
encoding: Simple16,
},
),
},
"pitcherThrowsDesc": Nullable {
opt: Bytes(
1550000,
),
values: Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
1771432,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "left",
data: Void,
},
ArrayEnumVariant {
ident: "right",
data: Void,
},
],
},
},
"venueSurface": Nullable {
opt: Bytes(
1550000,
),
values: Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
1771432,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "grass",
data: Void,
},
ArrayEnumVariant {
ident: "artificial",
data: Void,
},
],
},
},
"fielderDraftPickNumber": Nullable {
opt: Bytes(
1550000,
),
values: Integer(
ArrayInteger {
bytes: Bytes(
13676816,
),
encoding: Simple16,
},
),
},
"fielderThrowsCode": Nullable {
opt: Bytes(
1550000,
),
values: Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
1771428,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "r",
data: Void,
},
ArrayEnumVariant {
ident: "l",
data: Void,
},
],
},
},
"pitcherName": String(
Bytes(
169118682,
),
),
"sportCode": String(
Bytes(
49600000,
),
),
"sportName": String(
Bytes(
113485368,
),
),
"parentTeamId": Integer(
ArrayInteger {
bytes: Bytes(
13837484,
),
encoding: Simple16,
},
),
"pitcher": Integer(
ArrayInteger {
bytes: Bytes(
37200000,
),
encoding: PrefixVarInt,
},
),
"hitDataCalcDistance": Nullable {
opt: Bytes(
1550000,
),
values: Float(
DoubleGorilla(
Bytes(
7561831,
),
),
),
},
"venueName": String(
Bytes(
236053544,
),
),
"parentTeamName": String(
Bytes(
203484436,
),
),
"venueRightLine": Nullable {
opt: Bytes(
1550000,
),
values: Integer(
ArrayInteger {
bytes: Bytes(
15794304,
),
encoding: Simple16,
},
),
},
"hitDataSprayAngle": Nullable {
opt: Bytes(
1550000,
),
values: Float(
DoubleGorilla(
Bytes(
7804952,
),
),
),
},
"venueRetrosheetId": String(
Bytes(
25633194,
),
),
"outsEnd": Integer(
ArrayInteger {
bytes: Bytes(
3171240,
),
encoding: Simple16,
},
),
"fielderWeight": Nullable {
opt: Bytes(
1550000,
),
values: Integer(
ArrayInteger {
bytes: Bytes(
16532812,
),
encoding: Simple16,
},
),
},
"fielderName": String(
Bytes(
169622188,
),
),
"fielderCollegeName": Nullable {
opt: Bytes(
1550000,
),
values: String(
Bytes(
94095404,
),
),
},
"batterBats": Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
1771432,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "l",
data: Void,
},
ArrayEnumVariant {
ident: "r",
data: Void,
},
],
},
"fieldedByPos": Nullable {
opt: Bytes(
1550000,
),
values: Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
4796112,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "secondBase",
data: Void,
},
ArrayEnumVariant {
ident: "centerField",
data: Void,
},
ArrayEnumVariant {
ident: "thirdBase",
data: Void,
},
ArrayEnumVariant {
ident: "rightField",
data: Void,
},
ArrayEnumVariant {
ident: "leftField",
data: Void,
},
ArrayEnumVariant {
ident: "shortStop",
data: Void,
},
ArrayEnumVariant {
ident: "firstBase",
data: Void,
},
ArrayEnumVariant {
ident: "catcher",
data: Void,
},
ArrayEnumVariant {
ident: "pitcher",
data: Void,
},
],
},
},
"gameDate": String(
Bytes(
120400104,
),
),
"sportAbbr": String(
Bytes(
39705403,
),
),
"baseValueStart": Integer(
ArrayInteger {
bytes: Bytes(
2935120,
),
encoding: Simple16,
},
),
"hitDataTrajectory": Nullable {
opt: Bytes(
1550000,
),
values: Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
2483040,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "groundBall",
data: Void,
},
ArrayEnumVariant {
ident: "flyBall",
data: Void,
},
ArrayEnumVariant {
ident: "lineDrive",
data: Void,
},
ArrayEnumVariant {
ident: "popUp",
data: Void,
},
ArrayEnumVariant {
ident: "unknown",
data: Void,
},
],
},
},
"fielderMlbDebut": String(
Bytes(
66528692,
),
),
"leagueName": Nullable {
opt: Bytes(
1550000,
),
values: String(
Bytes(
220077042,
),
),
},
"sportLevelOfPlay": Integer(
ArrayInteger {
bytes: Bytes(
3355136,
),
encoding: Simple16,
},
),
"hitDataTotalDistance": Void,
"fieldedByName": String(
Bytes(
164744437,
),
),
"fielderHeightStr": Nullable {
opt: Bytes(
1550000,
),
values: String(
Bytes(
76926492,
),
),
},
"venueRight": Nullable {
opt: Bytes(
1550000,
),
values: Integer(
ArrayInteger {
bytes: Bytes(
1410348,
),
encoding: Simple16,
},
),
},
"hitDataLaunchAngle": Void,
"ballsStart": Integer(
ArrayInteger {
bytes: Bytes(
2064116,
),
encoding: Simple16,
},
),
"hitDataContactQuality": Nullable {
opt: Bytes(
1550000,
),
values: Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
1794092,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "medium",
data: Void,
},
ArrayEnumVariant {
ident: "soft",
data: Void,
},
ArrayEnumVariant {
ident: "hard",
data: Void,
},
],
},
},
"pitcherThrows": Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
1771432,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "l",
data: Void,
},
ArrayEnumVariant {
ident: "r",
data: Void,
},
],
},
"strikesStart": Integer(
ArrayInteger {
bytes: Bytes(
2074300,
),
encoding: Simple16,
},
),
"fielder": Integer(
ArrayInteger {
bytes: Bytes(
37200000,
),
encoding: PrefixVarInt,
},
),
"baseValueEnd": Integer(
ArrayInteger {
bytes: Bytes(
3177616,
),
encoding: Simple16,
},
),
"venueRoof": Nullable {
opt: Bytes(
1550000,
),
values: Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
1170912,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "open",
data: Void,
},
ArrayEnumVariant {
ident: "retractable",
data: Void,
},
ArrayEnumVariant {
ident: "dome",
data: Void,
},
],
},
},
"venueLeftLine": Nullable {
opt: Bytes(
1550000,
),
values: Integer(
ArrayInteger {
bytes: Bytes(
15794304,
),
encoding: Simple16,
},
),
},
"venueLeft": Nullable {
opt: Bytes(
1550000,
),
values: Integer(
ArrayInteger {
bytes: Bytes(
1742496,
),
encoding: Simple16,
},
),
},
"sportAffilliation": Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
1773876,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "mlb",
data: Void,
},
ArrayEnumVariant {
ident: "minors",
data: Void,
},
ArrayEnumVariant {
ident: "unaffiliated",
data: Void,
},
],
},
"venueCenter": Nullable {
opt: Bytes(
1550000,
),
values: Integer(
ArrayInteger {
bytes: Bytes(
15700728,
),
encoding: Simple16,
},
),
},
"venueCity": String(
Bytes(
118474096,
),
),
"fielderDob": String(
Bytes(
123625176,
),
),
"doublePlayOpp": Boolean(
Bytes(
1550000,
),
),
"batterName": String(
Bytes(
169605188,
),
),
"venueCapacity": Nullable {
opt: Bytes(
1550000,
),
values: Integer(
ArrayInteger {
bytes: Bytes(
26901095,
),
encoding: PrefixVarInt,
},
),
},
"sportId": Integer(
ArrayInteger {
bytes: Bytes(
6263900,
),
encoding: Simple16,
},
),
"inPlayResult": Nullable {
opt: Bytes(
1550000,
),
values: Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
4651356,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "groundOut",
data: Void,
},
ArrayEnumVariant {
ident: "single",
data: Void,
},
ArrayEnumVariant {
ident: "flyOut",
data: Void,
},
ArrayEnumVariant {
ident: "forceOut",
data: Void,
},
ArrayEnumVariant {
ident: "double",
data: Void,
},
ArrayEnumVariant {
ident: "sacFly",
data: Void,
},
ArrayEnumVariant {
ident: "fieldError",
data: Void,
},
ArrayEnumVariant {
ident: "doublePlay",
data: Void,
},
ArrayEnumVariant {
ident: "popOut",
data: Void,
},
ArrayEnumVariant {
ident: "lineOut",
data: Void,
},
ArrayEnumVariant {
ident: "homeRun",
data: Void,
},
ArrayEnumVariant {
ident: "triple",
data: Void,
},
ArrayEnumVariant {
ident: "buntPopOut",
data: Void,
},
ArrayEnumVariant {
ident: "sacBunt",
data: Void,
},
ArrayEnumVariant {
ident: "batterInterference",
data: Void,
},
ArrayEnumVariant {
ident: "fieldersChoice",
data: Void,
},
ArrayEnumVariant {
ident: "buntGroundOut",
data: Void,
},
ArrayEnumVariant {
ident: "fanInterference",
data: Void,
},
ArrayEnumVariant {
ident: "triplePlay",
data: Void,
},
ArrayEnumVariant {
ident: "sacFlyDoublePlay",
data: Void,
},
ArrayEnumVariant {
ident: "other",
data: Void,
},
ArrayEnumVariant {
ident: "strikeOut",
data: Void,
},
ArrayEnumVariant {
ident: "pitchingSubstitution",
data: Void,
},
ArrayEnumVariant {
ident: "walk",
data: Void,
},
ArrayEnumVariant {
ident: "catcherInterference",
data: Void,
},
ArrayEnumVariant {
ident: "hitByPitch",
data: Void,
},
ArrayEnumVariant {
ident: "intentionalWalk",
data: Void,
},
],
},
},
"outsStart": Integer(
ArrayInteger {
bytes: Bytes(
2696628,
),
encoding: Simple16,
},
),
"position": Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
5511164,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "catcher",
data: Void,
},
ArrayEnumVariant {
ident: "firstBase",
data: Void,
},
ArrayEnumVariant {
ident: "secondBase",
data: Void,
},
ArrayEnumVariant {
ident: "thirdBase",
data: Void,
},
ArrayEnumVariant {
ident: "shortStop",
data: Void,
},
ArrayEnumVariant {
ident: "leftField",
data: Void,
},
ArrayEnumVariant {
ident: "rightField",
data: Void,
},
ArrayEnumVariant {
ident: "centerField",
data: Void,
},
ArrayEnumVariant {
ident: "pitcher",
data: Void,
},
],
},
"teamName": String(
Bytes(
225266477,
),
),
"fielderThrowsDesc": Nullable {
opt: Bytes(
1550000,
),
values: Enum {
discriminants: Integer(
ArrayInteger {
bytes: Bytes(
1771428,
),
encoding: Simple16,
},
),
variants: [
ArrayEnumVariant {
ident: "right",
data: Void,
},
ArrayEnumVariant {
ident: "left",
data: Void,
},
],
},
},
},
},
}
This is great progress!
Yes, you are right that the compression (and everything else) is WIP. There are all kinds of possible improvements - but one of the principles of Tree-Buf is that its design is data-driven. This data identifies what compression features will give the biggest bang for the buck.
A few points stand out:
DoubleGorilla compression (floats). You can likely drop this by half with lossy float compression enabled. But the gains would not be significant until improvements to more dominant fields are made.
Simple16 encoding. These could be lowered further to bool for a better encoding. Not a huge win, but it should save 12% at a minimum for those fields.
gameDate should not be stored as a String. Tree-Buf doesn't support any native date type yet, but this is a common need. I'm not sure yet if I want to have a particular date encoding or implement tags on top of existing encodings.
There's probably a bunch more insight here, but this is enough to be busy for a while and re-evaluate after these are implemented. I'm in the middle of writing the Gorilla encoder from scratch to be a lot faster. Next I will spin up issues for all of these.
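The 12% figure can be checked against the diagnostics earlier in the thread: a two-variant column packed at one bit per record needs 12,400,000 / 8 = 1,550,000 bytes, versus the 1,771,432 bytes reported for Simple16 discriminants such as `batterBats`. The sketch below shows that arithmetic together with a simple bit-packing routine. `pack_bools` and `percent_saved` are hypothetical helpers for illustration, not Tree-Buf's actual encoder, and the real bit order may differ.

```rust
/// Packs booleans eight per byte, least significant bit first.
/// Illustrative only; not Tree-Buf's actual encoder.
pub fn pack_bools(values: &[bool]) -> Vec<u8> {
    let mut bytes = vec![0u8; (values.len() + 7) / 8];
    for (i, &value) in values.iter().enumerate() {
        if value {
            bytes[i / 8] |= 1 << (i % 8);
        }
    }
    bytes
}

/// Percentage saved when going from `before` bytes to `after` bytes.
pub fn percent_saved(before: u64, after: u64) -> f64 {
    (before as f64 - after as f64) / before as f64 * 100.0
}
```

Calling `percent_saved(1_771_432, 1_550_000)` gives roughly 12.5, which lines up with the estimate of at least 12% for those fields.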
Added:
Excited to see the results once dictionary compression is in, that should provide huge wins for this data set which is highly repetitive (mostly names). Honestly, just being able to slap a few derives and then read/write is incredible ergonomically and opens up a lot of possibilities for me.
I'll see if I can get the main data set (much wider and bulkier, less repetitive) to work as well. Do you want me to post the diagnostics in #8?
Let's wait until the new size diagnostics are available, then post a sample of the main data set in this issue. It's easier to track the BOSS story here since that issue is going to be closed.
We now have dictionary compression and the new size diagnostics API on master.
You can now use:
let sizes = tree_buf::experimental::stats::size_breakdown(&tb_bytes);
println!("{}", sizes.unwrap());
And it will print something like...
Largest by path:
32000 U8 Fixed data.orders.id
5000 Prefix Varint data.orders.price
5000 Prefix Varint data.orders.createdAt
2836 UTF-8 data.orders.nft.wearable.representationId.values
2452 UTF-8 data.orders.nft.wearable.name.values
1014 Prefix Varint data.orders.nft.wearable.representationId.indices
1013 Prefix Varint data.orders.nft.wearable.name.indices
1000 Prefix Varint data.orders.nft.wearable.collection.indices
420 Simple16 data.orders.nft.wearable.category
288 Simple16 data.orders.nft.wearable.rarity
288 Simple16 data.orders.nft.wearable.bodyShapes.len
272 Simple16 data.orders.nft.wearable.bodyShapes.values
268 Simple16 data.orders.status
85 UTF-8 data.orders.nft.wearable.collection.values
0 Simple16 data.orders.nft.wearable.owner.mana
Largest by type:
1x 32000 @ U8 Fixed
5x 13027 @ Prefix Varint
3x 5373 @ UTF-8
6x 1536 @ Simple16
Other: 403
Total: 52339
I expect that the dictionary compression, while not yet perfect, will still be a huge reduction in the size of the file.
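To make the `.values` / `.indices` split in that output concrete, here is a minimal sketch of dictionary encoding: each distinct string is stored once, and every row keeps only an index into that list. The `dictionary_encode` and `dictionary_decode` functions are hypothetical illustrations, not Tree-Buf's implementation.

```rust
use std::collections::HashMap;

/// Splits a column of strings into (distinct values, per-row indices).
/// Illustrative only; not Tree-Buf's actual encoder.
pub fn dictionary_encode(column: &[&str]) -> (Vec<String>, Vec<usize>) {
    let mut values: Vec<String> = Vec::new();
    let mut lookup: HashMap<&str, usize> = HashMap::new();
    let mut indices = Vec::with_capacity(column.len());
    for &item in column {
        // Reuse the existing index, or append the string to the dictionary.
        let index = *lookup.entry(item).or_insert_with(|| {
            values.push(item.to_string());
            values.len() - 1
        });
        indices.push(index);
    }
    (values, indices)
}

/// Rebuilds the original column from the dictionary and the indices.
pub fn dictionary_decode(values: &[String], indices: &[usize]) -> Vec<String> {
    indices.iter().map(|&i| values[i].clone()).collect()
}
```

For a column dominated by a small set of repeated names, the dictionary itself stays tiny and the cost moves almost entirely into the index column, which can then be compressed further as plain integers.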
You work fast!
I get the following error when compiling against #217a8b22:
error[E0599]: no associated item named `MAX` found for type `usize` in the current scope
--> C:\Users\Eli\.cargo\git\checkouts\tree-buf-402f6dec423c055a\217a8b2\tree-buf\src\experimental\stats.rs:53:40
|
53 | by_type.sort_by_key(|i| usize::MAX - i.1.size);
| ^^^ associated item not found in `usize`
|
help: you are looking for the module in `std`, not the primitive type
|
53 | by_type.sort_by_key(|i| std::usize::MAX - i.1.size);
| ^^^^^^^^^^^^^^^
error: aborting due to 2 previous errors
Run rustup update. The value usize::MAX was made available as of Rust version 1.43.0.
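For anyone who cannot upgrade right away, the compiler's suggestion (`std::usize::MAX`) works on older toolchains. A descending sort can also be written without the subtraction by using `std::cmp::Reverse`, as in the sketch below. The `TypeStat` struct is a hypothetical stand-in for the entries sorted in `stats.rs`, whose real types are not shown in this thread.

```rust
use std::cmp::Reverse;

/// Hypothetical stand-in for one "Largest by type" entry.
#[derive(Debug, Clone, PartialEq)]
pub struct TypeStat {
    pub name: &'static str,
    pub size: usize,
}

/// Sorts entries from largest to smallest without needing `usize::MAX`.
pub fn sort_largest_first(stats: &mut [TypeStat]) {
    stats.sort_by_key(|stat| Reverse(stat.size));
}
```

`Reverse` simply flips the ordering of the key, so this behaves the same as sorting by `usize::MAX - size` while compiling on both older and newer toolchains.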
Must have missed that update. We're down to 735MB from the 2.7GB version tested last time (CSV is about 4.7GB). This is very close to the level of compression Tableau got. Diagnostics in next comment.
Largest by path:
37200000 Prefix Varint fielder
37200000 Prefix Varint batter
37200000 Prefix Varint pitcher
35991429 Prefix Varint fieldedById.values
26901095 Prefix Varint venueCapacity.values
24242995 Prefix Varint fielderName.indices
24139292 Prefix Varint pitcherName.indices
24079353 Prefix Varint fielderDob.indices
24017288 Prefix Varint batterName.indices
23869379 Prefix Varint fielderMlbDebut.indices
23597718 Prefix Varint fieldedByName.indices
22177800 Simple16 venueId
19310669 Prefix Varint gameDate.indices
18719224 Simple16 teamId
16532812 Simple16 fielderWeight.values
16158292 Prefix Varint teamName.indices
16082701 Prefix Varint venueName.indices
15794304 Simple16 venueLeftLine.values
15794304 Simple16 venueRightLine.values
15709408 Prefix Varint venueCity.indices
15700728 Simple16 venueCenter.values
13837484 Simple16 parentTeamId
13676816 Simple16 fielderDraftPickNumber.values
12400000 Prefix Varint sportName.indices
12400000 Simple16 fielderHeightIn
12400000 Prefix Varint sportCode.indices
12400000 Prefix Varint sportAbbr.indices
12400000 Prefix Varint leagueName.values.indices
12399166 Prefix Varint fielderBirthCountry.values.indices
12398834 Prefix Varint fielderHeightStr.values.indices
12398748 Prefix Varint parentTeamName.indices
12366879 Prefix Varint venueRetrosheetId.indices
9916026 Prefix Varint fielderCollegeName.values.indices
7804952 Gorilla hitDataSprayAngle.values
7561831 Gorilla hitDataCalcDistance.values
6263900 Simple16 sportId
5511164 Simple16 position
4796112 Simple16 fieldedByPos.values
4651356 Simple16 inPlayResult.values
3655860 Simple16 venueLeftCenter.values
3655860 Simple16 venueRightCenter.values
3355136 Simple16 sportLevelOfPlay
3177616 Simple16 baseValueEnd
3171240 Simple16 outsEnd
2935120 Simple16 baseValueStart
2696628 Simple16 outsStart
2483040 Simple16 hitDataTrajectory.values
2074300 Simple16 strikesStart
2064116 Simple16 ballsStart
1877132 Simple16 runs
1794092 Simple16 hitDataContactQuality.values
1773876 Simple16 sportAffilliation
1771432 Simple16 batterBats
1771432 Simple16 pitcherThrowsDesc.values
1771432 Simple16 batterBatsDesc.values
1771432 Simple16 venueSurface.values
1771432 Simple16 pitcherThrows
1771428 Simple16 fielderThrowsDesc.values
1771428 Simple16 fielderThrowsCode.values
1742496 Simple16 venueLeft.values
1550000 Packed Boolean fielderDraftPickNumber
1550000 Packed Boolean fielderCollegeName
1550000 Packed Boolean leagueName
1550000 Packed Boolean fielderBirthCountry
1550000 Packed Boolean fielderThrowsDesc
1550000 Packed Boolean fielderHeightStr
1550000 Packed Boolean venueSurface
1550000 Packed Boolean hitDataContactQuality
1550000 Packed Boolean batterBatsDesc
1550000 Packed Boolean doublePlayOpp
1550000 Packed Boolean venueRoof
1550000 Packed Boolean hitDataTrajectory
1550000 Packed Boolean venueLeftCenter
1550000 Packed Boolean pitcherThrowsDesc
1550000 Packed Boolean fielderWeight
1550000 Packed Boolean venueLeft
1550000 Packed Boolean venueCenter
1550000 Packed Boolean hitDataSprayAngle
1550000 Packed Boolean venueRight
1550000 Packed Boolean fieldedById
1550000 Packed Boolean hitDataCalcDistance
1550000 Packed Boolean fielderThrowsCode
1550000 Packed Boolean venueCapacity
1550000 Packed Boolean venueRightCenter
1550000 Packed Boolean venueRightLine
1550000 Packed Boolean venueLeftLine
1550000 Packed Boolean fieldedByPos
1550000 Packed Boolean inPlayResult
1410348 Simple16 venueRight.values
1170912 Simple16 venueRoof.values
116871 UTF-8 fielderName.values
112456 UTF-8 fieldedByName.values
70674 UTF-8 batterName.values
63152 UTF-8 pitcherName.values
42997 UTF-8 fielderDob.values
16604 UTF-8 fielderMlbDebut.values
13779 UTF-8 fielderCollegeName.values.values
4517 UTF-8 teamName.values
4495 UTF-8 venueName.values
4297 UTF-8 gameDate.values
2003 UTF-8 venueCity.values
725 UTF-8 parentTeamName.values
398 UTF-8 leagueName.values.values
343 UTF-8 fielderBirthCountry.values.values
217 UTF-8 venueRetrosheetId.values
148 UTF-8 fielderHeightStr.values.values
113 UTF-8 sportName.values
36 UTF-8 sportCode.values
29 UTF-8 sportAbbr.values
0 Simple16 gameType
Largest by type:
24x 494779272 @ Prefix Varint
37x 217293792 @ Simple16
28x 43400000 @ Packed Boolean
2x 15366783 @ Gorilla
19x 453854 @ UTF-8
Other: 2227
Total: 771295928
Pushed a couple of improvements:
bool, integers, and String
I don't expect these to move the needle too much but it should help. I've taken on a bit too much technical debt. Adding other features is running into confusing problems and hacks (actually these too). So, this will be a good time to checkpoint and let you know this is as good as it's going to get for a short while until I pay off some of that debt. Shouldn't take too long.
It goes without saying, but please don't feel any pressure on account of me. Take your time and enjoy the process.
I think https://github.com/That3Percent/tree-buf/commit/1afd77827bb74b4ae20c7c13c154037fda796a67 introduced a bug. File de-compressed fine for https://github.com/That3Percent/tree-buf/commit/217a8b229f130b70f27a79975389c9d19d9cd186 but throws the following error for all revisions after that:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidFormat'
Edit: realized it might be helpful to have my code in here:
println!("Converting to treebuf...");
let bytes = write(&defense_data);
println!("Checking file integrity...");
let copy: Vec<boss::defense::Defense> = read(&bytes).unwrap();
assert_eq!(&copy, &defense_data);
Thanks for understanding! I've been setting up this "restaurant" for 6 months - aging the spices and slow cooking the sauces. You're my first customer so I want to make sure you are happy!
Pretty sure I fixed the problem, and there are some new compression features for you as well:
Still going to need to do some cleanup on my end soon because it's a hack that fixes the problem.
At this point, it's already passed the "MVP" stage for what I need, so anything else is (slow cooked) gravy. I know how important having a good test data set is, especially one that hasn't been developed against, so this is my way of contributing.
Updated diagnostics to follow, looks like RLE had a rather large benefit in my data set, but that is likely due to the very repetitive nature of my data (doubt other sets will be like this). When I get around to getting the main data set to work, I'll post that.
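For readers unfamiliar with RLE, here is a minimal sketch of run-length encoding that splits a column into one list of run values and one list of run lengths. This presumably corresponds to the `.values` and `.runs` entries in the output that follows, but how Tree-Buf lays those columns out internally is not shown in this thread, so that correspondence is an assumption. `rle_encode` and `rle_decode` are hypothetical helpers.

```rust
/// Run-length encodes a column into (run values, run lengths).
/// Illustrative only; not Tree-Buf's actual encoder.
pub fn rle_encode<T: PartialEq + Clone>(column: &[T]) -> (Vec<T>, Vec<usize>) {
    let mut values: Vec<T> = Vec::new();
    let mut runs: Vec<usize> = Vec::new();
    for item in column {
        if values.last() == Some(item) {
            // Same value as the current run: extend it.
            if let Some(run) = runs.last_mut() {
                *run += 1;
            }
        } else {
            // A new value starts a new run of length one.
            values.push(item.clone());
            runs.push(1);
        }
    }
    (values, runs)
}

/// Expands (run values, run lengths) back into the original column.
pub fn rle_decode<T: Clone>(values: &[T], runs: &[usize]) -> Vec<T> {
    values
        .iter()
        .zip(runs)
        .flat_map(|(value, &len)| std::iter::repeat(value.clone()).take(len))
        .collect()
}
```

Data sorted by game, where the same pitcher or venue repeats across many consecutive rows, produces long runs, which is consistent with the large reductions seen for columns like those.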
Same sample size as above:
Largest by path:
37200000
[12400000].fielder
Array.Object.Prefix Varint
23841192
[12400000].fielderName.indices
Array.Object.Dictionary.Simple16
23704360
[12400000].fielderDob.indices
Array.Object.Dictionary.Simple16
23389660
[12400000].fielderMlbDebut.indices
Array.Object.Dictionary.Simple16
16532812
[12400000].fielderWeight.values
Array.Object.Nullable.Simple16
13676816
[12400000].fielderDraftPickNumber.values
Array.Object.Nullable.Simple16
12400000
[12400000].fielderHeightIn
Array.Object.Simple16
8845128
[12400000].fielderCollegeName.values.indices
Array.Object.Nullable.Dictionary.Simple16
7804952
[12400000].hitDataSprayAngle.values
Array.Object.Nullable.Gorilla
7561831
[12400000].hitDataCalcDistance.values
Array.Object.Nullable.Gorilla
6510716
[12400000].fielderHeightStr.values.indices
Array.Object.Nullable.Dictionary.Simple16
5540052
[12400000].fielderBirthCountry.values.indices
Array.Object.Nullable.Dictionary.Simple16
5511164
[12400000].position.discriminants
Array.Object.Enum.Simple16
4131528
[12400000].batter.values
Array.Object.RLE.Prefix Varint
3615321
[12400000].fieldedById.values.values
Array.Object.Nullable.RLE.Prefix Varint
2606724
[12400000].batterName.indices.values
Array.Object.Dictionary.RLE.Simple16
2383628
[12400000].fieldedByName.indices.values
Array.Object.Dictionary.RLE.Simple16
1550000
[12400000].fielderThrowsCode.values.discriminants
Array.Object.Nullable.Enum.Packed Boolean
1550000
[12400000].fielderThrowsDesc.values.discriminants
Array.Object.Nullable.Enum.Packed Boolean
1550000
[12400000].fielderDraftPickNumber.opt
Array.Object.Nullable.Packed Boolean
1550000
[12400000].fielderCollegeName.opt
Array.Object.Nullable.Packed Boolean
1395120
[12400000].pitcher.values
Array.Object.RLE.Prefix Varint
877604
[12400000].pitcherName.indices.values
Array.Object.Dictionary.RLE.Simple16
708872
[12400000].teamId.values
Array.Object.RLE.Simple16
662888
[12400000].batterBatsDesc.values.discriminants.runs
Array.Object.Nullable.Enum.Bool RLE.Prefix Varint
662888
[12400000].batterBats.discriminants.runs
Array.Object.Enum.Bool RLE.Prefix Varint
595724
[12400000].inPlayResult.values.discriminants.values
Array.Object.Nullable.Enum.RLE.Simple16
582020
[12400000].baseValueEnd.runs
Array.Object.RLE.Simple16
544268
[12400000].fieldedByPos.values.discriminants.values
Array.Object.Nullable.Enum.RLE.Simple16
512636
[12400000].parentTeamId.values
Array.Object.RLE.Simple16
454928
[12400000].teamName.indices.values
Array.Object.Dictionary.RLE.Simple16
406054
[12400000].doublePlayOpp.runs
Array.Object.Bool RLE.Prefix Varint
395168
[12400000].pitcherName.indices.runs
Array.Object.Dictionary.RLE.Simple16
395168
[12400000].pitcher.runs
Array.Object.RLE.Simple16
385184
[12400000].venueName.values
Array.Object.RLE.UTF-8
372556
[12400000].teamName.indices.runs
Array.Object.Dictionary.RLE.Simple16
372556
[12400000].teamId.runs
Array.Object.RLE.Simple16
366144
[12400000].hitDataTrajectory.values.discriminants.runs.values
Array.Object.Nullable.Enum.RLE.RLE.Simple16
359092
[12400000].outsStart.runs.values
Array.Object.RLE.RLE.Simple16
356528
[12400000].baseValueStart.runs.values
Array.Object.RLE.RLE.Simple16
349784
[12400000].parentTeamName.indices.runs
Array.Object.Dictionary.RLE.Simple16
349784
[12400000].parentTeamId.runs
Array.Object.RLE.Simple16
341588
[12400000].baseValueStart.values
Array.Object.RLE.Simple16
333332
[12400000].baseValueEnd.values
Array.Object.RLE.Simple16
315972
[12400000].outsEnd.runs.values
Array.Object.RLE.RLE.Simple16
305056
[12400000].runs.runs
Array.Object.RLE.Simple16
300360
[12400000].outsEnd.values
Array.Object.RLE.Simple16
292776
[12400000].parentTeamName.indices.values
Array.Object.Dictionary.RLE.Simple16
271748
[12400000].outsStart.values
Array.Object.RLE.Simple16
270036
[12400000].inPlayResult.values.discriminants.runs.values
Array.Object.Nullable.Enum.RLE.RLE.Simple16
249124
[12400000].hitDataTrajectory.values.discriminants.values
Array.Object.Nullable.Enum.RLE.Simple16
221728
[12400000].fieldedByPos.values.discriminants.runs.values
Array.Object.Nullable.Enum.RLE.RLE.Simple16
205095
[12400000].pitcherThrows.discriminants.runs
Array.Object.Enum.Bool RLE.Prefix Varint
205095
[12400000].pitcherThrowsDesc.values.discriminants.runs
Array.Object.Nullable.Enum.Bool RLE.Prefix Varint
204077
[12400000].leagueName.values.values
Array.Object.Nullable.RLE.UTF-8
191113
[12400000].venueCity.values
Array.Object.RLE.UTF-8
186684
[12400000].outsStart.runs.runs
Array.Object.RLE.RLE.Simple16
185620
[12400000].outsEnd.runs.runs
Array.Object.RLE.RLE.Simple16
178216
[12400000].hitDataTrajectory.values.discriminants.runs.runs
Array.Object.Nullable.Enum.RLE.RLE.Simple16
172820
[12400000].inPlayResult.values.discriminants.runs.runs
Array.Object.Nullable.Enum.RLE.RLE.Simple16
169568
[12400000].fieldedById.values.runs.values
Array.Object.Nullable.RLE.RLE.Simple16
168620
[12400000].fieldedByName.indices.runs.values
Array.Object.Dictionary.RLE.RLE.Simple16
164660
[12400000].baseValueStart.runs.runs
Array.Object.RLE.RLE.Simple16
163968
[12400000].ballsStart.runs.values
Array.Object.RLE.RLE.Simple16
153476
[12400000].fieldedByPos.values.discriminants.runs.runs
Array.Object.Nullable.Enum.RLE.RLE.Simple16
148692
[12400000].strikesStart.runs.values
Array.Object.RLE.RLE.Simple16
131372
[12400000].fieldedByName.indices.runs.runs
Array.Object.Dictionary.RLE.RLE.Simple16
130288
[12400000].fieldedById.values.runs.runs
Array.Object.Nullable.RLE.RLE.Simple16
116871
[12400000].fielderName.values
Array.Object.Dictionary.UTF-8
112456
[12400000].fieldedByName.values
Array.Object.Dictionary.UTF-8
111558
[12400000].fieldedByPos.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
111558
[12400000].fieldedById.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
90788
[12400000].ballsStart.values
Array.Object.RLE.Simple16
89565
[12400000].sportName.values
Array.Object.RLE.UTF-8
81796
[12400000].strikesStart.values
Array.Object.RLE.Simple16
77612
[12400000].runs.values
Array.Object.RLE.Simple16
74787
[12400000].gameDate.values
Array.Object.RLE.UTF-8
70674
[12400000].batterName.values
Array.Object.Dictionary.UTF-8
63152
[12400000].pitcherName.values
Array.Object.Dictionary.UTF-8
62480
[12400000].ballsStart.runs.runs
Array.Object.RLE.RLE.Simple16
59728
[12400000].hitDataContactQuality.values.discriminants.runs
Array.Object.Nullable.Enum.RLE.Simple16
58296
[12400000].strikesStart.runs.runs
Array.Object.RLE.RLE.Simple16
44140
[12400000].venueCapacity.values.values
Array.Object.Nullable.RLE.Prefix Varint
42997
[12400000].fielderDob.values
Array.Object.Dictionary.UTF-8
36689
[12400000].venueId.values
Array.Object.RLE.Prefix Varint
34740
[12400000].venueId.runs
Array.Object.RLE.Simple16
34720
[12400000].venueName.runs
Array.Object.RLE.Simple16
34708
[12400000].venueCity.runs
Array.Object.RLE.Simple16
33628
[12400000].venueCapacity.values.runs
Array.Object.Nullable.RLE.Simple16
32408
[12400000].sportCode.values
Array.Object.RLE.UTF-8
30980
[12400000].venueLeftLine.values.runs
Array.Object.Nullable.RLE.Simple16
30697
[12400000].venueRetrosheetId.values
Array.Object.RLE.UTF-8
30668
[12400000].venueRightLine.values.runs
Array.Object.Nullable.RLE.Simple16
29128
[12400000].venueCenter.values.runs
Array.Object.Nullable.RLE.Simple16
24186
[12400000].sportAbbr.values
Array.Object.RLE.UTF-8
23140
[12400000].venueLeftLine.values.values
Array.Object.Nullable.RLE.Simple16
22920
[12400000].venueRightLine.values.values
Array.Object.Nullable.RLE.Simple16
21856
[12400000].leagueName.values.runs
Array.Object.Nullable.RLE.Simple16
21312
[12400000].venueCenter.values.values
Array.Object.Nullable.RLE.Simple16
16604
[12400000].fielderMlbDebut.values
Array.Object.Dictionary.UTF-8
15320
[12400000].sportAbbr.runs
Array.Object.RLE.Simple16
15320
[12400000].sportId.runs
Array.Object.RLE.Simple16
15320
[12400000].sportCode.runs
Array.Object.RLE.Simple16
15320
[12400000].sportLevelOfPlay.runs
Array.Object.RLE.Simple16
15320
[12400000].sportName.runs
Array.Object.RLE.Simple16
14388
[12400000].gameDate.runs
Array.Object.RLE.Simple16
13779
[12400000].fielderCollegeName.values.values
Array.Object.Nullable.Dictionary.UTF-8
13312
[12400000].hitDataContactQuality.values.discriminants.values
Array.Object.Nullable.Enum.RLE.Simple16
10008
[12400000].venueRetrosheetId.runs
Array.Object.RLE.Simple16
8708
[12400000].venueRightCenter.values.runs
Array.Object.Nullable.RLE.Simple16
8564
[12400000].venueLeftCenter.values.runs
Array.Object.Nullable.RLE.Simple16
6992
[12400000].venueLeft.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
6416
[12400000].venueRightCenter.values.values
Array.Object.Nullable.RLE.Simple16
6272
[12400000].venueLeftCenter.values.values
Array.Object.Nullable.RLE.Simple16
6267
[12400000].venueRight.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
4956
[12400000].sportId.values
Array.Object.RLE.Simple16
4517
[12400000].teamName.values
Array.Object.Dictionary.UTF-8
3892
[12400000].venueRoof.values.discriminants.runs
Array.Object.Nullable.Enum.RLE.Simple16
3880
[12400000].venueLeft.values.runs
Array.Object.Nullable.RLE.Simple16
3520
[12400000].sportLevelOfPlay.values
Array.Object.RLE.Simple16
3157
[12400000].venueCenter.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
3096
[12400000].venueRight.values.runs
Array.Object.Nullable.RLE.Simple16
2898
[12400000].venueRightLine.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
2898
[12400000].venueLeftLine.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
2816
[12400000].venueLeft.values.values
Array.Object.Nullable.RLE.Simple16
2676
[12400000].venueSurface.values.discriminants.runs
Array.Object.Nullable.Enum.Bool RLE.Prefix Varint
2435
[12400000].fielderHeightStr.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
2344
[12400000].venueRoof.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
2329
[12400000].hitDataContactQuality.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
2268
[12400000].venueRight.values.values
Array.Object.Nullable.RLE.Simple16
2203
[12400000].hitDataSprayAngle.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
2203
[12400000].hitDataTrajectory.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
2203
[12400000].hitDataCalcDistance.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
1859
[12400000].batterName.indices.runs.runs
Array.Object.Dictionary.RLE.RLE.Prefix Varint
1820
[12400000].batter.runs.runs
Array.Object.RLE.RLE.Prefix Varint
1809
[12400000].fielderBirthCountry.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
1558
[12400000].venueCapacity.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
1481
[12400000].venueRightCenter.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
1481
[12400000].venueLeftCenter.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
1020
[12400000].batterName.indices.runs.values
Array.Object.Dictionary.RLE.RLE.Simple16
996
[12400000].batter.runs.values
Array.Object.RLE.RLE.Simple16
866
[12400000].fielderWeight.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
725
[12400000].parentTeamName.values
Array.Object.Dictionary.UTF-8
520
[12400000].venueRoof.values.discriminants.values
Array.Object.Nullable.Enum.RLE.Simple16
343
[12400000].fielderBirthCountry.values.values
Array.Object.Nullable.Dictionary.UTF-8
148
[12400000].fielderHeightStr.values.values
Array.Object.Nullable.Dictionary.UTF-8
22
[12400000].sportAffilliation.discriminants.runs
Array.Object.Enum.RLE.Prefix Varint
9
[12400000].fielderThrowsDesc.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
9
[12400000].fielderThrowsCode.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
4
[12400000].inPlayResult.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
4
[12400000].venueSurface.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
4
[12400000].sportAffilliation.discriminants.values
Array.Object.Enum.RLE.Simple16
4
[12400000].leagueName.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
4
[12400000].batterBatsDesc.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
4
[12400000].gameType.discriminants.runs
Array.Object.Enum.Bool RLE.Prefix Varint
4
[12400000].pitcherThrowsDesc.opt.runs
Array.Object.Nullable.Bool RLE.Prefix Varint
Largest by type:
90x 158310524 @ Simple16
41x 48837477 @ Prefix Varint
2x 15366783 @ Gorilla
4x 6200000 @ Packed Boolean
19x 1474283 @ UTF-8
Other: 2623
Total: 230191690
Nice. This is less than 1/3 the size of the last test. This is also a significant improvement over Tableau Hyper now. As expected, these changes made significant improvements to one of my benchmarks as well. You'd be surprised how common it is to have a CSV with very repetitive data.
There are still major compression improvements coming down the pike. For example, delta compression might make a big difference for the fielder column (currently the largest). What does the data in that column look like?
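To make the delta idea concrete, here is a minimal sketch in Rust. It is only an illustration of the general technique, not how Tree-Buf implements (or will implement) it, and the function names delta_encode and delta_decode are made up for this example. Whether it helps depends on the column: it pays off when neighboring values are close together, so the stored differences are small numbers that an integer packer can squeeze further.

```rust
// Hypothetical helpers for illustration only; not part of Tree-Buf.

/// Replace each value with its difference from the previous value.
/// The first value is stored as a difference from 0.
fn delta_encode(values: &[i64]) -> Vec<i64> {
    let mut prev = 0i64;
    values
        .iter()
        .map(|&v| {
            let d = v.wrapping_sub(prev);
            prev = v;
            d
        })
        .collect()
}

/// Rebuild the original values by keeping a running sum of the deltas.
fn delta_decode(deltas: &[i64]) -> Vec<i64> {
    let mut acc = 0i64;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}

fn main() {
    // Made-up ids that happen to be close together.
    let ids = [100, 101, 103, 103, 110];
    let deltas = delta_encode(&ids);
    assert_eq!(deltas, vec![100, 1, 2, 0, 7]);
    assert_eq!(delta_decode(&deltas), ids.to_vec());
}
```

If the ids in the column jump around a lot from row to row, the deltas will be as large as the originals and this buys nothing, which is why the shape of the data matters.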
Keep in mind that the binary format for Tree-Buf is still changing in ways that may be backward incompatible. Data persistence seems to be an important part of your use case, so this could cause problems when you migrate to new versions. When you are actually ready to start saving files, let's coordinate on how to handle that.
Looking forward to seeing how this fares on the main data set.
This is an epic ticket to support the requirements of BOSS - to store all of MLB in memory.
Based on discussions in #2, we've identified the following needs -
Considering that #5 is probably a ways off and that your source data comes from CSV, I'd recommend in the short term to implement a windowing system on top of Tree-Buf. Internally, the iterator would have a Vec<Vec<u8>> where each of the internal buffers is a standalone Tree-Buf file containing a large-ish (64k?) count of rows. Perhaps each buffer is just one GamePK set. I'm not that familiar with your data and what is ideal here. Feel free to experiment.
The iterator would deserialize one block at a time, yielding items from each buffer until exhausted (take a look at std::vec::IntoIter), then move on to the next file. This is essentially what Tree-Buf would do internally when such an iterator is implemented, but since Tree-Buf supports much more complicated structures than columnar data like CSV, it will take a lot longer to get the general-purpose system that Tree-Buf would need.
It should be pretty simple to take this and make an append-only format for your needs that internally is a list of Tree-Buf files.
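A rough sketch of that windowing iterator in Rust, under some assumptions: the type ChunkedRows and the helpers encode_u32_rows and decode_u32_rows are hypothetical names invented for this example, and the decode step is passed in as a closure so the sketch does not depend on any particular Tree-Buf call. In real use the closure would deserialize one standalone Tree-Buf buffer into a Vec of your row type; here a toy decoder that reads little-endian u32 values stands in for it.

```rust
/// Iterates over rows stored as a list of independently encoded chunks.
/// `decode` is a stand-in for whatever call turns one chunk
/// (e.g. one standalone Tree-Buf file) into a Vec of rows.
struct ChunkedRows<T, F>
where
    F: FnMut(&[u8]) -> Vec<T>,
{
    chunks: std::vec::IntoIter<Vec<u8>>,
    current: std::vec::IntoIter<T>,
    decode: F,
}

impl<T, F> ChunkedRows<T, F>
where
    F: FnMut(&[u8]) -> Vec<T>,
{
    fn new(chunks: Vec<Vec<u8>>, decode: F) -> Self {
        ChunkedRows {
            chunks: chunks.into_iter(),
            current: Vec::new().into_iter(), // nothing decoded yet
            decode,
        }
    }
}

impl<T, F> Iterator for ChunkedRows<T, F>
where
    F: FnMut(&[u8]) -> Vec<T>,
{
    type Item = T;

    fn next(&mut self) -> Option<T> {
        loop {
            // Yield from the chunk that is currently decoded, if any rows remain.
            if let Some(row) = self.current.next() {
                return Some(row);
            }
            // Otherwise decode the next chunk; stop when no chunks are left.
            let chunk = self.chunks.next()?;
            self.current = (self.decode)(&chunk).into_iter();
        }
    }
}

// Toy codec for the demo only: each row is one little-endian u32.
fn encode_u32_rows(rows: &[u32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(rows.len() * 4);
    for r in rows {
        out.extend_from_slice(&r.to_le_bytes());
    }
    out
}

fn decode_u32_rows(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|b| u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

fn main() {
    // Three chunks, one of them empty; the iterator skips over it.
    let chunks = vec![
        encode_u32_rows(&[1, 2]),
        Vec::new(),
        encode_u32_rows(&[3]),
    ];
    let rows: Vec<u32> = ChunkedRows::new(chunks, decode_u32_rows).collect();
    assert_eq!(rows, vec![1, 2, 3]);
}
```

Only one chunk is held in decoded form at a time, so peak memory is roughly the compressed buffers plus one decoded window. Appending a batch then amounts to pushing one more encoded buffer onto the list.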