That3Percent / tree-buf

An experimental serialization system written in Rust
MIT License

Support BOSS baseball data aggregator #6

Open That3Percent opened 4 years ago

That3Percent commented 4 years ago

This is an epic ticket to support the requirements of BOSS - to store all of MLB in memory.

Based on discussions in #2, we've identified the following needs -

Considering that #5 is probably a ways off and that your source data comes from CSV, I'd recommend implementing a windowing system on top of Tree-Buf in the short term. Internally, the iterator would hold a Vec<Vec<u8>>, where each of the inner buffers is a standalone Tree-Buf file containing a large-ish (64k?) number of rows. Perhaps each buffer is just one GamePK set. I'm not that familiar with your data or with what is ideal here, so feel free to experiment.

The iterator would deserialize one block at a time, yielding items from each buffer until it is exhausted (take a look at std::vec::IntoIter), then move on to the next file. This is essentially what Tree-Buf would do internally once such an iterator is implemented, but since Tree-Buf supports much more complicated structures than columnar data like CSV, it will take a lot longer to get the general-purpose system that Tree-Buf would need.
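To make the windowing idea concrete, here is a minimal sketch of such an iterator. It is not part of Tree-Buf: the name `Windowed` is made up for illustration, and the `decode` closure is a stand-in for whatever call turns one buffer's bytes into rows, so the sketch does not assume any particular Tree-Buf function.

```rust
use std::vec::IntoIter;

/// Yields rows across many independently encoded blocks, decoding one block at a time.
/// `decode` is a placeholder (an assumption of this sketch) for the call that
/// turns one block's bytes into rows; error handling is left out for brevity.
pub struct Windowed<T, F>
where
    F: FnMut(&[u8]) -> Vec<T>,
{
    blocks: IntoIter<Vec<u8>>, // encoded blocks not yet decoded
    current: IntoIter<T>,      // rows of the block currently being drained
    decode: F,
}

impl<T, F> Windowed<T, F>
where
    F: FnMut(&[u8]) -> Vec<T>,
{
    pub fn new(blocks: Vec<Vec<u8>>, decode: F) -> Self {
        Windowed {
            blocks: blocks.into_iter(),
            current: Vec::new().into_iter(),
            decode,
        }
    }
}

impl<T, F> Iterator for Windowed<T, F>
where
    F: FnMut(&[u8]) -> Vec<T>,
{
    type Item = T;

    fn next(&mut self) -> Option<T> {
        loop {
            // Keep yielding from the current block until it is exhausted.
            if let Some(row) = self.current.next() {
                return Some(row);
            }
            // Decode the next block, or finish when no blocks remain.
            let bytes = self.blocks.next()?;
            self.current = (self.decode)(&bytes).into_iter();
        }
    }
}
```

Only one decoded block is held in memory at a time, and the loop skips over blocks that decode to zero rows instead of ending the iteration early.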

It should be pretty simple to take this and make an append-only format for your needs that internally is a list of Tree-Buf files.
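One possible shape for that append-only container is a sequence of length-prefixed blocks. The layout below (an 8-byte little-endian length followed by the block's bytes) and the function names are assumptions for illustration, not a format that Tree-Buf defines.

```rust
/// Appends one encoded block to the container:
/// an 8-byte little-endian length, then the block's bytes.
pub fn append_block(container: &mut Vec<u8>, block: &[u8]) {
    container.extend_from_slice(&(block.len() as u64).to_le_bytes());
    container.extend_from_slice(block);
}

/// Splits a container back into its blocks without copying.
/// Returns None if the data ends in the middle of a header or a block.
pub fn split_blocks(mut data: &[u8]) -> Option<Vec<&[u8]>> {
    let mut blocks = Vec::new();
    while !data.is_empty() {
        if data.len() < 8 {
            return None; // truncated length header
        }
        let (header, rest) = data.split_at(8);
        let mut len_bytes = [0u8; 8];
        len_bytes.copy_from_slice(header);
        let len = u64::from_le_bytes(len_bytes);
        if (rest.len() as u64) < len {
            return None; // truncated block body
        }
        // The check above guarantees `len` fits in usize here.
        let (block, tail) = rest.split_at(len as usize);
        blocks.push(block);
        data = tail;
    }
    Some(blocks)
}
```

Appending a new batch never touches earlier bytes, so the same two functions work on a file opened in append mode as well as on an in-memory Vec<u8>.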

elibenporat commented 4 years ago

I think I'll have a clearer picture of #5 once #2 is resolved (assuming no other blockers for me), so I'm glad to hear that #5 is a ways off, as I don't want to lead you down a path that isn't actually needed.

In regards to append only/separate files - I'll note that my data has about 124M records in one data set (9 records for every ball hit into play in any affiliated baseball game since 2005) and the other has about 55M records (every pitch since 2005). I don't know what the implications of this are, or if it even matters? There are a lot of natural ways to chunk the data (years, months, days, MLB vs AAA vs AA etc.) so this is something to explore once I get it to work.

That3Percent commented 4 years ago

I think you should be unblocked now with #2 closed.

With the way Tree-Buf is implemented now, chunking is a great idea. Hopefully, this can be automatic someday. But with your need to append data in batches, manual chunking just makes sense. Keep me up-to-date on which chunking strategy works best with your data set, to inform the design of auto-chunking in #4.

Once you've got a file written, you can use an internal API to get diagnostics on the breakdown of the size of data per column:

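// tb_bytes: the bytes of a Tree-Buf file you have already written
let tree = tree_buf::internal::read_root(&tb_bytes);
// Debug-print the per-column breakdown; unwrap() panics if read_root did not succeed
dbg!(tree.unwrap());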

This will help inform us of what is working well for the compression and what future improvements will have the biggest bang for the buck. My bet is that RLE compression in #7 will make a big difference.
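For readers unfamiliar with RLE, here is a generic run-length encoding sketch. It only illustrates the technique and says nothing about how #7 will actually be designed; the function names are made up. The intuition is that a column whose value repeats across consecutive rows collapses into a few (value, count) pairs.

```rust
/// Run-length encodes a slice into (value, run_length) pairs.
pub fn rle_encode<T: PartialEq + Clone>(values: &[T]) -> Vec<(T, usize)> {
    let mut runs: Vec<(T, usize)> = Vec::new();
    for v in values {
        // Extend the current run if this value matches the previous one.
        if let Some((last, count)) = runs.last_mut() {
            if *last == *v {
                *count += 1;
                continue;
            }
        }
        // Otherwise start a new run of length 1.
        runs.push((v.clone(), 1));
    }
    runs
}

/// Expands (value, run_length) pairs back into the original sequence.
pub fn rle_decode<T: Clone>(runs: &[(T, usize)]) -> Vec<T> {
    let mut out = Vec::new();
    for (value, count) in runs {
        out.extend(std::iter::repeat(value.clone()).take(*count));
    }
    out
}
```

With 9 records for every ball hit into play, any field that is the same for all 9 of those records would presumably form runs of at least that length, which is the kind of data where this pays off.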

elibenporat commented 4 years ago

It works! I included the assert_eq! from your Readme and that passed.

I tested it on a subset of the data (about 10% of the 124 million records). Converting to tree-buf took about 45 seconds (which is reasonable from my perspective).

Original CSV: 48 GB (so it pulled in about 4.8 GB)
Tableau .hyper: 6.2 GB total (when it converted the entire CSV)
tree-buf: 2.7 GB on the sample set (compared to the roughly 4.8 GB sample)

Compression is a WIP I assume?
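To put those sizes on a common footing, the arithmetic can be written out as below. The function names are made up, and the projection assumes the sample's ratio would hold for the full data set, which is only a rough guess.

```rust
/// Output size as a fraction of input size (smaller is better).
pub fn size_ratio(output_gb: f64, input_gb: f64) -> f64 {
    output_gb / input_gb
}

/// Projects the full output size, ASSUMING the sample's ratio
/// carries over unchanged to the whole input.
pub fn project_full_size(sample_ratio: f64, full_input_gb: f64) -> f64 {
    sample_ratio * full_input_gb
}
```

With the figures above, tree-buf comes to 2.7 / 4.8 ≈ 0.56 of the CSV sample, while .hyper comes to 6.2 / 48 ≈ 0.13 of the full CSV. Under the same-ratio assumption, the full 48 GB would land at roughly 27 GB in tree-buf as it stands today.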

Edit: forgot to run the diagnostics, let me do that now

elibenporat commented 4 years ago

tree.unwrap() = Array {
    len: 12400000,
    values: Object {
        fields: {
            "teamId": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        18719224,
                    ),
                    encoding: Simple16,
                },
            ),
            "fielderHeightIn": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        12400000,
                    ),
                    encoding: Simple16,
                },
            ),
            "venueLeftCenter": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            3655860,
                        ),
                        encoding: Simple16,
                    },
                ),
            },
            "batter": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        37200000,
                    ),
                    encoding: PrefixVarInt,
                },
            ),
            "gameType": Enum {
                discriminants: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            0,
                        ),
                        encoding: Simple16,
                    },
                ),
                variants: [
                    ArrayEnumVariant {
                        ident: "r",
                        data: Void,
                    },
                ],
            },
            "venueId": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        22177800,
                    ),
                    encoding: Simple16,
                },
            ),
            "runs": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        1877132,
                    ),
                    encoding: Simple16,
                },
            ),
            "batterBatsDesc": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Enum {
                    discriminants: Integer(
                        ArrayInteger {
                            bytes: Bytes(
                                1771432,
                            ),
                            encoding: Simple16,
                        },
                    ),
                    variants: [
                        ArrayEnumVariant {
                            ident: "left",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "right",
                            data: Void,
                        },
                    ],
                },
            },
            "hitDataExitVelocity": Void,
            "fieldedById": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            35991429,
                        ),
                        encoding: PrefixVarInt,
                    },
                ),
            },
            "fielderBirthCountry": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: String(
                    Bytes(
                        81080372,
                    ),
                ),
            },
            "venueRightCenter": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            3655860,
                        ),
                        encoding: Simple16,
                    },
                ),
            },
            "pitcherThrowsDesc": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Enum {
                    discriminants: Integer(
                        ArrayInteger {
                            bytes: Bytes(
                                1771432,
                            ),
                            encoding: Simple16,
                        },
                    ),
                    variants: [
                        ArrayEnumVariant {
                            ident: "left",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "right",
                            data: Void,
                        },
                    ],
                },
            },
            "venueSurface": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Enum {
                    discriminants: Integer(
                        ArrayInteger {
                            bytes: Bytes(
                                1771432,
                            ),
                            encoding: Simple16,
                        },
                    ),
                    variants: [
                        ArrayEnumVariant {
                            ident: "grass",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "artificial",
                            data: Void,
                        },
                    ],
                },
            },
            "fielderDraftPickNumber": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            13676816,
                        ),
                        encoding: Simple16,
                    },
                ),
            },
            "fielderThrowsCode": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Enum {
                    discriminants: Integer(
                        ArrayInteger {
                            bytes: Bytes(
                                1771428,
                            ),
                            encoding: Simple16,
                        },
                    ),
                    variants: [
                        ArrayEnumVariant {
                            ident: "r",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "l",
                            data: Void,
                        },
                    ],
                },
            },
            "pitcherName": String(
                Bytes(
                    169118682,
                ),
            ),
            "sportCode": String(
                Bytes(
                    49600000,
                ),
            ),
            "sportName": String(
                Bytes(
                    113485368,
                ),
            ),
            "parentTeamId": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        13837484,
                    ),
                    encoding: Simple16,
                },
            ),
            "pitcher": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        37200000,
                    ),
                    encoding: PrefixVarInt,
                },
            ),
            "hitDataCalcDistance": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Float(
                    DoubleGorilla(
                        Bytes(
                            7561831,
                        ),
                    ),
                ),
            },
            "venueName": String(
                Bytes(
                    236053544,
                ),
            ),
            "parentTeamName": String(
                Bytes(
                    203484436,
                ),
            ),
            "venueRightLine": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            15794304,
                        ),
                        encoding: Simple16,
                    },
                ),
            },
            "hitDataSprayAngle": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Float(
                    DoubleGorilla(
                        Bytes(
                            7804952,
                        ),
                    ),
                ),
            },
            "venueRetrosheetId": String(
                Bytes(
                    25633194,
                ),
            ),
            "outsEnd": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        3171240,
                    ),
                    encoding: Simple16,
                },
            ),
            "fielderWeight": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            16532812,
                        ),
                        encoding: Simple16,
                    },
                ),
            },
            "fielderName": String(
                Bytes(
                    169622188,
                ),
            ),
            "fielderCollegeName": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: String(
                    Bytes(
                        94095404,
                    ),
                ),
            },
            "batterBats": Enum {
                discriminants: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            1771432,
                        ),
                        encoding: Simple16,
                    },
                ),
                variants: [
                    ArrayEnumVariant {
                        ident: "l",
                        data: Void,
                    },
                    ArrayEnumVariant {
                        ident: "r",
                        data: Void,
                    },
                ],
            },
            "fieldedByPos": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Enum {
                    discriminants: Integer(
                        ArrayInteger {
                            bytes: Bytes(
                                4796112,
                            ),
                            encoding: Simple16,
                        },
                    ),
                    variants: [
                        ArrayEnumVariant {
                            ident: "secondBase",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "centerField",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "thirdBase",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "rightField",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "leftField",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "shortStop",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "firstBase",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "catcher",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "pitcher",
                            data: Void,
                        },
                    ],
                },
            },
            "gameDate": String(
                Bytes(
                    120400104,
                ),
            ),
            "sportAbbr": String(
                Bytes(
                    39705403,
                ),
            ),
            "baseValueStart": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        2935120,
                    ),
                    encoding: Simple16,
                },
            ),
            "hitDataTrajectory": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Enum {
                    discriminants: Integer(
                        ArrayInteger {
                            bytes: Bytes(
                                2483040,
                            ),
                            encoding: Simple16,
                        },
                    ),
                    variants: [
                        ArrayEnumVariant {
                            ident: "groundBall",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "flyBall",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "lineDrive",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "popUp",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "unknown",
                            data: Void,
                        },
                    ],
                },
            },
            "fielderMlbDebut": String(
                Bytes(
                    66528692,
                ),
            ),
            "leagueName": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: String(
                    Bytes(
                        220077042,
                    ),
                ),
            },
            "sportLevelOfPlay": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        3355136,
                    ),
                    encoding: Simple16,
                },
            ),
            "hitDataTotalDistance": Void,
            "fieldedByName": String(
                Bytes(
                    164744437,
                ),
            ),
            "fielderHeightStr": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: String(
                    Bytes(
                        76926492,
                    ),
                ),
            },
            "venueRight": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            1410348,
                        ),
                        encoding: Simple16,
                    },
                ),
            },
            "hitDataLaunchAngle": Void,
            "ballsStart": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        2064116,
                    ),
                    encoding: Simple16,
                },
            ),
            "hitDataContactQuality": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Enum {
                    discriminants: Integer(
                        ArrayInteger {
                            bytes: Bytes(
                                1794092,
                            ),
                            encoding: Simple16,
                        },
                    ),
                    variants: [
                        ArrayEnumVariant {
                            ident: "medium",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "soft",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "hard",
                            data: Void,
                        },
                    ],
                },
            },
            "pitcherThrows": Enum {
                discriminants: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            1771432,
                        ),
                        encoding: Simple16,
                    },
                ),
                variants: [
                    ArrayEnumVariant {
                        ident: "l",
                        data: Void,
                    },
                    ArrayEnumVariant {
                        ident: "r",
                        data: Void,
                    },
                ],
            },
            "strikesStart": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        2074300,
                    ),
                    encoding: Simple16,
                },
            ),
            "fielder": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        37200000,
                    ),
                    encoding: PrefixVarInt,
                },
            ),
            "baseValueEnd": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        3177616,
                    ),
                    encoding: Simple16,
                },
            ),
            "venueRoof": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Enum {
                    discriminants: Integer(
                        ArrayInteger {
                            bytes: Bytes(
                                1170912,
                            ),
                            encoding: Simple16,
                        },
                    ),
                    variants: [
                        ArrayEnumVariant {
                            ident: "open",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "retractable",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "dome",
                            data: Void,
                        },
                    ],
                },
            },
            "venueLeftLine": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            15794304,
                        ),
                        encoding: Simple16,
                    },
                ),
            },
            "venueLeft": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            1742496,
                        ),
                        encoding: Simple16,
                    },
                ),
            },
            "sportAffilliation": Enum {
                discriminants: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            1773876,
                        ),
                        encoding: Simple16,
                    },
                ),
                variants: [
                    ArrayEnumVariant {
                        ident: "mlb",
                        data: Void,
                    },
                    ArrayEnumVariant {
                        ident: "minors",
                        data: Void,
                    },
                    ArrayEnumVariant {
                        ident: "unaffiliated",
                        data: Void,
                    },
                ],
            },
            "venueCenter": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            15700728,
                        ),
                        encoding: Simple16,
                    },
                ),
            },
            "venueCity": String(
                Bytes(
                    118474096,
                ),
            ),
            "fielderDob": String(
                Bytes(
                    123625176,
                ),
            ),
            "doublePlayOpp": Boolean(
                Bytes(
                    1550000,
                ),
            ),
            "batterName": String(
                Bytes(
                    169605188,
                ),
            ),
            "venueCapacity": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            26901095,
                        ),
                        encoding: PrefixVarInt,
                    },
                ),
            },
            "sportId": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        6263900,
                    ),
                    encoding: Simple16,
                },
            ),
            "inPlayResult": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Enum {
                    discriminants: Integer(
                        ArrayInteger {
                            bytes: Bytes(
                                4651356,
                            ),
                            encoding: Simple16,
                        },
                    ),
                    variants: [
                        ArrayEnumVariant {
                            ident: "groundOut",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "single",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "flyOut",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "forceOut",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "double",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "sacFly",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "fieldError",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "doublePlay",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "popOut",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "lineOut",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "homeRun",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "triple",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "buntPopOut",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "sacBunt",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "batterInterference",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "fieldersChoice",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "buntGroundOut",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "fanInterference",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "triplePlay",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "sacFlyDoublePlay",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "other",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "strikeOut",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "pitchingSubstitution",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "walk",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "catcherInterference",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "hitByPitch",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "intentionalWalk",
                            data: Void,
                        },
                    ],
                },
            },
            "outsStart": Integer(
                ArrayInteger {
                    bytes: Bytes(
                        2696628,
                    ),
                    encoding: Simple16,
                },
            ),
            "position": Enum {
                discriminants: Integer(
                    ArrayInteger {
                        bytes: Bytes(
                            5511164,
                        ),
                        encoding: Simple16,
                    },
                ),
                variants: [
                    ArrayEnumVariant {
                        ident: "catcher",
                        data: Void,
                    },
                    ArrayEnumVariant {
                        ident: "firstBase",
                        data: Void,
                    },
                    ArrayEnumVariant {
                        ident: "secondBase",
                        data: Void,
                    },
                    ArrayEnumVariant {
                        ident: "thirdBase",
                        data: Void,
                    },
                    ArrayEnumVariant {
                        ident: "shortStop",
                        data: Void,
                    },
                    ArrayEnumVariant {
                        ident: "leftField",
                        data: Void,
                    },
                    ArrayEnumVariant {
                        ident: "rightField",
                        data: Void,
                    },
                    ArrayEnumVariant {
                        ident: "centerField",
                        data: Void,
                    },
                    ArrayEnumVariant {
                        ident: "pitcher",
                        data: Void,
                    },
                ],
            },
            "teamName": String(
                Bytes(
                    225266477,
                ),
            ),
            "fielderThrowsDesc": Nullable {
                opt: Bytes(
                    1550000,
                ),
                values: Enum {
                    discriminants: Integer(
                        ArrayInteger {
                            bytes: Bytes(
                                1771428,
                            ),
                            encoding: Simple16,
                        },
                    ),
                    variants: [
                        ArrayEnumVariant {
                            ident: "right",
                            data: Void,
                        },
                        ArrayEnumVariant {
                            ident: "left",
                            data: Void,
                        },
                    ],
                },
            },
        },
    },
}
That3Percent commented 4 years ago

This is great progress!

Yes, you are right that the compression (and everything else) is WIP. There are all kinds of possible improvements - but one of the principles of Tree-Buf is that its design is data-driven. This data identifies what compression features will give the biggest bang for the buck.

A few points stand out:

There's probably a bunch more insight here, but this is enough to be busy for a while and re-evaluate after these are implemented. I'm in the middle of writing the Gorilla encoder from scratch to be a lot faster. Next I will spin up issues for all of these.

That3Percent commented 4 years ago

Added:

#9, #11, #12, #13

elibenporat commented 4 years ago

Excited to see the results once dictionary compression is in, that should provide huge wins for this data set which is highly repetitive (mostly names). Honestly, just being able to slap a few derives and then read/write is incredible ergonomically and opens up a lot of possibilities for me.

I'll see if I can get the main data set (much wider and bulkier, less repetitive) to work as well. Do you want me to post the diagnostics in #8?

That3Percent commented 4 years ago

Let's wait until the new size diagnostics are available, then post a sample of the main data set in this issue. It's easier to track the BOSS story here since that issue is going to be closed.

That3Percent commented 4 years ago

We now have dictionary compression and the new size diagnostics API on master.

You can now use:

let sizes = tree_buf::experimental::stats::size_breakdown(&tb_bytes);
println!("{}", sizes.unwrap());

And it will print something like...

Largest by path:
        32000 U8 Fixed data.orders.id
        5000 Prefix Varint data.orders.price
        5000 Prefix Varint data.orders.createdAt
        2836 UTF-8 data.orders.nft.wearable.representationId.values
        2452 UTF-8 data.orders.nft.wearable.name.values
        1014 Prefix Varint data.orders.nft.wearable.representationId.indices
        1013 Prefix Varint data.orders.nft.wearable.name.indices
        1000 Prefix Varint data.orders.nft.wearable.collection.indices
        420 Simple16 data.orders.nft.wearable.category
        288 Simple16 data.orders.nft.wearable.rarity
        288 Simple16 data.orders.nft.wearable.bodyShapes.len
        272 Simple16 data.orders.nft.wearable.bodyShapes.values
        268 Simple16 data.orders.status
        85 UTF-8 data.orders.nft.wearable.collection.values
        0 Simple16 data.orders.nft.wearable.owner.mana

Largest by type:
         1x 32000 @ U8 Fixed
         5x 13027 @ Prefix Varint
         3x 5373 @ UTF-8
         6x 1536 @ Simple16

Other: 403
Total: 52339

I expect that the dictionary compression, while not yet perfect, will still be a huge reduction in the size of the file.
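For anyone following along, here is a minimal sketch of the idea behind dictionary compression, not the actual Tree-Buf implementation. It splits a string column into a list of unique values and a per-row index, which mirrors the `.values` / `.indices` paths in the size breakdown. The function names and the sample team names are made up for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical helper: split a column into its unique values (in order of
/// first appearance) and one index per row pointing into that list.
fn dictionary_encode(column: &[&str]) -> (Vec<String>, Vec<u32>) {
    let mut values: Vec<String> = Vec::new();
    let mut lookup: HashMap<&str, u32> = HashMap::new();
    let mut indices = Vec::with_capacity(column.len());
    for &s in column {
        // Index the string would get if it has not been seen before.
        let next = values.len() as u32;
        let idx = *lookup.entry(s).or_insert_with(|| {
            values.push(s.to_string());
            next
        });
        indices.push(idx);
    }
    (values, indices)
}

/// Hypothetical helper: rebuild the original column from values + indices.
fn dictionary_decode(values: &[String], indices: &[u32]) -> Vec<String> {
    indices.iter().map(|&i| values[i as usize].clone()).collect()
}

fn main() {
    // Made-up sample rows; a real column would have millions of entries.
    let column = ["Mets", "Yankees", "Mets", "Mets", "Yankees"];
    let (values, indices) = dictionary_encode(&column);
    println!("values = {:?}, indices = {:?}", values, indices);
    assert_eq!(dictionary_decode(&values, &indices), column);
}
```

Each distinct string is stored once, so the cost of a highly repetitive column moves almost entirely into the integer indices, which can then be compressed further.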

elibenporat commented 4 years ago

You work fast!

I get the following error when compiling against #217a8b22:

error[E0599]: no associated item named `MAX` found for type `usize` in the current scope
  --> C:\Users\Eli\.cargo\git\checkouts\tree-buf-402f6dec423c055a\217a8b2\tree-buf\src\experimental\stats.rs:53:40
   |
53 |         by_type.sort_by_key(|i| usize::MAX - i.1.size);
   |                                        ^^^ associated item not found in `usize`
   |
help: you are looking for the module in `std`, not the primitive type
   |
53 |         by_type.sort_by_key(|i| std::usize::MAX - i.1.size);
   |                                 ^^^^^^^^^^^^^^^

error: aborting due to 2 previous errors
That3Percent commented 4 years ago

Run rustup update. The usize::MAX associated constant was stabilized in Rust 1.43.0.
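As an aside, a descending sort like the one in that error message can also be written without usize::MAX at all, which sidesteps the toolchain version question. This is only a sketch of the alternative using std::cmp::Reverse; the `Entry` struct and `sort_largest_first` function are hypothetical stand-ins and not the types used in stats.rs.

```rust
use std::cmp::Reverse;

/// Hypothetical stand-in for one row of the size breakdown.
#[derive(Debug, PartialEq)]
struct Entry {
    name: &'static str,
    size: usize,
}

/// Sort largest-first. `Reverse` flips the ordering of the key, giving the
/// same order as sorting by `usize::MAX - size`, and works on older compilers.
fn sort_largest_first(entries: &mut Vec<Entry>) {
    entries.sort_by_key(|e| Reverse(e.size));
}

fn main() {
    let mut entries = vec![
        Entry { name: "Simple16", size: 1536 },
        Entry { name: "U8 Fixed", size: 32000 },
        Entry { name: "UTF-8", size: 5373 },
    ];
    sort_largest_first(&mut entries);
    for e in &entries {
        println!("{:>8} @ {}", e.size, e.name);
    }
}
```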

elibenporat commented 4 years ago

Must have missed that update. We're down to 735MB from the 2.7GB version tested last time (CSV is about 4.7GB). This is very close to the level of compression Tableau got. Diagnostics in next comment.

elibenporat commented 4 years ago
Largest by path:
        37200000 Prefix Varint fielder
        37200000 Prefix Varint batter
        37200000 Prefix Varint pitcher
        35991429 Prefix Varint fieldedById.values
        26901095 Prefix Varint venueCapacity.values
        24242995 Prefix Varint fielderName.indices
        24139292 Prefix Varint pitcherName.indices
        24079353 Prefix Varint fielderDob.indices
        24017288 Prefix Varint batterName.indices
        23869379 Prefix Varint fielderMlbDebut.indices
        23597718 Prefix Varint fieldedByName.indices
        22177800 Simple16 venueId
        19310669 Prefix Varint gameDate.indices
        18719224 Simple16 teamId
        16532812 Simple16 fielderWeight.values
        16158292 Prefix Varint teamName.indices
        16082701 Prefix Varint venueName.indices
        15794304 Simple16 venueLeftLine.values
        15794304 Simple16 venueRightLine.values
        15709408 Prefix Varint venueCity.indices
        15700728 Simple16 venueCenter.values
        13837484 Simple16 parentTeamId
        13676816 Simple16 fielderDraftPickNumber.values
        12400000 Prefix Varint sportName.indices
        12400000 Simple16 fielderHeightIn
        12400000 Prefix Varint sportCode.indices
        12400000 Prefix Varint sportAbbr.indices
        12400000 Prefix Varint leagueName.values.indices
        12399166 Prefix Varint fielderBirthCountry.values.indices
        12398834 Prefix Varint fielderHeightStr.values.indices
        12398748 Prefix Varint parentTeamName.indices
        12366879 Prefix Varint venueRetrosheetId.indices
        9916026 Prefix Varint fielderCollegeName.values.indices
        7804952 Gorilla hitDataSprayAngle.values
        7561831 Gorilla hitDataCalcDistance.values
        6263900 Simple16 sportId
        5511164 Simple16 position
        4796112 Simple16 fieldedByPos.values
        4651356 Simple16 inPlayResult.values
        3655860 Simple16 venueLeftCenter.values
        3655860 Simple16 venueRightCenter.values
        3355136 Simple16 sportLevelOfPlay
        3177616 Simple16 baseValueEnd
        3171240 Simple16 outsEnd
        2935120 Simple16 baseValueStart
        2696628 Simple16 outsStart
        2483040 Simple16 hitDataTrajectory.values
        2074300 Simple16 strikesStart
        2064116 Simple16 ballsStart
        1877132 Simple16 runs
        1794092 Simple16 hitDataContactQuality.values
        1773876 Simple16 sportAffilliation
        1771432 Simple16 batterBats
        1771432 Simple16 pitcherThrowsDesc.values
        1771432 Simple16 batterBatsDesc.values
        1771432 Simple16 venueSurface.values
        1771432 Simple16 pitcherThrows
        1771428 Simple16 fielderThrowsDesc.values
        1771428 Simple16 fielderThrowsCode.values
        1742496 Simple16 venueLeft.values
        1550000 Packed Boolean fielderDraftPickNumber
        1550000 Packed Boolean fielderCollegeName
        1550000 Packed Boolean leagueName
        1550000 Packed Boolean fielderBirthCountry
        1550000 Packed Boolean fielderThrowsDesc
        1550000 Packed Boolean fielderHeightStr
        1550000 Packed Boolean venueSurface
        1550000 Packed Boolean hitDataContactQuality
        1550000 Packed Boolean batterBatsDesc
        1550000 Packed Boolean doublePlayOpp
        1550000 Packed Boolean venueRoof
        1550000 Packed Boolean hitDataTrajectory
        1550000 Packed Boolean venueLeftCenter
        1550000 Packed Boolean pitcherThrowsDesc
        1550000 Packed Boolean fielderWeight
        1550000 Packed Boolean venueLeft
        1550000 Packed Boolean venueCenter
        1550000 Packed Boolean hitDataSprayAngle
        1550000 Packed Boolean venueRight
        1550000 Packed Boolean fieldedById
        1550000 Packed Boolean hitDataCalcDistance
        1550000 Packed Boolean fielderThrowsCode
        1550000 Packed Boolean venueCapacity
        1550000 Packed Boolean venueRightCenter
        1550000 Packed Boolean venueRightLine
        1550000 Packed Boolean venueLeftLine
        1550000 Packed Boolean fieldedByPos
        1550000 Packed Boolean inPlayResult
        1410348 Simple16 venueRight.values
        1170912 Simple16 venueRoof.values
        116871 UTF-8 fielderName.values
        112456 UTF-8 fieldedByName.values
        70674 UTF-8 batterName.values
        63152 UTF-8 pitcherName.values
        42997 UTF-8 fielderDob.values
        16604 UTF-8 fielderMlbDebut.values
        13779 UTF-8 fielderCollegeName.values.values
        4517 UTF-8 teamName.values
        4495 UTF-8 venueName.values
        4297 UTF-8 gameDate.values
        2003 UTF-8 venueCity.values
        725 UTF-8 parentTeamName.values
        398 UTF-8 leagueName.values.values
        343 UTF-8 fielderBirthCountry.values.values
        217 UTF-8 venueRetrosheetId.values
        148 UTF-8 fielderHeightStr.values.values
        113 UTF-8 sportName.values
        36 UTF-8 sportCode.values
        29 UTF-8 sportAbbr.values
        0 Simple16 gameType

Largest by type:
         24x 494779272 @ Prefix Varint
         37x 217293792 @ Simple16
         28x 43400000 @ Packed Boolean
         2x 15366783 @ Gorilla
         19x 453854 @ UTF-8

Other: 2227
Total: 771295928
That3Percent commented 4 years ago

Pushed a couple of improvements:

I don't expect these to move the needle too much, but they should help. I've taken on a bit too much technical debt: adding other features is running into confusing problems and hacks (actually these too). So this is a good time to checkpoint and let you know this is as good as it's going to get for a short while, until I pay off some of that debt. Shouldn't take too long.

elibenporat commented 4 years ago

It goes without saying, but please don't feel any pressure on account of me. Take your time and enjoy the process.

I think https://github.com/That3Percent/tree-buf/commit/1afd77827bb74b4ae20c7c13c154037fda796a67 introduced a bug. File de-compressed fine for https://github.com/That3Percent/tree-buf/commit/217a8b229f130b70f27a79975389c9d19d9cd186 but throws the following error for all revisions after that:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidFormat'

Edit: realized it might be helpful to have my code in here:

println!("Converting to treebuf...");
let bytes = write(&defense_data);

println!("Checking file integrity...");
let copy: Vec<boss::defense::Defense> = read(&bytes).unwrap();
assert_eq!(&copy, &defense_data);
That3Percent commented 4 years ago

Thanks for understanding! I've been setting up this "restaurant" for 6 months - aging the spices and slow cooking the sauces. You're my first customer so I want to make sure you are happy!

Pretty sure I fixed the problem, and there are some new compression features for you as well:

Still going to need to do some cleanup on my end soon because it's a hack that fixes the problem.

elibenporat commented 4 years ago

At this point, it's already passed the "MVP" stage for what I need, so anything else is (slow cooked) gravy. I know how important having a good test data set is, especially one that hasn't been developed against, so this is my way of contributing.

Updated diagnostics to follow. It looks like RLE had a rather large benefit on my data set, but that is likely due to the very repetitive nature of my data (I doubt other sets will be like this). When I get around to getting the main data set to work, I'll post that.
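To make the RLE effect concrete, here is a rough sketch of run-length encoding as I understand the general technique. It is an illustration only and not Tree-Buf's encoder. It assumes each run is stored as one entry in `values` plus a matching count in `runs`, which is my reading of the `.values` / `.runs` paths in the diagnostics; the function names and sample data are made up.

```rust
/// Hypothetical helper: collapse consecutive repeats into (values, runs),
/// where runs[i] is how many times values[i] repeats in a row.
fn rle_encode<T: PartialEq + Clone>(column: &[T]) -> (Vec<T>, Vec<u32>) {
    let mut values: Vec<T> = Vec::new();
    let mut runs: Vec<u32> = Vec::new();
    for item in column {
        if values.last() == Some(item) {
            // Same as the previous row: extend the current run.
            if let Some(run) = runs.last_mut() {
                *run += 1;
            }
        } else {
            // New value: start a run of length 1.
            values.push(item.clone());
            runs.push(1);
        }
    }
    (values, runs)
}

/// Hypothetical helper: expand (values, runs) back into the original column.
fn rle_decode<T: Clone>(values: &[T], runs: &[u32]) -> Vec<T> {
    values
        .iter()
        .zip(runs)
        .flat_map(|(v, &n)| std::iter::repeat(v.clone()).take(n as usize))
        .collect()
}

fn main() {
    // Made-up venue ids: rows from the same game share the same venue.
    let venue_id = [15u32, 15, 15, 15, 2392, 2392, 15];
    let (values, runs) = rle_encode(&venue_id);
    println!("values = {:?}, runs = {:?}", values, runs);
    assert_eq!(rle_decode(&values, &runs), venue_id);
}
```

Columns that stay constant across many consecutive rows (venue, team, sport) shrink to a handful of entries, which matches how small those paths are in the breakdown that follows.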

elibenporat commented 4 years ago

Same sample size as above:

Largest by path:
        37200000
           [12400000].fielder
           Array.Object.Prefix Varint
        23841192
           [12400000].fielderName.indices
           Array.Object.Dictionary.Simple16
        23704360
           [12400000].fielderDob.indices
           Array.Object.Dictionary.Simple16
        23389660
           [12400000].fielderMlbDebut.indices
           Array.Object.Dictionary.Simple16
        16532812
           [12400000].fielderWeight.values
           Array.Object.Nullable.Simple16
        13676816
           [12400000].fielderDraftPickNumber.values
           Array.Object.Nullable.Simple16
        12400000
           [12400000].fielderHeightIn
           Array.Object.Simple16
        8845128
           [12400000].fielderCollegeName.values.indices
           Array.Object.Nullable.Dictionary.Simple16
        7804952
           [12400000].hitDataSprayAngle.values
           Array.Object.Nullable.Gorilla
        7561831
           [12400000].hitDataCalcDistance.values
           Array.Object.Nullable.Gorilla
        6510716
           [12400000].fielderHeightStr.values.indices
           Array.Object.Nullable.Dictionary.Simple16
        5540052
           [12400000].fielderBirthCountry.values.indices
           Array.Object.Nullable.Dictionary.Simple16
        5511164
           [12400000].position.discriminants
           Array.Object.Enum.Simple16
        4131528
           [12400000].batter.values
           Array.Object.RLE.Prefix Varint
        3615321
           [12400000].fieldedById.values.values
           Array.Object.Nullable.RLE.Prefix Varint
        2606724
           [12400000].batterName.indices.values
           Array.Object.Dictionary.RLE.Simple16
        2383628
           [12400000].fieldedByName.indices.values
           Array.Object.Dictionary.RLE.Simple16
        1550000
           [12400000].fielderThrowsCode.values.discriminants
           Array.Object.Nullable.Enum.Packed Boolean
        1550000
           [12400000].fielderThrowsDesc.values.discriminants
           Array.Object.Nullable.Enum.Packed Boolean
        1550000
           [12400000].fielderDraftPickNumber.opt
           Array.Object.Nullable.Packed Boolean
        1550000
           [12400000].fielderCollegeName.opt
           Array.Object.Nullable.Packed Boolean
        1395120
           [12400000].pitcher.values
           Array.Object.RLE.Prefix Varint
        877604
           [12400000].pitcherName.indices.values
           Array.Object.Dictionary.RLE.Simple16
        708872
           [12400000].teamId.values
           Array.Object.RLE.Simple16
        662888
           [12400000].batterBatsDesc.values.discriminants.runs
           Array.Object.Nullable.Enum.Bool RLE.Prefix Varint
        662888
           [12400000].batterBats.discriminants.runs
           Array.Object.Enum.Bool RLE.Prefix Varint
        595724
           [12400000].inPlayResult.values.discriminants.values
           Array.Object.Nullable.Enum.RLE.Simple16
        582020
           [12400000].baseValueEnd.runs
           Array.Object.RLE.Simple16
        544268
           [12400000].fieldedByPos.values.discriminants.values
           Array.Object.Nullable.Enum.RLE.Simple16
        512636
           [12400000].parentTeamId.values
           Array.Object.RLE.Simple16
        454928
           [12400000].teamName.indices.values
           Array.Object.Dictionary.RLE.Simple16
        406054
           [12400000].doublePlayOpp.runs
           Array.Object.Bool RLE.Prefix Varint
        395168
           [12400000].pitcherName.indices.runs
           Array.Object.Dictionary.RLE.Simple16
        395168
           [12400000].pitcher.runs
           Array.Object.RLE.Simple16
        385184
           [12400000].venueName.values
           Array.Object.RLE.UTF-8
        372556
           [12400000].teamName.indices.runs
           Array.Object.Dictionary.RLE.Simple16
        372556
           [12400000].teamId.runs
           Array.Object.RLE.Simple16
        366144
           [12400000].hitDataTrajectory.values.discriminants.runs.values
           Array.Object.Nullable.Enum.RLE.RLE.Simple16
        359092
           [12400000].outsStart.runs.values
           Array.Object.RLE.RLE.Simple16
        356528
           [12400000].baseValueStart.runs.values
           Array.Object.RLE.RLE.Simple16
        349784
           [12400000].parentTeamName.indices.runs
           Array.Object.Dictionary.RLE.Simple16
        349784
           [12400000].parentTeamId.runs
           Array.Object.RLE.Simple16
        341588
           [12400000].baseValueStart.values
           Array.Object.RLE.Simple16
        333332
           [12400000].baseValueEnd.values
           Array.Object.RLE.Simple16
        315972
           [12400000].outsEnd.runs.values
           Array.Object.RLE.RLE.Simple16
        305056
           [12400000].runs.runs
           Array.Object.RLE.Simple16
        300360
           [12400000].outsEnd.values
           Array.Object.RLE.Simple16
        292776
           [12400000].parentTeamName.indices.values
           Array.Object.Dictionary.RLE.Simple16
        271748
           [12400000].outsStart.values
           Array.Object.RLE.Simple16
        270036
           [12400000].inPlayResult.values.discriminants.runs.values
           Array.Object.Nullable.Enum.RLE.RLE.Simple16
        249124
           [12400000].hitDataTrajectory.values.discriminants.values
           Array.Object.Nullable.Enum.RLE.Simple16
        221728
           [12400000].fieldedByPos.values.discriminants.runs.values
           Array.Object.Nullable.Enum.RLE.RLE.Simple16
        205095
           [12400000].pitcherThrows.discriminants.runs
           Array.Object.Enum.Bool RLE.Prefix Varint
        205095
           [12400000].pitcherThrowsDesc.values.discriminants.runs
           Array.Object.Nullable.Enum.Bool RLE.Prefix Varint
        204077
           [12400000].leagueName.values.values
           Array.Object.Nullable.RLE.UTF-8
        191113
           [12400000].venueCity.values
           Array.Object.RLE.UTF-8
        186684
           [12400000].outsStart.runs.runs
           Array.Object.RLE.RLE.Simple16
        185620
           [12400000].outsEnd.runs.runs
           Array.Object.RLE.RLE.Simple16
        178216
           [12400000].hitDataTrajectory.values.discriminants.runs.runs
           Array.Object.Nullable.Enum.RLE.RLE.Simple16
        172820
           [12400000].inPlayResult.values.discriminants.runs.runs
           Array.Object.Nullable.Enum.RLE.RLE.Simple16
        169568
           [12400000].fieldedById.values.runs.values
           Array.Object.Nullable.RLE.RLE.Simple16
        168620
           [12400000].fieldedByName.indices.runs.values
           Array.Object.Dictionary.RLE.RLE.Simple16
        164660
           [12400000].baseValueStart.runs.runs
           Array.Object.RLE.RLE.Simple16
        163968
           [12400000].ballsStart.runs.values
           Array.Object.RLE.RLE.Simple16
        153476
           [12400000].fieldedByPos.values.discriminants.runs.runs
           Array.Object.Nullable.Enum.RLE.RLE.Simple16
        148692
           [12400000].strikesStart.runs.values
           Array.Object.RLE.RLE.Simple16
        131372
           [12400000].fieldedByName.indices.runs.runs
           Array.Object.Dictionary.RLE.RLE.Simple16
        130288
           [12400000].fieldedById.values.runs.runs
           Array.Object.Nullable.RLE.RLE.Simple16
        116871
           [12400000].fielderName.values
           Array.Object.Dictionary.UTF-8
        112456
           [12400000].fieldedByName.values
           Array.Object.Dictionary.UTF-8
        111558
           [12400000].fieldedByPos.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        111558
           [12400000].fieldedById.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        90788
           [12400000].ballsStart.values
           Array.Object.RLE.Simple16
        89565
           [12400000].sportName.values
           Array.Object.RLE.UTF-8
        81796
           [12400000].strikesStart.values
           Array.Object.RLE.Simple16
        77612
           [12400000].runs.values
           Array.Object.RLE.Simple16
        74787
           [12400000].gameDate.values
           Array.Object.RLE.UTF-8
        70674
           [12400000].batterName.values
           Array.Object.Dictionary.UTF-8
        63152
           [12400000].pitcherName.values
           Array.Object.Dictionary.UTF-8
        62480
           [12400000].ballsStart.runs.runs
           Array.Object.RLE.RLE.Simple16
        59728
           [12400000].hitDataContactQuality.values.discriminants.runs
           Array.Object.Nullable.Enum.RLE.Simple16
        58296
           [12400000].strikesStart.runs.runs
           Array.Object.RLE.RLE.Simple16
        44140
           [12400000].venueCapacity.values.values
           Array.Object.Nullable.RLE.Prefix Varint
        42997
           [12400000].fielderDob.values
           Array.Object.Dictionary.UTF-8
        36689
           [12400000].venueId.values
           Array.Object.RLE.Prefix Varint
        34740
           [12400000].venueId.runs
           Array.Object.RLE.Simple16
        34720
           [12400000].venueName.runs
           Array.Object.RLE.Simple16
        34708
           [12400000].venueCity.runs
           Array.Object.RLE.Simple16
        33628
           [12400000].venueCapacity.values.runs
           Array.Object.Nullable.RLE.Simple16
        32408
           [12400000].sportCode.values
           Array.Object.RLE.UTF-8
        30980
           [12400000].venueLeftLine.values.runs
           Array.Object.Nullable.RLE.Simple16
        30697
           [12400000].venueRetrosheetId.values
           Array.Object.RLE.UTF-8
        30668
           [12400000].venueRightLine.values.runs
           Array.Object.Nullable.RLE.Simple16
        29128
           [12400000].venueCenter.values.runs
           Array.Object.Nullable.RLE.Simple16
        24186
           [12400000].sportAbbr.values
           Array.Object.RLE.UTF-8
        23140
           [12400000].venueLeftLine.values.values
           Array.Object.Nullable.RLE.Simple16
        22920
           [12400000].venueRightLine.values.values
           Array.Object.Nullable.RLE.Simple16
        21856
           [12400000].leagueName.values.runs
           Array.Object.Nullable.RLE.Simple16
        21312
           [12400000].venueCenter.values.values
           Array.Object.Nullable.RLE.Simple16
        16604
           [12400000].fielderMlbDebut.values
           Array.Object.Dictionary.UTF-8
        15320
           [12400000].sportAbbr.runs
           Array.Object.RLE.Simple16
        15320
           [12400000].sportId.runs
           Array.Object.RLE.Simple16
        15320
           [12400000].sportCode.runs
           Array.Object.RLE.Simple16
        15320
           [12400000].sportLevelOfPlay.runs
           Array.Object.RLE.Simple16
        15320
           [12400000].sportName.runs
           Array.Object.RLE.Simple16
        14388
           [12400000].gameDate.runs
           Array.Object.RLE.Simple16
        13779
           [12400000].fielderCollegeName.values.values
           Array.Object.Nullable.Dictionary.UTF-8
        13312
           [12400000].hitDataContactQuality.values.discriminants.values
           Array.Object.Nullable.Enum.RLE.Simple16
        10008
           [12400000].venueRetrosheetId.runs
           Array.Object.RLE.Simple16
        8708
           [12400000].venueRightCenter.values.runs
           Array.Object.Nullable.RLE.Simple16
        8564
           [12400000].venueLeftCenter.values.runs
           Array.Object.Nullable.RLE.Simple16
        6992
           [12400000].venueLeft.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        6416
           [12400000].venueRightCenter.values.values
           Array.Object.Nullable.RLE.Simple16
        6272
           [12400000].venueLeftCenter.values.values
           Array.Object.Nullable.RLE.Simple16
        6267
           [12400000].venueRight.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        4956
           [12400000].sportId.values
           Array.Object.RLE.Simple16
        4517
           [12400000].teamName.values
           Array.Object.Dictionary.UTF-8
        3892
           [12400000].venueRoof.values.discriminants.runs
           Array.Object.Nullable.Enum.RLE.Simple16
        3880
           [12400000].venueLeft.values.runs
           Array.Object.Nullable.RLE.Simple16
        3520
           [12400000].sportLevelOfPlay.values
           Array.Object.RLE.Simple16
        3157
           [12400000].venueCenter.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        3096
           [12400000].venueRight.values.runs
           Array.Object.Nullable.RLE.Simple16
        2898
           [12400000].venueRightLine.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        2898
           [12400000].venueLeftLine.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        2816
           [12400000].venueLeft.values.values
           Array.Object.Nullable.RLE.Simple16
        2676
           [12400000].venueSurface.values.discriminants.runs
           Array.Object.Nullable.Enum.Bool RLE.Prefix Varint
        2435
           [12400000].fielderHeightStr.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        2344
           [12400000].venueRoof.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        2329
           [12400000].hitDataContactQuality.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        2268
           [12400000].venueRight.values.values
           Array.Object.Nullable.RLE.Simple16
        2203
           [12400000].hitDataSprayAngle.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        2203
           [12400000].hitDataTrajectory.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        2203
           [12400000].hitDataCalcDistance.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        1859
           [12400000].batterName.indices.runs.runs
           Array.Object.Dictionary.RLE.RLE.Prefix Varint
        1820
           [12400000].batter.runs.runs
           Array.Object.RLE.RLE.Prefix Varint
        1809
           [12400000].fielderBirthCountry.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        1558
           [12400000].venueCapacity.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        1481
           [12400000].venueRightCenter.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        1481
           [12400000].venueLeftCenter.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        1020
           [12400000].batterName.indices.runs.values
           Array.Object.Dictionary.RLE.RLE.Simple16
        996
           [12400000].batter.runs.values
           Array.Object.RLE.RLE.Simple16
        866
           [12400000].fielderWeight.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        725
           [12400000].parentTeamName.values
           Array.Object.Dictionary.UTF-8
        520
           [12400000].venueRoof.values.discriminants.values
           Array.Object.Nullable.Enum.RLE.Simple16
        343
           [12400000].fielderBirthCountry.values.values
           Array.Object.Nullable.Dictionary.UTF-8
        148
           [12400000].fielderHeightStr.values.values
           Array.Object.Nullable.Dictionary.UTF-8
        22
           [12400000].sportAffilliation.discriminants.runs
           Array.Object.Enum.RLE.Prefix Varint
        9
           [12400000].fielderThrowsDesc.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        9
           [12400000].fielderThrowsCode.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        4
           [12400000].inPlayResult.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        4
           [12400000].venueSurface.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        4
           [12400000].sportAffilliation.discriminants.values
           Array.Object.Enum.RLE.Simple16
        4
           [12400000].leagueName.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        4
           [12400000].batterBatsDesc.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint
        4
           [12400000].gameType.discriminants.runs
           Array.Object.Enum.Bool RLE.Prefix Varint
        4
           [12400000].pitcherThrowsDesc.opt.runs
           Array.Object.Nullable.Bool RLE.Prefix Varint

Largest by type:
         90x 158310524 @ Simple16
         41x 48837477 @ Prefix Varint
         2x 15366783 @ Gorilla
         4x 6200000 @ Packed Boolean
         19x 1474283 @ UTF-8

Other: 2623
Total: 230191690
That3Percent commented 4 years ago

Nice. This is less than 1/3 the size of the last test. This is also a significant improvement over Tableau Hyper now. As expected, these changes made significant improvements to one of my benchmarks as well. You'd be surprised how common it is to have a CSV with very repetitive data.
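The arithmetic behind that comparison can be checked directly from the two "Total" lines posted above (771295928 bytes before, 230191690 bytes after). The constant and function names below are just for illustration.

```rust
/// Totals copied from the two diagnostics dumps in this thread.
const BEFORE_BYTES: u64 = 771_295_928;
const AFTER_BYTES: u64 = 230_191_690;

/// Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
fn mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}

fn main() {
    let ratio = AFTER_BYTES as f64 / BEFORE_BYTES as f64;
    println!("before: {:.1} MiB", mib(BEFORE_BYTES));
    println!("after:  {:.1} MiB", mib(AFTER_BYTES));
    println!("ratio:  {:.3}", ratio);
    // The new file is under one third of the previous size.
    assert!(ratio < 1.0 / 3.0);
}
```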

There are still major compression improvements coming down the pike. For example, delta compression might make a big difference for the fielder column (currently the largest). What does the data in that column look like?
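To illustrate what delta compression would mean here, this is a minimal sketch of the general technique, not a preview of how Tree-Buf will implement it. Each value is stored as the difference from the previous one, and zigzag encoding maps small negative differences to small unsigned numbers so a varint-style encoder can store them in fewer bytes. The function names and the sample ids are made up; whether it actually helps depends on how close neighboring ids are.

```rust
/// Map a signed delta to an unsigned value so that small magnitudes
/// (positive or negative) become small numbers: 0, -1, 1, -2 -> 0, 1, 2, 3.
fn zigzag(n: i64) -> u64 {
    ((n << 1) ^ (n >> 63)) as u64
}

/// Inverse of `zigzag`.
fn unzigzag(z: u64) -> i64 {
    ((z >> 1) as i64) ^ -((z & 1) as i64)
}

/// Hypothetical helper: store each id as the zigzagged difference from the
/// previous id (the first id is stored as a difference from 0).
fn delta_encode(column: &[u32]) -> Vec<u64> {
    let mut prev: i64 = 0;
    column
        .iter()
        .map(|&v| {
            let delta = v as i64 - prev;
            prev = v as i64;
            zigzag(delta)
        })
        .collect()
}

/// Hypothetical helper: rebuild the ids by summing the deltas back up.
fn delta_decode(deltas: &[u64]) -> Vec<u32> {
    let mut prev: i64 = 0;
    deltas
        .iter()
        .map(|&z| {
            prev += unzigzag(z);
            prev as u32
        })
        .collect()
}

fn main() {
    // Made-up player ids that happen to sit close together.
    let fielder = [605_141u32, 605_141, 605_137, 605_200, 605_141];
    let deltas = delta_encode(&fielder);
    println!("deltas = {:?}", deltas);
    assert_eq!(delta_decode(&deltas), fielder);
}
```

After the first entry, every delta in this sample fits in one byte instead of the three bytes per id the column costs now (37200000 bytes for 12400000 rows).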

Let's remember to be cognizant of the fact that the binary format for Tree-Buf is still changing in ways that may be backward incompatible. It seems like data persistence is an important part of your use-case so this could cause problems for you when you migrate to new versions. When you are actually ready to start saving files let's coordinate about what to do about that.

Looking forward to seeing how this fares on the main data set.