That3Percent / tree-buf

An experimental serialization system written in Rust
MIT License

Unit enums not yet supported by tree-buf write #2

Closed: elibenporat closed this issue 4 years ago

elibenporat commented 4 years ago
use tree_buf::prelude::*;

// Deriving Write is what triggers the error quoted below.
#[derive(Write)]
pub enum GameType {
    /// Regular Season
    R,
    /// First Round
    F,
    /// Division Series
    D,
    /// League Championship Series
    L,
    /// World Series
    W,
    /// Championship
    C,
    /// Nineteenth Century Series
    N,
    /// All Star Game
    A,
    /// Spring Training
    S,
    /// Exhibition Game
    E,
    /// Intrasquad
    I,
    /// Playoffs
    P,
}

This fails for the Write derive macro with the following error message: Unit enums not yet supported by tree-buf write

Is there a workaround? I have a lot of enums like this. I'm working off master.

That3Percent commented 4 years ago

There is a temporary workaround, but it's not ideal.

use tree_buf::Ignored;

#[derive(Write)]
pub enum GameType {
    R(Ignored),
    F(Ignored),
    // etc.
}
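
For illustration, here is a fuller sketch of the workaround in context. It assumes the Read/Write derives and top-level write/read functions from tree_buf::prelude, and that Ignored is a unit value that can be constructed directly; treat it as a sketch rather than a verified program.

use tree_buf::prelude::*;
use tree_buf::Ignored;

#[derive(Read, Write)]
pub enum GameType {
    R(Ignored),
    F(Ignored),
    // etc.
}

fn main() {
    let games = vec![GameType::R(Ignored), GameType::F(Ignored)];
    // Round-trip through Tree-Buf to check that the workaround encodes and decodes.
    let bytes = write(&games);
    let copy: Vec<GameType> = read(&bytes).unwrap();
    assert_eq!(copy.len(), games.len());
}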

I will prioritize this feature as soon as #3 is complete.

As an aside: in Tree-Buf, each enum variant name is stored just once per file. So for a large file, it would be fine to fully spell out the game types like so:

pub enum GameType {
    RegularSeason,
    DivisionSeries,
    // etc.
}

with only the slightest increase in file size.
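
A rough way to check that claim, sketched under the same assumptions as above (Ignored wrappers stand in for unit variants, since those don't encode yet at this point in the thread):

use tree_buf::prelude::*;
use tree_buf::Ignored;

#[derive(Write)]
pub enum GameType {
    RegularSeason(Ignored),
    DivisionSeries(Ignored),
}

fn main() {
    // 10,000 values, but the variant name "RegularSeason" should appear in
    // the output roughly once, so the long name barely affects file size.
    let season: Vec<GameType> = (0..10_000)
        .map(|_| GameType::RegularSeason(Ignored))
        .collect();
    println!("{} bytes", write(&season).len());
}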

elibenporat commented 4 years ago

There's a GameTypeDescription that looks exactly like that. The GameType in the MLB Stats API is simply the one-letter code, so the enum matches that to avoid a bunch of #[serde(rename)] attributes.
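
For context, the spelled-out names would need something like this under serde (a hypothetical sketch; the real BOSS types may differ):

use serde::Deserialize;

// One rename per variant to map back to the API's one-letter codes.
#[derive(Deserialize)]
pub enum GameType {
    #[serde(rename = "R")]
    RegularSeason,
    #[serde(rename = "D")]
    DivisionSeries,
    // ... and so on for the remaining variants
}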

I'll watch this issue for updates. I'm hoping this will let me store the entire data set in memory without having to write my own hand-crafted data structure. I'm really curious how close to Tableau Hyper you'll get.

As an aside, in #1 you mentioned that floating-point compression panics at runtime? A lot of my data looks like start_speed = 92.75, which would be an ideal use case for compression. It could also be a good candidate for delta compression, since many of these values fall within very tight ranges.

That3Percent commented 4 years ago

I just closed #3, so this issue is now on deck.

Can you tell me more about the requirement to store the entire data set in memory? The compression techniques used in Tree-Buf are at odds with randomly accessing data. It would be possible to write something that streamed data out of a Tree-Buf file, but that's not yet implemented. There are a couple of formats that would be better for random access without parsing, but of course they sacrifice compression to achieve that; FlatBuffers and Cap'n Proto come to mind.

As for the panic with floating-point compression, this was fixed in #3, which swapped out Gorilla for Zfp to compress floats.

For baseball stats, what you probably want is actually a decimal type, since most decimal values can't be represented exactly by f64 or f32. I plan to add a decimal type for stats and finances, but it's not implemented yet. For now, yes, the floating-point compression should make the file much smaller. Be sure to experiment with the lossy compression here using something like tree_buf::write_with_options(data, &encode_options! { options::LossyFloatTolerance(-7) }), which would give you accuracy within 0.0078125 (7 binary points of precision). You can play with the formula here.
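
Spelled out as a complete program, that call might look like the following. This is a sketch assuming the encode_options! macro and the options module are importable from the crate root, as the snippet above suggests, and that write_with_options takes the data by reference like write does.

use tree_buf::prelude::*;
use tree_buf::{encode_options, options};

#[derive(Write)]
pub struct Pitch {
    pub start_speed: f64, // e.g. 92.75
}

fn main() {
    let pitches = vec![Pitch { start_speed: 92.75 }];
    // LossyFloatTolerance(-7) keeps each value within 2^-7 = 0.0078125
    // of the original, trading precision for a smaller file.
    let options = encode_options! { options::LossyFloatTolerance(-7) };
    let bytes = tree_buf::write_with_options(&pitches, &options);
    println!("{} bytes", bytes.len());
}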

elibenporat commented 4 years ago

Basically, as long as I can get the entire data set into an iterator, that's really all I need (I think). The most important feature is having a format I can persist to disk that is space-efficient. Anything else is a bonus. I want to be able to pitch this as "possible to fit all of baseball onto a standard laptop with 16 GB of memory". It will be orders of magnitude more performant than anything else out there, since I'm essentially competing with Python and R implementations.

Long story short, as long as I can load it all into an iterator that fits into memory, that's more than enough. For example, I'll need to load it all in and then build a set of all GamePKs I already have, so that I only pull new games. The primary bottleneck is waiting for the API to return data, so even if it takes 10-20 seconds to load and then build the set, it's still a minor cost compared to the network cost.

Basically, I would load the Tree-Buf into memory, pull out a set of all GamePKs, pull new game data, process it, and then write the original Tree-Buf plus the new data into a brand-new Tree-Buf. If it can be appended to like a CSV, great; if I have to erase the old one and write a new one from scratch, that's perfectly fine as well.
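
That loop might look roughly like the following. This is a sketch with a hypothetical Game record and game_pk field; the real types and fetch logic live in BOSS, and it assumes tree-buf's prelude read/write functions.

use std::collections::HashSet;
use std::fs;
use tree_buf::prelude::*;

#[derive(Read, Write)]
pub struct Game {
    pub game_pk: u32, // hypothetical key field
    // ... the rest of the per-game stats
}

fn update(path: &str, fetch_new: impl Fn(&HashSet<u32>) -> Vec<Game>) -> std::io::Result<()> {
    // Load the whole existing data set into memory, or start fresh.
    let mut games: Vec<Game> = match fs::read(path) {
        Ok(bytes) => read(&bytes).expect("valid Tree-Buf file"),
        Err(_) => Vec::new(),
    };
    // Build the set of GamePKs already stored so only new games get pulled.
    let seen: HashSet<u32> = games.iter().map(|g| g.game_pk).collect();
    games.extend(fetch_new(&seen));
    // The output is rewritten from scratch rather than appended like a CSV.
    fs::write(path, write(&games))
}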

I think LossyFloatTolerance should be perfect for this data; all the numbers are approximations from imperfect capture devices anyway.

That3Percent commented 4 years ago

@elibenporat I've moved the broader discussion to #6 so we can keep this issue focused just on unit enums and so we have a place to discuss the best way to use Tree-Buf within BOSS.

That3Percent commented 4 years ago

Implemented with "Macros: Support unit enums".
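
With that change, the original enum from this issue should derive Write directly, with no Ignored wrappers (a sketch, under the same prelude assumptions as above):

use tree_buf::prelude::*;

#[derive(Read, Write)]
pub enum GameType {
    R,
    F,
    // ... the remaining unit variants from the issue
}

fn main() {
    // Unit variants now encode and decode without a payload.
    let bytes = write(&vec![GameType::R, GameType::F]);
    let copy: Vec<GameType> = read(&bytes).unwrap();
    assert_eq!(copy.len(), 2);
}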