Closed by elibenporat 4 years ago
There is a temporary workaround, but it's not ideal.
use tree_buf::Ignore;

pub enum GameType {
    // Each variant wraps the imported `Ignore` placeholder type.
    R(Ignore),
    F(Ignore),
    // etc.
}
I will prioritize this feature as soon as #3 is complete.
Just an aside, in Tree-Buf, the name of an enum is stored just once. So for a large file, it would be fine to fully spell out the game types like so:
pub enum GameType {
    RegularSeason,
    DivisionSeries,
    // etc
}
with only the slightest increase in file size.
There's a GameTypeDescription that looks exactly like that. The GameType in the MLB Stats API is simply the one letter, so the enum matches that to save a bunch of serde(rename)s.
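For anyone weighing the two options, here is a minimal sketch of keeping descriptive variant names while still accepting the one-letter codes, without any serde attributes. The `from_code` and `code` helpers are hypothetical names, and the pairing of "D" with the Division Series is an assumption about the API's codes rather than something stated in this thread.

```rust
/// Descriptive game types; only the two variants named in this thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GameType {
    RegularSeason,
    DivisionSeries,
}

impl GameType {
    /// Parses a one-letter code. The letter-to-variant pairing is an
    /// assumption made for illustration.
    pub fn from_code(code: &str) -> Option<GameType> {
        match code {
            "R" => Some(GameType::RegularSeason),
            "D" => Some(GameType::DivisionSeries),
            _ => None,
        }
    }

    /// Returns the one-letter code for a variant (same assumed pairing).
    pub fn code(self) -> &'static str {
        match self {
            GameType::RegularSeason => "R",
            GameType::DivisionSeries => "D",
        }
    }
}
```

The trade-off is one extra conversion step when reading API responses, in exchange for readable variant names everywhere else.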
I'll watch this issue for updates on this. I'm hoping this will allow me to store the entire data set in-memory, without having to write my own hand-crafted data structure. I'm really curious how close to Tableau Hyper you'll get.
As an aside, in #1 you mentioned that floating-point compression panics at runtime? A lot of my data looks like start_speed = 92.75, which would be an ideal use case for compression. It could also be a good candidate for delta compression, as the values fall within very tight ranges in a lot of these cases.
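To make the delta idea concrete, here is a rough sketch of why tightly clustered readings delta-encode well. It is not how Tree-Buf encodes floats; the helper names are made up, and scaling to integer hundredths assumes the readings carry two decimal places.

```rust
/// Converts readings to integer hundredths, e.g. 92.75 -> 9275.
/// Assumes two decimal places of precision in the input.
pub fn to_hundredths(values: &[f64]) -> Vec<i64> {
    values.iter().map(|v| (v * 100.0).round() as i64).collect()
}

/// Delta-encodes a sequence: the first output is the first value itself,
/// every later output is the difference from the previous value.
pub fn delta_encode(values: &[i64]) -> Vec<i64> {
    let mut prev = 0i64;
    values
        .iter()
        .map(|&v| {
            let delta = v - prev;
            prev = v;
            delta
        })
        .collect()
}

/// Reverses `delta_encode` by keeping a running sum.
pub fn delta_decode(deltas: &[i64]) -> Vec<i64> {
    let mut acc = 0i64;
    deltas
        .iter()
        .map(|&d| {
            acc += d;
            acc
        })
        .collect()
}
```

After the first element, the deltas for a tight range are small integers, which a variable-length integer encoding can store in fewer bytes than the raw values.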
I just closed #3, so this issue is now on deck.
Can you tell me more about the requirement to store the entire data set in memory? The compression techniques used in Tree-Buf are at odds with randomly accessing data. It would be possible to write something which streamed data out of a Tree-Buf file, but that's not yet implemented. There are a couple of formats that would be better for random access without parsing, but of course, they sacrifice compression to achieve that. FlatBuffers and Cap'n Proto come to mind.
As for the panic with floating-point compression, this was fixed with #3 which swapped out Gorilla for Zfp to do the compression of floats.
For baseball stats, what you probably want is actually a decimal type. Most decimal fractions, such as 92.7, aren't representable exactly by f64 or f32 (92.75 happens to be, because .75 is 1/2 + 1/4). I plan to add a decimal type for stats and finances, but it's not implemented yet. For now, yes, the floating-point compression should make the file much smaller. Be sure to experiment with the lossy compression here using something like tree_buf::write_with_options(data, &encode_options! { options::LossyFloatTolerance(-7) }), which would give you accuracy within 0.0078125, i.e. 2^-7 (7 binary digits of precision after the point). You can play with the formula here
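The tolerance arithmetic above can be checked with a few lines of plain Rust. This only illustrates the 2^exponent relationship described in the comment. It does not call Tree-Buf and does not reproduce what Zfp does internally, and `tolerance` and `quantize` are hypothetical helper names.

```rust
/// Absolute tolerance implied by a binary exponent: 2^exponent.
/// For exponent = -7 this is 1/128 = 0.0078125.
pub fn tolerance(exponent: i32) -> f64 {
    2f64.powi(exponent)
}

/// Rounds a value to the nearest multiple of 2^exponent.
/// Illustration only: a stand-in for "keep this many binary digits
/// after the point", not the compressor's actual algorithm.
pub fn quantize(value: f64, exponent: i32) -> f64 {
    let step = tolerance(exponent);
    (value / step).round() * step
}
```

Rounding to the nearest multiple keeps the error within half a step, so a quantized reading always lands inside the stated tolerance.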
Basically, as long as I can get the entire data set into an iterator, that's really all I need (I think). The most important feature is having a format that I can persist to disk that is space efficient. Anything else is a bonus. I want to be able to pitch this as "possible to fit all of baseball onto a standard laptop with 16 GB of memory". It will be orders of magnitude more performant than anything else out there, since I'm essentially competing with Python and R implementations.
Long story short, as long as I can load it all into an iterator which can fit into memory, that's more than enough. For example, I'll need to load it all in and then create a set of all GamePKs that I have, so that I can only pull new games. The primary bottleneck is waiting for the API to get data, so even if it takes 10-20 seconds to load and then build a set, it's still a minor cost compared to the network cost.
Basically, I would load the Tree-Buf into memory. Then, I'd pull out a set of all GamePKs. Pull new game data, process it and then append the original Tree-Buf + new data into a brand new Tree-Buf. If it can be appended to like a CSV, great, if I have to erase the old one and write a new one from scratch, that's perfectly fine as well.
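Roughly, that workflow could look like the sketch below. The `Pitch` row type, its fields, and the function names are all hypothetical, and the actual Tree-Buf read and write calls are left out. This only shows the set-building and merge steps in between.

```rust
use std::collections::HashSet;

/// Hypothetical row type; the real schema is not shown in this thread.
#[derive(Debug, Clone, PartialEq)]
pub struct Pitch {
    pub game_pk: u32,
    pub start_speed: f64,
}

/// Collects the distinct GamePKs already present in the loaded data.
pub fn known_game_pks(rows: &[Pitch]) -> HashSet<u32> {
    rows.iter().map(|row| row.game_pk).collect()
}

/// Returns the scheduled GamePKs that are not stored yet, in schedule order.
pub fn missing_game_pks(known: &HashSet<u32>, schedule: &[u32]) -> Vec<u32> {
    schedule
        .iter()
        .copied()
        .filter(|pk| !known.contains(pk))
        .collect()
}

/// Appends freshly fetched rows to the existing rows; the combined vector
/// is what would be written back out as a brand new file.
pub fn merge(mut existing: Vec<Pitch>, new_rows: Vec<Pitch>) -> Vec<Pitch> {
    existing.extend(new_rows);
    existing
}
```

Only the GamePKs returned by `missing_game_pks` would need to go over the network, which is where the time is spent anyway.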
I think LossyFloatTolerance should be perfect for these data. All the numbers are approximations from imperfect capturing devices anyway, so a small, bounded error is acceptable.
@elibenporat I've moved the broader discussion to #6 so we can keep this issue focused just on unit enums, and we have a place to discuss the best way to use Tree-Buf within BOSS.
Implemented with Macros: Support unit enums
Fails for the Write macro, with the following error message: Unit enums not yet supported by tree-buf write
Is there a workaround? I have a lot of these types of enums. Working off master.