There is a nice, CPU-efficient input format implemented in VW (Flatbuffers).
Command line example: vw --cb_force_legacy --cb 2 -d train-sets/rcv1_raw_cb_small.fb --flatbuffer
Unfortunately, this feature is off by default. You can turn it on with the corresponding CMake option when you build VW:
https://github.com/VowpalWabbit/vowpal_wabbit/blob/de5230316aa60f77704a08cb8d95a175cd50fe67/CMakeLists.txt#L53
Here is the schema: https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/vowpalwabbit/fb_parser/schema/example.fbs
Helpful reference for converting to Flatbuffers: https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/test/runtests_flatbuffer_converter.py
There are extensive tests for this feature: https://github.com/VowpalWabbit/vowpal_wabbit/blob/de5230316aa60f77704a08cb8d95a175cd50fe67/test/core.vwtest.json#L3100
If you feel you can contribute some time to the project, we welcome your involvement. The following link is a PR that's almost ready to go; it defines a more compact Flatbuffer format with similar performance. I am happy to guide you through it if you would like to push forward with it. C++ experience would be very helpful. https://github.com/VowpalWabbit/vowpal_wabbit/pulls?q=is%3Apr+is%3Aclosed+flatbuffer
Please note that the input format in the PR above is where we eventually want to end up.
Feel free to reach out to me if you would like to push this item forward for the benefit of the community.
Closing for now; please feel free to re-open.
Add support for binary input files
Short description
As far as I can tell from the documentation, there is no support for binary files as the input source for VW. I have VW integrated into a fairly extensive research system that generates massive files (TB+ sizes). The data set generation process is geared towards regressions, where the data is very well defined and consistently formatted. The bottleneck comes from the feature generation process having to write floats to text files (which is a notorious slowdown). Furthermore, VW then just reads this text back in and parses the string representation of each float back into a float. Support for reading inputs from a binary file, with a schema flag describing the layout, would allow a substantial speed-up for a lot of use cases.
How this suggestion will help you/others
This would lead to a much faster research process, and a speed-up on the VW side as well, since parsing inputs would no longer be required. Raw casting of byte buffers to floats usually has no measurable overhead in C++. It wouldn't be a feature all users of VW need, but there is definitely a subset that would gain a lot from this capability.
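As a rough illustration of the difference (a standalone sketch, not VW code, with made-up file names), the following C++ snippet round-trips the same float features through a text file, which forces float-to-string and string-to-float conversions, and through a raw binary file, which is essentially a memcpy:

// Standalone illustration (not part of VW): text round-trip vs. raw binary
// round-trip for a vector of float features. The binary path avoids
// float <-> string conversion entirely.
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::vector<float> features = {0.25f, 1.5f, -3.125f, 42.0f};

    // Text path: every float is formatted on write and parsed on read.
    {
        std::ofstream out("features.txt");
        for (float f : features) out << f << ' ';
        out << '\n';
    }
    std::vector<float> parsed;
    {
        std::ifstream in("features.txt");
        std::string line;
        std::getline(in, line);
        std::istringstream ss(line);
        float f;
        while (ss >> f) parsed.push_back(f);
    }

    // Binary path: the float array is written and read back as raw bytes.
    {
        std::ofstream out("features.bin", std::ios::binary);
        out.write(reinterpret_cast<const char*>(features.data()),
                  features.size() * sizeof(float));
    }
    std::vector<float> raw(features.size());
    {
        std::ifstream in("features.bin", std::ios::binary);
        in.read(reinterpret_cast<char*>(raw.data()), raw.size() * sizeof(float));
    }

    std::printf("parsed[2]=%f raw[2]=%f\n", parsed[2], raw[2]);
    return 0;
}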
Possible solution/implementation details
New flags: a flag to enable binary input, plus a schema flag describing the record layout (as mentioned above).
This could be implemented so that lines are still delimited by '\n', but the contents leading up to the delimiter are a raw byte vector. It would only make sense for logistic/linear regression type tasks where the data is highly consistent from one record to the next.
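To make the proposal a bit more concrete, here is a hypothetical sketch of such a reader (not an existing VW parser or flag; the record width and file name are made up and would presumably come from the suggested schema flag). Since a '\n' byte can also occur inside raw float data, the sketch relies on the fixed record length and only checks the trailing delimiter:

// Hypothetical reader for the proposed binary line format (illustration only):
// each record is a fixed number of raw floats followed by a '\n' delimiter.
#include <cstdio>
#include <fstream>
#include <vector>

// floats_per_record would come from the suggested schema flag.
bool read_record(std::ifstream& in, std::size_t floats_per_record,
                 std::vector<float>& record) {
    record.resize(floats_per_record);
    if (!in.read(reinterpret_cast<char*>(record.data()),
                 floats_per_record * sizeof(float)))
        return false;  // EOF or truncated record
    char delimiter = 0;
    in.get(delimiter);
    return delimiter == '\n';  // record must end with the delimiter
}

int main() {
    // Assumes "data.bin" was produced by the feature generator in this layout.
    std::ifstream in("data.bin", std::ios::binary);
    std::size_t floats_per_record = 4;  // stand-in for the schema flag value
    std::vector<float> record;
    while (read_record(in, floats_per_record, record))
        std::printf("first value=%f\n", record[0]);
    return 0;
}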