VowpalWabbit / vowpal_wabbit

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
https://vowpalwabbit.org

Binary File Inputs #4637

Closed order-flow-labs closed 1 year ago

order-flow-labs commented 1 year ago

Add support for allowing binary input files

Short description

As far as I can tell from the documentation, there is no support for binary files as the input source for VW. I have VW integrated into a fairly extensive research system that generates massive files (TB+ sizes). The data set generation process is geared towards regressions where the data is very well defined and consistently formatted. The bottleneck comes from the feature generation process having to write floats to text files (a notorious slowdown). Furthermore, VW just reads this text back in and parses the string representation of each float into a float again. Support for reading inputs from a binary file, with a schema flag, would allow a substantial speed-up for a lot of use cases.

How this suggestion will help you/others

This would lead to a much faster research process, and a speed-up on the VW side as well, since input parsing would not be required. Raw casting of byte buffers to floats usually has no measurable overhead in C++. It wouldn't be a feature every VW user needs, but there is definitely a subset who would gain a lot from this capability.

Possible solution/implementation details

new flags:

--binary_file=true/false
--binary_file_schema=JSON schema describing whether the feature weight is included, the names of each column, etc.

This could be implemented with lines still delimited by '\n', where the contents leading up to the delimiter are a raw byte vector. It would only make sense for logistic/linear-regression-type tasks where the data is highly consistent from one record to the next.
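A minimal sketch of the proposed framing, assuming the schema fixes the feature count so every record has the same byte length (the `Record` struct and `read_record` are hypothetical names, not existing VW APIs):

```cpp
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical record layout under the suggested schema: one float label
// followed by a fixed number of float features, stored as raw bytes in the
// file's native byte order, with '\n' terminating each record as proposed.
struct Record {
    float label;
    std::vector<float> features;
};

// Read one fixed-width record; returns false at end of stream. The payload
// size is known from the schema, so no scanning for delimiters is needed.
bool read_record(std::istream& in, std::size_t n_features, Record& out) {
    const std::size_t payload = (1 + n_features) * sizeof(float);
    std::vector<char> buf(payload);
    if (!in.read(buf.data(), static_cast<std::streamsize>(payload))) return false;
    std::memcpy(&out.label, buf.data(), sizeof(float));
    out.features.resize(n_features);
    std::memcpy(out.features.data(), buf.data() + sizeof(float),
                n_features * sizeof(float));
    in.ignore(1);  // consume the '\n' record delimiter
    return true;
}
```

One caveat with this framing: raw float bytes can themselves contain the byte 0x0A, so the '\n' here acts only as a sanity-check terminator at a known offset, not as a delimiter that can be searched for.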

rajan-chari commented 1 year ago

There is a nice CPU-efficient input format already implemented in VW: Flatbuffers.

Command line example: vw --cb_force_legacy --cb 2 -d train-sets/rcv1_raw_cb_small.fb --flatbuffer

Unfortunately, this feature is off by default. You can turn it on when you build VW.
https://github.com/VowpalWabbit/vowpal_wabbit/blob/de5230316aa60f77704a08cb8d95a175cd50fe67/CMakeLists.txt#L53

Here is the schema: https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/vowpalwabbit/fb_parser/schema/example.fbs

Helpful reference. Converting to flatbuffers https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/test/runtests_flatbuffer_converter.py

There are extensive tests for this feature: https://github.com/VowpalWabbit/vowpal_wabbit/blob/de5230316aa60f77704a08cb8d95a175cd50fe67/test/core.vwtest.json#L3100

If you feel you can contribute some time to the project, we welcome your involvement. The following link is a PR that is almost ready to go; it defines a more compact Flatbuffer format with similar performance. I am happy to guide you through it if you would like to push it forward. C++ experience would be very helpful. https://github.com/VowpalWabbit/vowpal_wabbit/pulls?q=is%3Apr+is%3Aclosed+flatbuffer

Please note that the input format in the PR above is where we want to eventually end up.

rajan-chari commented 1 year ago

Feel free to reach out to me if you would like to push this item forward for the benefit of the community.

olgavrou commented 1 year ago

Closing for now; please feel free to re-open.