Currently we use a binary format that was created ad hoc by just throwing things at scodec until it worked.
Admittedly, I think it's awesome that it works at all, and scodec is very fun to use.
We should actually design a binary format and try to stick to it.
Pros:
Enable other "client" implementations
perhaps pure handcrafted JS if we're worried about bundle size
or pure wasm for bundle size + performance
Help ensure compatibility
future clients should be able to read old indexes, since library documentation indexes will have been written with whatever version of protosearch was out at the time
Cons:
I have no idea what I'm doing. Designing a binary format seems hard.
Why are we using binary at all? Why not gzip some JSON?
Using JSON means we get to leverage existing JSON tools: encoders/decoders, and jq for inspecting the file
I've long wanted an FST for the terms list, and Lucene encodes this into a byte array, so I've always assumed we'd need to support binary
we can likely save more space
we can likely get more performance by enabling readers to jump to various byte offsets in the file depending on what they need
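To make the space argument a bit more concrete, here is a minimal sketch of the two tricks Lucene-style formats usually rely on for postings: variable-byte integers and storing gaps between DocIDs instead of the DocIDs themselves. The function names and the exact layout (a count followed by gaps, LEB128-style low-order group first) are assumptions for illustration, not what protosearch does today.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int using 7 bits per byte, low-order group first."""
    if n < 0:
        raise ValueError("varint requires a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_doc_ids(doc_ids: list[int]) -> bytes:
    """Encode an ascending DocID list as a varint count plus varint gaps."""
    out = bytearray(encode_varint(len(doc_ids)))
    prev = 0
    for doc_id in doc_ids:
        # A descending input produces a negative gap and raises ValueError.
        out += encode_varint(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_doc_ids(buf: bytes) -> list[int]:
    """Inverse of encode_doc_ids: rebuild DocIDs by summing the gaps."""
    count, pos = decode_varint(buf)
    doc_ids = []
    prev = 0
    for _ in range(count):
        gap, pos = decode_varint(buf, pos)
        prev += gap
        doc_ids.append(prev)
    return doc_ids
```

With this layout, `encode_doc_ids([3, 7, 8, 300, 100000])` takes 9 bytes, because small gaps fit in a single byte each. Whether that actually beats gzipped JSON on real indexes is something we would have to measure.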
Design Notes
include some magic bytes at the front to identify the file type
include index file format version
so we can evolve the format without breaking things
include a metadata section that records the version of protosearch that wrote the index
for better debugging
what to do about compression (gzip, zstd, etc)?
the whole index should be optionally compressed with gzip, zstd, or whatever compression algorithm the user desires
certain data structures will already be "compressed" in the sense that we may use tricks like variable-byte integer encodings and storing the deltas between consecutive DocIDs rather than the DocIDs themselves
it's possible that we want stored fields to be compressed even inside the index file
probably the main goal here is to still enable jumplists over the stored fields
a chunked/block structure
each block indicates its type and length
allows extending an index with additional data that readers could optionally ignore
this probably makes the most sense for stored fields which could be quite long
unsure how this works with the compression point above
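As a starting point for discussion, the header and block ideas in these notes could be sketched roughly like this. Everything concrete here is an assumption made up for the example: the `PSIX` magic bytes, the 16-bit format version, the 1-byte block type with a 4-byte big-endian length, and the block type ids. The point is only to show that a reader can check the file type and version up front, then skip any block it does not care about or does not recognise.

```python
import io
import struct

MAGIC = b"PSIX"              # hypothetical magic bytes
FORMAT_VERSION = 1           # hypothetical index file format version
BLOCK_METADATA = 0x01        # hypothetical block type ids
BLOCK_TERMS = 0x02
BLOCK_STORED_FIELDS = 0x03

_HEADER = struct.Struct(">4sH")  # magic, format version (big-endian)
_BLOCK = struct.Struct(">BI")    # block type, payload length in bytes


def write_index(blocks: list[tuple[int, bytes]]) -> bytes:
    """Write the header followed by each (type, payload) block in order."""
    out = io.BytesIO()
    out.write(_HEADER.pack(MAGIC, FORMAT_VERSION))
    for block_type, payload in blocks:
        out.write(_BLOCK.pack(block_type, len(payload)))
        out.write(payload)
    return out.getvalue()


def read_index(data: bytes, wanted: set[int]) -> dict[int, bytes]:
    """Return payloads for the wanted block types, skipping everything else."""
    if len(data) < _HEADER.size:
        raise ValueError("truncated header")
    magic, version = _HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("not an index file")
    if version > FORMAT_VERSION:
        raise ValueError(f"unsupported format version {version}")
    found = {}
    pos = _HEADER.size
    while pos < len(data):
        if pos + _BLOCK.size > len(data):
            raise ValueError("truncated block header")
        block_type, length = _BLOCK.unpack_from(data, pos)
        pos += _BLOCK.size
        if pos + length > len(data):
            raise ValueError("truncated block payload")
        if block_type in wanted:
            found[block_type] = data[pos:pos + length]
        pos += length  # jump over the payload without decoding it
    return found
```

A reader that only needs the terms list would call `read_index(data, {BLOCK_TERMS})` and never touch the stored fields. Unknown block types written by a newer protosearch are skipped the same way, which is what would let old readers ignore additional data.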