Is this valid for index size reduction?

waddyano commented 2 years ago

This might really just apply to ice but thought it might be better to ask here

I have been experimenting with using bluge for indexing text files. Main fields are indexed and not stored. To me the index size is rather large so I have been looking at ways to reduce the size.

After cobbling together some code to print what data consumed space I found the bulk of it by far is Location lists in the posting data.

Looking at what the processing generated I tried just storing all location information as deltas. End as the offset from start and the next start as the offset from the previous end except for the first. Everything seems to stay in increasing sequence and is always processed in sequence so this seems to work fine. This is per location list. Since everything is varint's the smaller the integers the smaller the space needed for a little bit of arithmetic during read.

The second observation was that it didn't seem necessary to store the field number if every location - so I removed it and just picked it up from the dictionary the list belonged to.

I did create a version 2 format and locally have code which can read/write both formats in one module.

So far this seems to work and reduced my index size by 38% - is there some case where this will go wrong? And is their interest in me trying to put together an real change for this or I just create my own segment plugin.

mschoch commented 2 years ago

Thanks for looking into this. The delta encoding for locations makes sense, but I would want to review the changes to better understand the impact of the change. Regarding the field number, the reason it has to be stored for every location is to support searching composite fields, and being able to remember which original field it came from. It's a useful feature, but I'm certainly open to ideas to save space wasted in this area.

I would be open to reviewing the changes for the delta encoding, changing the field number may require further discussion.

waddyano commented 2 years ago

Thanks for the reply - will try and put something together for you to look at soon - and also see if I can study composite fields

waddyano commented 2 years ago

Running the tests helped me see the composite problem. Should have done that earlier. Since then I have coped with composite fields and thought of more improvements.

For reference the current state of my optimizations is this commit https://github.com/waddyano/ice/commit/a5ffbeed6da25e6f931276d5b881feb933b3e8f8

mschoch commented 2 years ago

@waddyano I took a quick look. I didn't review closely, but the approach looks good from reading the description.

One thing I see is that you have sections of code guarded with a condition like if Version == 2 {. We thought over time this would lead to a code-base that was difficult to maintain. Instead we started with the model that we would instead use the semantic version major number to represent the file format. This lets the blugelabs/ice repo support different versions, each on a branch in the repository, then we can tag and release versions as needed. If we follow this approach, master would only need to support v2 files. If we ever need to do an update to v1, we can branch off and release as needed.

waddyano commented 2 years ago

Thanks for the comments - I am used to the extra complexity with less code duplication but will adjust

blugelabs / bluge

Is this valid for index size reduction? #98