eyalroz / libgiddy

Giddy - A lightweight GPU decompression library
BSD 3-Clause "New" or "Revised" License
42 stars 9 forks source link

Support changing dictionaries for the Dictionary compression scheme #7

Closed eyalroz closed 7 years ago

eyalroz commented 7 years ago

It is often the case that, locally, data in a column has limited support, while globally the support is much larger. If we're very lucky, this is captured by a model function, in which case we can use a compression scheme like FOR, FORLIN etc. But in other cases we just have a jumble of values which change over time. Example: the postal codes of students at a school.

For these cases it is useful to extend the Dictionary scheme so that once in a while the Dictionary can be replaced. Thus instead of the following inputs:

name length description
compressed_input n an index into the dictionary for each compressed element
dictionary_entries m an uncompressed value corresponding to each possible dictionary index
n 1 number of compressed (and decompressed) elements
m 1 number of dictionary entries

we'll have

name length description
compressed_input n an index into the dictionary for each compressed element
dictionary_entries (varies) an uncompressed value corresponding to each possible dictionary index
dictionary_application_start d the first position in the input (and the output) at which each dictionary comes into effect (the next dictionary start position is where the current dictionary goes out of effect)
dictionary_lengths d the number of dictionary entries in each of the dictionaries
n 1 number of compressed (and decompressed) elements
d 1 number of dictionaries

This should be useful for data such as that of the USDT-Ontime benchmark.

eyalroz commented 7 years ago

This would be partially or entirely obviated by supporting segmentation of the original dictionary scheme (issue #13) . Let's close it for now and see it it's still relevant after that has been resolved.