It is often the case that, locally, data in a column has limited support, while globally the support is much larger. If we're very lucky, this is captured by a model function, in which case we can use a compression scheme like FOR, FORLIN etc. But in other cases we just have a jumble of values which change over time. Example: the postal codes of students at a school.
For these cases it is useful to extend the Dictionary scheme so that once in a while the Dictionary can be replaced. Thus instead of the following inputs:
name
length
description
compressed_input
n
an index into the dictionary for each compressed element
dictionary_entries
m
an uncompressed value corresponding to each possible dictionary index
n
1
number of compressed (and decompressed) elements
m
1
number of dictionary entries
we'll have
name
length
description
compressed_input
n
an index into the dictionary for each compressed element
dictionary_entries
(varies)
an uncompressed value corresponding to each possible dictionary index
dictionary_application_start
d
the first position in the input (and the output) at which each dictionary comes into effect (the next dictionary start position is where the current dictionary goes out of effect)
dictionary_lengths
d
the number of dictionary entries in each of the dictionaries
n
1
number of compressed (and decompressed) elements
d
1
number of dictionaries
This should be useful for data such as that of the USDT-Ontime benchmark.
This would be partially or entirely obviated by supporting segmentation of the original dictionary scheme (issue #13) . Let's close it for now and see it it's still relevant after that has been resolved.
It is often the case that, locally, data in a column has limited support, while globally the support is much larger. If we're very lucky, this is captured by a model function, in which case we can use a compression scheme like FOR, FORLIN etc. But in other cases we just have a jumble of values which change over time. Example: the postal codes of students at a school.
For these cases it is useful to extend the Dictionary scheme so that once in a while the Dictionary can be replaced. Thus instead of the following inputs:
we'll have
This should be useful for data such as that of the USDT-Ontime benchmark.