Open irevoire opened 1 month ago
Sorry for the delay. I'm not sure I understand the question. Of course you can use the same probabilistic model to independently compress/decompress several messages. And yes, in this case you have to keep the model in memory only once, since compression and decompression don't consume the model; they only need a reference to it. Admittedly, this is a bit obscured by the generic nature of the API: for example, the method `AnsCoder::encode_iid_symbols_reverse` takes a generic argument `model` whose type has to implement `EncoderModel`, so it may indeed seem like you'd have to provide a fresh entropy model every time. But there's a blanket implementation of `EncoderModel` for any reference `&M` where `M` implements `EncoderModel`, so you only need a single owned `EncoderModel` and can hand out as many shared references to it as you like. (Some small entropy models also implement `Copy`; for those, it's usually more performant to pass them by value.)
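The blanket-implementation pattern described above can be illustrated in plain Rust. Note that the trait and types below (`SymbolModel`, `Uniform`, `encode_with`) are hypothetical stand-ins, not constriction's actual definitions; this is just a sketch of the mechanism:

```rust
// Hypothetical stand-in for a trait like `EncoderModel`:
trait SymbolModel {
    fn quantized_probability(&self, symbol: usize) -> u32;
}

// Blanket implementation: any shared reference to a model is itself a model.
impl<M: SymbolModel + ?Sized> SymbolModel for &M {
    fn quantized_probability(&self, symbol: usize) -> u32 {
        (**self).quantized_probability(symbol)
    }
}

struct Uniform {
    num_symbols: u32,
}

impl SymbolModel for Uniform {
    fn quantized_probability(&self, _symbol: usize) -> u32 {
        (1u32 << 24) / self.num_symbols
    }
}

// An "encoder" entry point that takes its model argument by value ...
fn encode_with<M: SymbolModel>(model: M, symbol: usize) -> u32 {
    model.quantized_probability(symbol)
}

fn main() {
    let model = Uniform { num_symbols: 4 };
    // ... can still be called many times with one owned model, because
    // `&Uniform` also implements `SymbolModel` via the blanket impl:
    let a = encode_with(&model, 0);
    let b = encode_with(&model, 1);
    assert_eq!(a, b);
}
```

So even though the generic bound reads as if it consumed a model, passing `&model` reuses a single owned model across arbitrarily many encode/decode calls.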
I'm attaching an example of a full compression/decompression round trip below. But in brief, if I understand correctly what you're trying to achieve, your struct for the compressed representation of `Index` should probably look something like this:
```rust
struct CompressedIndex {
    doc: Vec<Vec<u32>>, // Note that constriction represents bit strings in 32-bit chunks by default for performance reasons.
    probs: DefaultContiguousCategoricalEntropyModel, // (for example; you can use any entropy model in `constriction::stream::model`)
    alphabet: Vec<char>, // List of all distinct characters that can appear in a message (see full example below).
}
```
And there's nothing that holds you back from encoding or decoding each entry of `doc` independently, using the shared entropy model `probs` and the shared `alphabet` (see the full round-trip example below).
> From what I've seen, it seems like we need to provide the probabilities for the symbol we're currently compressing.
I'm not sure I understand. Of course you have to provide the probabilities any time you encode or decode a symbol (in fact, you have to provide the entire entropy model, not just the probability of the specific symbol you're currently encoding or decoding). That's not a limitation of `constriction`; it's a fundamental theoretical limitation of source coding: one cannot (losslessly) compress data without a probabilistic model of the data source ("source coding theorem").
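To make that bound concrete, here is a small std-only Rust sketch (not part of constriction) that computes the Shannon entropy of the categorical distribution used in the round-trip example below; by the source coding theorem, this is the smallest expected number of bits per symbol that any lossless code can achieve for that source:

```rust
/// Shannon entropy (in bits per symbol) of a distribution given as
/// unnormalized counts. No lossless code can achieve a smaller
/// expected code length per symbol than this.
fn entropy_bits(counts: &[f64]) -> f64 {
    let total: f64 = counts.iter().sum();
    counts
        .iter()
        .filter(|&&c| c > 0.0)
        .map(|&c| {
            let p = c / total;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // The same counts as in the round-trip test below (last entry = EOF token).
    let counts = [1., 2., 3., 4., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2.];
    let h = entropy_bits(&counts);
    println!("entropy: {h:.3} bits/symbol"); // ≈ 3.698 bits/symbol
    // Compare with a uniform model over the same 15 symbols, which would
    // need log2(15) ≈ 3.907 bits/symbol:
    assert!(h < (counts.len() as f64).log2());
}
```

A good entropy coder like ANS gets very close to this bound on long messages, but only because it is given the model; with a worse-fitting model, the expected rate grows by the cross-entropy gap.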
```rust
use std::collections::HashMap;

use constriction::{
    backends::Cursor,
    stream::{
        model::DefaultContiguousCategoricalEntropyModel, stack::DefaultAnsCoder, Decode, Encode,
    },
    UnwrapInfallible,
};

#[derive(Debug, PartialEq, Eq)]
struct UncompressedIndex {
    doc: Vec<String>,
}

#[derive(Debug)]
struct CompressedIndex {
    doc: Vec<Vec<u32>>, // Note that constriction represents bit strings in 32-bit chunks by default for performance reasons.
    probs: DefaultContiguousCategoricalEntropyModel, // (for example; you can use any entropy model in `constriction::stream::model`)
    alphabet: Vec<char>, // List of all distinct characters that can appear in a message.
}

impl UncompressedIndex {
    fn compress(
        &self,
        probs: DefaultContiguousCategoricalEntropyModel,
        alphabet: Vec<char>,
    ) -> CompressedIndex {
        let inverse_alphabet = alphabet
            .iter()
            .enumerate()
            .map(|(index, &character)| (character, index))
            .collect::<HashMap<_, _>>();
        let doc = self
            .doc
            .iter()
            .map(|message| {
                let mut coder = DefaultAnsCoder::new();
                // Start with a special EOF symbol so that `CompressedIndex::decompress` knows when to terminate:
                coder.encode_symbol(alphabet.len(), &probs).unwrap();
                // Then encode the message, character by character, in reverse order:
                for character in message.chars().rev() {
                    let char_index = *inverse_alphabet.get(&character).unwrap();
                    coder.encode_symbol(char_index, &probs).unwrap();
                }
                coder.into_compressed().unwrap_infallible()
            })
            .collect();

        CompressedIndex {
            doc,
            probs,
            alphabet,
        }
    }
}

impl CompressedIndex {
    fn decompress(&self) -> UncompressedIndex {
        let doc = self
            .doc
            .iter()
            .map(|data| {
                let mut coder =
                    DefaultAnsCoder::from_compressed(Cursor::new_at_write_end(&data[..])).unwrap();
                core::iter::from_fn(|| {
                    let symbol_id = coder.decode_symbol(&self.probs).unwrap();
                    self.alphabet.get(symbol_id).copied() // Returns `None` if `symbol_id` is the EOF token, which terminates the iterator.
                })
                .collect()
            })
            .collect();

        UncompressedIndex { doc }
    }
}

#[test]
fn round_trip() {
    let uncompressed = UncompressedIndex {
        doc: vec!["Hello, World!".to_string(), "Goodbye.".to_string()],
    };
    let alphabet = vec![
        'H', 'e', 'l', 'o', ',', ' ', 'W', 'r', 'd', '!', 'G', 'b', 'y', '.',
    ];
    let counts = [1., 2., 3., 4., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2.]; // The last entry is for the EOF token.
    let probs =
        DefaultContiguousCategoricalEntropyModel::from_floating_point_probabilities(&counts)
            .unwrap();

    let compressed = uncompressed.compress(probs, alphabet);
    let reconstructed = compressed.decompress();
    assert_eq!(uncompressed, reconstructed);
}
```
Hello,

In my use case, I have a `Vec` of `String` in a structure that I want to compress, but I need to keep O(1) access to the elements in the `Vec`. So I was thinking about using tANS and storing my probability table on the side:

Before: *[diagram]* After: *[diagram]*

Is this library supposed to support this? From what I've seen, it seems like we need to provide the probabilities for the symbol we're currently compressing.