This is 4.5× faster than positions for my use case in the following representative benchmark.
Use case: to deal with serde Deserializers that do not support producing borrowed data (&'de str), we can simply look up every string produced by the deserializer in the original input, turning each transient string (&str) that is found anywhere in the input into a borrowed string (&'de str). Specifically for serde_yaml, the performance of SuffixTable::new does not matter because we can run it in parallel with yaml-rust's Parser::load step, which will always take longer than the SuffixTable construction.
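A minimal sketch of the trick, with std's str::find standing in for the suffix table lookup (the function name borrow_back is made up for illustration): once the transient string is located in the original input, reslicing the input yields a &str tied to the input's lifetime rather than the deserializer's.

```rust
// Hedged sketch: turn a transient &str into a slice borrowed from the
// original input by locating it there. A suffix table would replace the
// linear str::find here; the overall shape is the same.
fn borrow_back<'de>(input: &'de str, transient: &str) -> Option<&'de str> {
    let start = input.find(transient)?;
    Some(&input[start..start + transient.len()])
}

fn main() {
    let input = String::from("first_name: Katie\nscreen_name: NASA");
    // Pretend the deserializer handed us a short-lived copy of input data.
    let transient = String::from("NASA");
    let borrowed: &str = borrow_back(&input, &transient).unwrap();
    drop(transient); // the borrowed slice outlives the transient string
    assert_eq!(borrowed, "NASA");
}
```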
[If you happen to know: is a suffix array even the right data structure for this use case? A key characteristic of the use case is that Σm = O(n), i.e. I only ever look up non-overlapping substrings of the original text, so the sum of the lengths of all queries is no more than n. With a suffix array each query is O(m log n), so O(n log n) total. From some naive napkin sketches I believe O(n) is achievable with a hash-based structure, with amortized O(m) time per query. I'm not really interested in inventing a novel algorithm, though, if this is a solved problem.]
#![feature(test)]

extern crate test;

use serde_json::Value;
use suffix::SuffixTable;

fn setup() -> (String, Vec<String>) {
    // https://github.com/serde-rs/json-benchmark/tree/master/data
    const JSON: &str = include_str!("twitter.json");

    let json_value: Value = serde_json::from_str(JSON).unwrap();
    let yaml = serde_yaml::to_string(&json_value).unwrap();

    // Collect every string that the deserializer would produce.
    let mut strings = Vec::new();
    let mut stack = Vec::new();
    stack.push(json_value);
    while let Some(value) = stack.pop() {
        match value {
            Value::Null | Value::Bool(_) | Value::Number(_) => {}
            Value::String(string) => strings.push(string),
            Value::Array(array) => stack.extend(array),
            Value::Object(object) => {
                for (key, value) in object {
                    strings.push(key);
                    stack.push(value);
                }
            }
        }
    }

    (yaml, strings)
}

#[bench]
fn bench_positions(b: &mut test::Bencher) {
    let (yaml, strings) = setup();
    let index = SuffixTable::new(&yaml);
    b.iter(|| {
        for string in &strings {
            test::black_box(index.positions(string));
        }
    });
}

#[bench]
fn bench_any_position(b: &mut test::Bencher) {
    let (yaml, strings) = setup();
    let index = SuffixTable::new(&yaml);
    b.iter(|| {
        for string in &strings {
            test::black_box(index.any_position(string));
        }
    });
}