BurntSushi / suffix

Fast suffix arrays for Rust (with Unicode support).
The Unlicense
263 stars 30 forks source link

Add `any_position` to get arbitrary one of the positions #15

Closed dtolnay closed 2 years ago

dtolnay commented 2 years ago

This is 4.5× faster than positions for my use case in the following representative benchmark.

Use case: to deal with serde Deserializers that do not support producing borrowed data (&'de str) we can easily just look up every string produced by the deserializer in the original input, and thereby turn a transient string (&str) into a borrowed string (&'de str) for each one found anywhere in the input. Specifically for serde_yaml, the performance of SuffixTable::new does not matter because we can run it in parallel with yaml-rust's Parser::load step, which will always take longer than the SuffixTable construction.

[If you happen to know: is suffix array even the right data structure for this use case? A key characteristic of the use case is that Σm = O(n) i.e. I only ever look up non-overlapping substrings of the original text, so the sum of lengths of all queries is no more than n. With a suffix array each query is O(m log n) so total O(n log n). From some naive napkin sketches I believe O(n) is achievable with a hash-based structure, with amortized O(m) time per query. I'm not really interested in inventing a novel algorithm though if this is a solved problem.]

#![feature(test)]
extern crate test;

use serde_json::Value;
use suffix::SuffixTable;

fn setup() -> (String, Vec<String>) {
    const JSON: &str = include_str!("twitter.json"); // https://github.com/serde-rs/json-benchmark/tree/master/data
    let json_value: Value = serde_json::from_str(JSON).unwrap();
    let yaml = serde_yaml::to_string(&json_value).unwrap();

    let mut strings = Vec::new();
    let mut stack = Vec::new();
    stack.push(json_value);
    while let Some(value) = stack.pop() {
        match value {
            Value::Null | Value::Bool(_) | Value::Number(_) => {}
            Value::String(string) => strings.push(string),
            Value::Array(array) => stack.extend(array),
            Value::Object(object) => {
                for (key, value) in object {
                    strings.push(key);
                    stack.push(value);
                }
            }
        }
    }

    (yaml, strings)
}

#[bench]
fn bench_positions(b: &mut test::Bencher) {
    let (yaml, strings) = setup();
    let index = SuffixTable::new(&yaml);
    b.iter(|| {
        for string in &strings {
            test::black_box(index.positions(string));
        }
    });
}

#[bench]
fn bench_any_position(b: &mut test::Bencher) {
    let (yaml, strings) = setup();
    let index = SuffixTable::new(&yaml);
    b.iter(|| {
        for string in &strings {
            test::black_box(index.any_position(string));
        }
    });
}
BurntSushi commented 2 years ago

This is on crates.io in suffix 1.3.0.