BurntSushi / regex-automata

A low level regular expression library that uses deterministic finite automata.
The Unlicense

WARNING: This repository is now archived. The regex-automata crate now resides at https://github.com/rust-lang/regex

regex-automata

A low level regular expression library that uses deterministic finite automata. It supports a rich syntax with Unicode support, offers extensive options for configuring the best space vs. time trade-off for your use case, and provides cheap deserialization of automata for use in no_std environments.

Minimum Supported Rust Version: 1.41

Dual-licensed under MIT or the UNLICENSE.

Documentation

https://docs.rs/regex-automata

Usage

Add this to your Cargo.toml:

[dependencies]
regex-automata = "0.1"

WARNING: The master branch currently contains code for the 0.2 release, but this README still targets the 0.1 release. Namely, it is recommended to stick with the 0.1 release. The 0.2 release was made prematurely in order to unblock some folks.

Example: basic regex searching

This example shows how to compile a regex using the default configuration and then use it to find matches in a byte string:

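use regex_automata::Regex;

// Compile with the default configuration; unwrap panics on an invalid pattern.
let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
let text = b"2018-12-24 2016-10-08";
// Each match is reported as a (start, end) pair of byte offsets.
let matches: Vec<(usize, usize)> = re.find_iter(text).collect();
assert_eq!(matches, vec![(0, 10), (11, 21)]);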

For more examples and information about the various knobs that can be turned, please see the docs.

Support for no_std

This crate comes with a std feature that is enabled by default. When the std feature is enabled, the API of this crate will include the facilities necessary for compiling, serializing, deserializing and searching with regular expressions. When the std feature is disabled, the API of this crate will shrink such that it only includes the facilities necessary for deserializing and searching with regular expressions.

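The intended workflow for no_std environments is thus as follows:

1. In a program built with the std feature enabled, compile the regex and serialize it.
2. In the no_std environment, with the std feature disabled, deserialize the regex and search with it.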

Deserialization can happen anywhere. For example, with bytes embedded into a binary or with a file memory mapped at runtime.

Note that the ucd-generate tool will do the first step for you with its dfa or regex sub-commands.
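To make the split between compiling and deserializing more concrete, here is a minimal sketch of the general idea using only the standard library. It does not use this crate's API or its serialization format: `TinyDfa`, its byte layout, and all method names are hypothetical, and the "compile" step is a hand-built automaton for `[0-9]+`. Unlike this sketch, which copies the table on load, a real implementation would typically aim to borrow the bytes directly.

```rust
/// Hypothetical table-driven DFA; not regex-automata's types or format.
/// `table[state * 256 + byte]` gives the next state. State 0 is the dead state.
struct TinyDfa {
    start: u16,
    is_match: Vec<bool>,
    table: Vec<u16>,
}

impl TinyDfa {
    /// Stand-in for compilation: hand-build a DFA for `[0-9]+`.
    fn digits() -> TinyDfa {
        let mut table = vec![0u16; 3 * 256];
        for b in b'0'..=b'9' {
            table[256 + b as usize] = 2; // start state (1) -> match state (2)
            table[2 * 256 + b as usize] = 2; // match state loops on digits
        }
        TinyDfa { start: 1, is_match: vec![false, false, true], table }
    }

    /// Layout: state count, start state, one match flag per state,
    /// then every transition as a little-endian u16.
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::new();
        out.extend_from_slice(&(self.is_match.len() as u16).to_le_bytes());
        out.extend_from_slice(&self.start.to_le_bytes());
        out.extend(self.is_match.iter().map(|&m| m as u8));
        for id in &self.table {
            out.extend_from_slice(&id.to_le_bytes());
        }
        out
    }

    /// Decoding needs no regex parser or compiler, only bounds checks.
    fn from_bytes(bytes: &[u8]) -> Option<TinyDfa> {
        if bytes.len() < 4 {
            return None;
        }
        let states = u16::from_le_bytes([bytes[0], bytes[1]]) as usize;
        let start = u16::from_le_bytes([bytes[2], bytes[3]]);
        let rest = &bytes[4..];
        if start as usize >= states || rest.len() != states + states * 256 * 2 {
            return None;
        }
        let (flags, trans) = rest.split_at(states);
        let is_match: Vec<bool> = flags.iter().map(|&f| f != 0).collect();
        let table: Vec<u16> = trans
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        // Reject transitions that point outside the table.
        if table.iter().any(|&s| s as usize >= states) {
            return None;
        }
        Some(TinyDfa { start, is_match, table })
    }

    /// End offset of the longest match that begins at offset 0, if any.
    fn longest_prefix_match(&self, input: &[u8]) -> Option<usize> {
        let mut state = self.start as usize;
        let mut last = if self.is_match[state] { Some(0) } else { None };
        for (i, &b) in input.iter().enumerate() {
            state = self.table[state * 256 + b as usize] as usize;
            if state == 0 {
                break; // dead state: no longer match is possible
            }
            if self.is_match[state] {
                last = Some(i + 1);
            }
        }
        last
    }
}
```

In this sketch, the build-time program would call `TinyDfa::digits().to_bytes()` and write the result to a file, while the constrained target would only need `from_bytes` and `longest_prefix_match`.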

Cargo features

Differences with the regex crate

The main goal of the regex crate is to serve as a general purpose regular expression engine. It aims to automatically balance low compile times, fast search times and low memory usage, while also providing a convenient API for users. In contrast, this crate provides a lower level regular expression interface that is a bit less convenient while providing more explicit control over memory usage and search times.
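As an illustration of the kind of memory vs. search time trade-off that explicit control makes visible, the sketch below contrasts two ways of storing one state's transitions. This is an assumption-labeled example, not this crate's implementation: `DenseState`, `SparseState` and their methods are hypothetical names, and state 0 is assumed to be the dead state.

```rust
use std::mem::size_of;

/// Dense: one entry per possible byte. Constant-time lookup, fixed 512 bytes.
struct DenseState {
    next: [u16; 256],
}

/// Sparse: only byte ranges that lead to a non-dead state are stored.
/// Smaller for most states, but lookup must scan the ranges.
struct SparseState {
    ranges: Vec<(u8, u8, u16)>, // (first byte, last byte inclusive, next state)
}

impl DenseState {
    fn next_state(&self, byte: u8) -> u16 {
        self.next[byte as usize]
    }

    fn size_bytes(&self) -> usize {
        size_of::<[u16; 256]>()
    }
}

impl SparseState {
    /// Coalesce runs of adjacent bytes that share the same non-dead target.
    fn from_dense(dense: &DenseState) -> SparseState {
        let mut ranges: Vec<(u8, u8, u16)> = Vec::new();
        for b in 0..=255u8 {
            let next = dense.next[b as usize];
            if next == 0 {
                continue; // dead transitions are implicit
            }
            if let Some(last) = ranges.last_mut() {
                if last.2 == next && last.1 as u16 + 1 == b as u16 {
                    last.1 = b; // extend the current range
                    continue;
                }
            }
            ranges.push((b, b, next));
        }
        SparseState { ranges }
    }

    fn next_state(&self, byte: u8) -> u16 {
        for &(lo, hi, next) in &self.ranges {
            if lo <= byte && byte <= hi {
                return next;
            }
        }
        0 // anything not covered by a range goes to the dead state
    }

    fn size_bytes(&self) -> usize {
        self.ranges.len() * size_of::<(u8, u8, u16)>()
    }
}
```

Both representations answer the same question for every byte; the dense form spends memory to make each step a single array index, while the sparse form spends a short scan per step to save memory.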

Here are some specific negative differences:

With some of the downsides out of the way, here are some positive differences:

Future work