Make accepted variable names configurable and do not use regex for it by default

terrorfisch commented 2 years ago

The regex crate is currently used to check whether a string starts with a valid identifier or is a valid identifier. As far as I know, this is a bit overkill, and it's faster to write simple matches explicitly. I suggest doing something like this:

pub trait VarName {
    fn is_start_character(c: char) -> bool;
    fn is_continue_character(c: char) -> bool {
        Self::is_start_character(c) || matches!(c, '0'..='9')
    }
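    fn try_parse(s: &str) -> Option<(&str, &str)> {
        // The name must begin with a start character; otherwise there is no name.
        let first = s.chars().next().filter(|&c| Self::is_start_character(c))?;
        let start_len = first.len_utf8();
        // Find the first character that cannot continue the name. Unlike
        // split_once, split_at keeps that character in the remainder.
        let end = s[start_len..]
            .find(|c: char| !Self::is_continue_character(c))
            .map_or(s.len(), |i| start_len + i);
        Some(s.split_at(end))
    }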
    fn is_exact_variable_name(s: &str) -> bool {
        s.starts_with(Self::is_start_character) && s.chars().skip(1).all(Self::is_continue_character)
    }
}

pub struct ASCII;

impl VarName for ASCII {
    fn is_start_character(c: char) -> bool {
        matches!(c, 'a'..='z' | 'A'..='Z' | '_')
    }
}

pub struct LatinGreek;

impl VarName for LatinGreek {
    fn is_start_character(c: char) -> bool {
        ASCII::is_start_character(c) || matches!(c, 'α'..='ω' | 'Α'..='Ω')
    }
}

#[cfg(feature = "unicode")]
pub struct Unicode;

#[cfg(feature = "unicode")]
impl VarName for Unicode {
    fn is_start_character(c: char) -> bool {
        unicode_ident::is_xid_start(c)
    }
    fn is_continue_character(c: char) -> bool {
        unicode_ident::is_xid_continue(c)
    }
}

This would go together with the optional dependency unicode-ident = { version = "1", optional = true } and a feature unicode = ["unicode-ident"] in Cargo.toml.

If you think this is a good idea I can make a pull request.
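
To illustrate how such a trait might be used, here is a minimal, self-contained sketch. It restates the trait with the ASCII and LatinGreek implementations. One assumption is that try_parse should keep the delimiter in the remainder, so the sketch uses find and split_at. The inputs are made up for illustration and nothing here is taken from exmex itself.

```rust
/// Sketch of the proposed trait (not the exmex API).
pub trait VarName {
    fn is_start_character(c: char) -> bool;
    fn is_continue_character(c: char) -> bool {
        Self::is_start_character(c) || matches!(c, '0'..='9')
    }
    /// Splits `s` into a leading variable name and the unparsed remainder.
    /// Assumption: the remainder keeps the character that ended the name.
    fn try_parse(s: &str) -> Option<(&str, &str)> {
        let first = s.chars().next().filter(|&c| Self::is_start_character(c))?;
        let start_len = first.len_utf8();
        let end = s[start_len..]
            .find(|c: char| !Self::is_continue_character(c))
            .map_or(s.len(), |i| start_len + i);
        Some(s.split_at(end))
    }
    /// True if the whole string is a single variable name.
    fn is_exact_variable_name(s: &str) -> bool {
        s.starts_with(Self::is_start_character)
            && s.chars().skip(1).all(Self::is_continue_character)
    }
}

pub struct ASCII;

impl VarName for ASCII {
    fn is_start_character(c: char) -> bool {
        matches!(c, 'a'..='z' | 'A'..='Z' | '_')
    }
}

pub struct LatinGreek;

impl VarName for LatinGreek {
    fn is_start_character(c: char) -> bool {
        ASCII::is_start_character(c) || matches!(c, 'α'..='ω' | 'Α'..='Ω')
    }
}

/// Hypothetical helper: returns the leading name under the chosen rule set.
pub fn leading_name<V: VarName>(s: &str) -> Option<&str> {
    V::try_parse(s).map(|(name, _rest)| name)
}
```

A parser could then be made generic over V: VarName, so the accepted alphabet is chosen at compile time, e.g. leading_name::<LatinGreek>("αβ*2").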

terrorfisch commented 2 years ago

Somehow I overlooked literal_matcher_from_pattern, which depends on regex, so dropping the dependency is off the table for now.

bertiqwerty commented 2 years ago

I benchmarked using regexes some time ago. I could never identify compiled regexes as some kind of performance bottleneck.

terrorfisch commented 2 years ago

After testing, I came to the same conclusion. For me, the only remaining reason for this would be to allow custom names beyond Latin and Greek letters, e.g. all valid Python identifiers, although this is far from a pressing issue.
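
As a rough illustration of what accepting Python-style identifiers could look like without extra dependencies, here is a std-only sketch. The function name is hypothetical. It assumes that char::is_alphabetic and char::is_alphanumeric are an acceptable approximation of the Unicode XID_Start and XID_Continue properties that Python's identifier rules are based on. The two are not identical, and the sketch does no normalization.

```rust
/// Hypothetical helper: approximates Python-style identifiers using only std.
/// Assumption: `is_alphabetic` / `is_alphanumeric` stand in for
/// XID_Start / XID_Continue; the real properties differ in edge cases.
pub fn is_python_like_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // The first character must be an underscore or a letter.
        Some(c) if c == '_' || c.is_alphabetic() => {
            // Every following character may also be a digit.
            chars.all(|c| c == '_' || c.is_alphanumeric())
        }
        // Empty strings and any other first character are rejected.
        _ => false,
    }
}
```

For exact behaviour, the unicode-ident functions from the proposal above would be the more faithful choice.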

bertiqwerty / exmex

Make accepted variable names configurable and do not use regex for it by default #47