Is it possible to support more Unicode regex rules?

chenbaiyu0414 commented 5 months ago

It seems the current #[rule] and #[define] attributes do not support \p{..} regex expressions, which would be useful when scanning complex character combinations.

For example, if I want to support Unicode identifiers rather than just ASCII ones, the usual rule is XID_START ~ XID_CONTINUE*, and the code might look like this:

#[derive(Token, Debug, Clone, Copy, PartialEq, Eq)]
#[define(XID_START = r"\p{XID_START}")]
#[define(XID_CONTINUE = r"\p{XID_CONTINUE}")]
#[repr(u8)]
pub enum SourceToken {
    EOI = 0,
    Mismatch = 1,

    #[priority(1)]
    #[rule(XID_START XID_CONTINUE*)]
    Ident,
}

Or maybe ld could provide more common and useful predefined operators, like the $lower and $upper it already provides. Thanks.

Eliah-Lakhin commented 5 months ago

@chenbaiyu0414 Providing more Unicode properties is possible, but implementing the entire Unicode property specification would be a challenge, so I would do it iteratively, based on specific feature requests.

The syntax will be in the form of $<class> for consistency with the existing regex syntax.

My question to you: Do you need this feature specifically for "XID_START"/"XID_CONTINUE" (e.g., as defined in Rust documentation), or did you have other specific classes in mind?

chenbaiyu0414 commented 5 months ago

@Eliah-Lakhin Currently I just need XID_START and XID_CONTINUE. However, I suggest that you might consider providing a set of predefined rules, just like pest does, and you can add more as users come up with additional feature requests.

Pest's built-in rules might serve as a reference for you.

Eliah-Lakhin commented 5 months ago

@chenbaiyu0414 The issue-14-xid-start-and-continue-classes branch contains two new character classes: $xid_start and $xid_continue that should address the issue of Unicode identifiers.

The following code:

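use lady_deirdre::lexis::{SourceCode, Token, TokenBuffer};

#[derive(Token, Copy, Clone, PartialEq, Eq, Debug)]
#[repr(u8)]
enum Tok {
    EOI = 0,
    Mismatch = 1,

    // One XID_Start character followed by any number of XID_Continue characters.
    #[rule($xid_start $xid_continue*)]
    Ident,

    #[rule("|")]
    Sep,
}

fn main() {
    // Tokenize a string that mixes Cyrillic and ASCII identifiers.
    let buf = TokenBuffer::<Tok>::from("букваЩ|123Word456");

    // Print each token together with the text it covers.
    for chunk in buf.chunks(..) {
        println!("{:?}: {:?}", chunk.token, chunk.string);
    }
}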

outputs:

Ident: "букваЩ"
Sep: "|"
Mismatch: "123"
Ident: "Word456"
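
To make the output above easier to follow, here is a rough std-only sketch of the same scan written by hand. It is not how Lady Deirdre matches tokens internally. The names Kind, is_start, is_continue, and scan are hypothetical, and char::is_alphabetic / char::is_alphanumeric are used only as approximations of the XID_Start / XID_Continue properties (the real properties differ in some edge cases).

```rust
// Illustrative stand-in only; every name here is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Ident,
    Sep,
    Mismatch,
}

// Assumption: `is_alphabetic` approximates XID_Start.
fn is_start(c: char) -> bool {
    c.is_alphabetic()
}

// Assumption: `is_alphanumeric` plus '_' approximates XID_Continue.
fn is_continue(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

fn scan(input: &str) -> Vec<(Kind, &str)> {
    let mut tokens = Vec::new();
    let mut rest = input;

    while let Some(first) = rest.chars().next() {
        let (kind, len) = if is_start(first) {
            // Ident: one start character, then the longest run of continue characters.
            let tail = &rest[first.len_utf8()..];
            let tail_len: usize = tail
                .chars()
                .take_while(|&c| is_continue(c))
                .map(char::len_utf8)
                .sum();
            (Kind::Ident, first.len_utf8() + tail_len)
        } else if first == '|' {
            (Kind::Sep, 1)
        } else {
            // Mismatch: group everything up to the next character that can begin a token.
            let len: usize = rest
                .chars()
                .take_while(|&c| !is_start(c) && c != '|')
                .map(char::len_utf8)
                .sum();
            (Kind::Mismatch, len)
        };

        let (text, remaining) = rest.split_at(len);
        tokens.push((kind, text));
        rest = remaining;
    }

    tokens
}

fn main() {
    for (kind, text) in scan("букваЩ|123Word456") {
        println!("{:?}: {:?}", kind, text);
    }
}
```

The "123" run ends up as a Mismatch because a digit can continue an identifier but cannot start one, so nothing in this token set can begin at that position.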

In this feature branch, I performed a general refactoring of the Unicode matching subsystem, fixing various edge-case bugs. The main outcome is that after this refactoring, it should be much easier to add new properties upon request.

For now, I have decided to limit the design to a small collection of built-in classes that should cover the majority of use cases:

$upper for upper-case letters
$lower for lower-case letters
$num for numeric characters
$space for any whitespace character
$alpha for any alphabetic character
$xid_start/$xid_continue for identifier characters.

However, if there are more property requests, I will consider extending this set to something comparable to Pest.

Additionally, you can define your own composite classes by combining properties: ${alpha | num} matches any character that is alphabetic or numeric. For example, you can define an inline expression using this mechanism, #[define(ALPHANUM = ${alpha | num})], and then use the ALPHANUM symbol inside the rule expressions.
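
One way to picture a composite class is as a union of character predicates: a character belongs to the composite if any member class accepts it. The sketch below uses only Rust's standard char methods. The names Class, alpha, num, any_of, and alphanum are hypothetical, and treating char::is_alphabetic / char::is_numeric as equivalents of $alpha / $num is an assumption, not something this thread confirms.

```rust
// A character class modeled as a plain predicate (hypothetical alias).
type Class = fn(char) -> bool;

// Assumption: these std predicates stand in for $alpha and $num.
fn alpha(c: char) -> bool {
    c.is_alphabetic()
}

fn num(c: char) -> bool {
    c.is_numeric()
}

// Union of classes: the character matches if any member class matches.
fn any_of(classes: &[Class], c: char) -> bool {
    classes.iter().any(|class| class(c))
}

// Rough analogue of #[define(ALPHANUM = ${alpha | num})].
const ALPHANUM: [Class; 2] = [alpha, num];

fn alphanum(c: char) -> bool {
    any_of(&ALPHANUM, c)
}

fn main() {
    for c in ['б', 'W', '7', '|', ' '] {
        println!("{:?}: {}", c, alphanum(c));
    }
}
```

Because the union is a single set test on one character, the order of its members does not matter.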

In particular, this mechanism opens up the possibility of introducing Pest-like syntax for the Script properties. After this inner refactoring, I believe it should be feasible.

This new design comes with certain limitations imposed on the Choice operator (A | B) because the Choice operator in regular grammars does not have order (in contrast to PEG grammars). The $alpha | $num syntax is currently forbidden. However, it can be worked around using composite classes in certain cases: ${alpha | num}. I will consider reducing this limitation in future versions as well.

I would like to hear your feedback. If you think it's working correctly, I will proceed with merging to master.

chenbaiyu0414 commented 5 months ago

@Eliah-Lakhin I have tested the issue-14-xid-start-and-continue-classes branch, and both xid_start and xid_continue work perfectly. I have another suggestion for when the lexer extends its character support to Unicode: uppercase and lowercase mean quite different things in ASCII and in Unicode, so if $upper only meant ASCII uppercase characters, a name like $ascii_upper would be more meaningful. The same goes for $lower and $alpha: giving each class an explicit code page prefix would be better.

Eliah-Lakhin commented 5 months ago

@chenbaiyu0414 Thank you for your review and for the Pest reference. It was very useful. I will proceed with merging to master.

Regarding the names and the ASCII properties, I agree with your point that if we have a wide range of property support, their names should be better thought out. In particular, I agree that $ascii_upper should be part of the extended properties set.

The reason it is not included at this moment is that ASCII properties can be easily expressed using the existing set of features: ['a'..'z', 'A'..'Z'] or #[define(ASCII_ALPHA = ['a'..'z', 'A'..'Z'])]. In the current design, I deliberately keep the API minimalistic to let users define their own domain-specific symbols through the #[define(...)] syntax, which would better fit their programming language's design. The reason the Unicode classes (such as $upper or $alpha) have been introduced is that they cannot be expressed otherwise.
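
The difference between the two kinds of classes can be sketched with Rust's standard library alone. ASCII classes are a couple of explicit ranges, while the Unicode ones need property tables. The function names below are hypothetical, and using char::is_uppercase / char::is_alphabetic as stand-ins for $upper / $alpha is an assumption.

```rust
// ASCII classes fit in explicit ranges, like ['a'..'z', 'A'..'Z'] in a #[define].
fn ascii_alpha(c: char) -> bool {
    matches!(c, 'a'..='z' | 'A'..='Z')
}

fn ascii_upper(c: char) -> bool {
    matches!(c, 'A'..='Z')
}

// Unicode classes rely on property tables (assumed stand-ins for $upper and $alpha).
fn unicode_upper(c: char) -> bool {
    c.is_uppercase()
}

fn unicode_alpha(c: char) -> bool {
    c.is_alphabetic()
}

fn main() {
    for c in ['W', 'w', 'Щ', 'щ', '5'] {
        println!(
            "{:?}: ascii_alpha={} ascii_upper={} unicode_alpha={} unicode_upper={}",
            c,
            ascii_alpha(c),
            ascii_upper(c),
            unicode_alpha(c),
            unicode_upper(c),
        );
    }
}
```

A character such as 'Щ' is uppercase under the Unicode predicate but falls outside both ASCII ranges, which is the naming concern raised earlier in the thread.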

But I like the idea of Pest's Script Properties. I will consider introducing a similar set of properties with better name prefixes in the next major update of Lady Deirdre.

Eliah-Lakhin / lady-deirdre

Is it possible to support more Unicode regex rules? #14