anderslanglands / ustr

Fast, FFI-friendly string interning for Rust

Case-insensitive strings #38

Open briot opened 11 months ago

briot commented 11 months ago

Hello, I am implementing a tool that deals with case-insensitive programming languages (Ada in particular, but also a custom DSL from another company). Have you given any thought to supporting such a use case?

Given a &str as read from the source code, with any casing, we should get the same ustr, preferably without requiring memory allocations except of course when this is a new string.

Thanks, Emmanuel

anderslanglands commented 11 months ago

Hi, that's not a use case I'd thought of!

Doing a quick bit of searching, the following bit from str::to_lowercase() makes me think this would not be easy to implement:

Since some characters can expand into multiple characters when changing the case, this function returns a String instead of modifying the parameter in-place.

briot commented 11 months ago

I am relatively new to Rust, but that was also my feeling... to_lowercase() does allocate a new string, so we would want to avoid it. It is also not clear how to change the ustr API so that, instead of a &str, it takes a type that derefs to str and provides its own case-insensitive equality. Full Unicode support for case-insensitive equality is definitely not trivial, and requires external libraries. In the case of programming languages, str::make_ascii_lowercase would mostly be good enough to detect keywords, but not enough for variable names, for instance.
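
For the keyword case specifically, the standard library can already compare without allocating. A minimal sketch, assuming a hypothetical KEYWORDS table (the entries are a few Ada reserved words chosen for illustration) and a hypothetical is_keyword helper; only str::eq_ignore_ascii_case comes from std:

```rust
/// Illustrative subset of keywords; a real tool would list them all.
const KEYWORDS: &[&str] = &["begin", "end", "procedure", "function", "package"];

/// Hypothetical helper: returns true if `token` matches a keyword,
/// ignoring ASCII case. No allocation: the comparison is done byte by byte.
fn is_keyword(token: &str) -> bool {
    KEYWORDS.iter().any(|kw| kw.eq_ignore_ascii_case(token))
}
```

This only folds the letters A-Z, so it does not help with identifiers that contain non-ASCII letters.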

anderslanglands commented 11 months ago

Actually, thinking about it, the underlying representation doesn’t need to change, it’s just the point of Ustr creation that does. Seems like it ought to be possible to have a Ustr::case_insensitive() associated function that converts to lowercase before storage and panics if the str isn’t ascii…

anderslanglands commented 11 months ago

(If you’re ok with being limited to ascii that is)
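
A rough sketch of what the ASCII-only idea could look like on the caller's side, without touching ustr itself. The helper name ascii_lowercase_key is hypothetical and not part of the ustr API; it normalises the input and only allocates when the input actually contains upper-case letters:

```rust
use std::borrow::Cow;

/// Hypothetical helper: lower-cases an ASCII string before interning.
/// Panics on non-ASCII input, as proposed above.
/// Borrows the input unchanged when it is already lower case,
/// so no allocation happens in that case.
fn ascii_lowercase_key(s: &str) -> Cow<'_, str> {
    assert!(s.is_ascii(), "case-insensitive interning is limited to ASCII");
    if s.bytes().any(|b| b.is_ascii_uppercase()) {
        Cow::Owned(s.to_ascii_lowercase())
    } else {
        Cow::Borrowed(s)
    }
}

// Assumed usage with the ustr crate (not compiled here):
//     let u = ustr::ustr(&ascii_lowercase_key("FOO"));
//     // `u` would then be the same Ustr as ustr::ustr("foo")
```

Note that mixed-case input still allocates a temporary String even if the lower-cased string is already in the cache, so this does not fully meet the "no allocation unless new" goal from the original request.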

briot commented 11 months ago

As I mentioned, ASCII is definitely fine in the case of keywords. Identifiers though can in theory include various Unicode characters (I did not look up the exact rules yet, I must say). Think of "pi" for instance. I know of some code using French accented letters, or code using Russian names for variables. People are mostly encouraged to use consistent casing, but of course nothing forces them to, and the compiler is pretty happy either way. So it would be nice if ustr supported that too.

One thing we do not need is to preserve the original casing. So if I create a Ustr from "FOO" it is definitely ok if it is printed as "foo".

In practice, I think there should be a way to build the hash by iterating over each letter of the &str and converting it to lower case on the fly (so no memory allocation, but computing the hash is slower :-( ).
And then presumably replacing the "==" at line 110 of stringcache.rs with a comparison that iterates over characters. That is also slower than the standard str equality, of course.

Because of the reduced performance, users need a way to opt in to case-insensitive behaviour, and it likely should not be the default...
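
The hashing and comparison idea above could be sketched roughly as follows. This is not ustr code: hash_lowercase and eq_lowercase are hypothetical names, and std's DefaultHasher is only a stand-in for whatever hasher the string cache really uses. Both functions lower-case each char on the fly via char::to_lowercase, so neither allocates:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical: hashes the lower-cased form of `s` without building it.
fn hash_lowercase(s: &str) -> u64 {
    let mut hasher = DefaultHasher::new(); // stand-in hasher, an assumption
    let mut buf = [0u8; 4]; // scratch space for one UTF-8 encoded char
    // One char may lower-case to several chars, hence flat_map.
    for c in s.chars().flat_map(char::to_lowercase) {
        hasher.write(c.encode_utf8(&mut buf).as_bytes());
    }
    hasher.finish()
}

/// Hypothetical: compares two strings char by char after lower-casing.
fn eq_lowercase(a: &str, b: &str) -> bool {
    a.chars()
        .flat_map(char::to_lowercase)
        .eq(b.chars().flat_map(char::to_lowercase))
}
```

This is per-character lower-casing rather than full Unicode case folding, so for example "Straße" and "STRASSE" still compare as different. Whether that is acceptable depends on the identifier rules of the language being parsed.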

briot commented 11 months ago

Thank you for your interest in the subject! :-)