Open briot opened 11 months ago
Hi, that's not a use case I'd thought of!
Doing a quick bit of searching, the following bit from str::to_lowercase()
makes me think this would not be easy to implement:
Since some characters can expand into multiple characters when changing the case, this function returns a String instead of modifying the parameter in-place.
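To make the quoted point concrete, here is a small sketch (the helper name is made up for illustration) that measures the UTF-8 length of a string before and after lowercasing. For `İ` (U+0130) the lowercase form is `i` followed by a combining dot (U+0307), so the result is longer than the input and could not be written back in place.

```rust
/// Hypothetical helper: UTF-8 byte length of `s` before and after
/// lowercasing with the standard library's `str::to_lowercase`.
fn len_before_and_after_lowercase(s: &str) -> (usize, usize) {
    // `to_lowercase` allocates a new String, because the result
    // may need more bytes than the input.
    (s.len(), s.to_lowercase().len())
}
```

With plain ASCII input such as `"ADA"` both lengths are equal; with `"\u{130}"` the second length is larger than the first.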
I am relatively new to Rust, but that was also my feeling... to_lowercase() does allocate a new string, so we would want to avoid it. Changing the ustr API so that it takes not a &str but a type that derefs to str and provides its own case-insensitive equality is not obviously feasible either.
Full unicode support for case-insensitive equality is definitely not trivial, and requires external libraries. In the case of programming languages, str::make_ascii_lowercase
would mostly be good enough to detect keywords, but not enough for variable names for instance.
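For the keyword case, a minimal sketch could rely on `str::eq_ignore_ascii_case` from the standard library, which compares without allocating. The keyword list and the `is_keyword` name are assumptions made for illustration, not part of ustr.

```rust
/// Illustrative subset of Ada reserved words, stored in lowercase.
const KEYWORDS: [&str; 4] = ["begin", "end", "is", "procedure"];

/// Hypothetical helper: true if `word` is one of KEYWORDS,
/// ignoring ASCII case only. No allocation is performed.
fn is_keyword(word: &str) -> bool {
    KEYWORDS.iter().any(|k| k.eq_ignore_ascii_case(word))
}
```

This treats `BEGIN` and `Begin` as the same keyword, but ASCII-only comparison does not consider `É` and `é` equal, which is why it falls short for identifiers.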
Actually, thinking about it, the underlying representation doesn’t need to change, it’s just the point of Ustr creation that does. Seems like it ought to be possible to have a Ustr::case_insensitive() associated function that converts to lowercase before storage and panics if the str isn’t ascii…
(If you’re ok with being limited to ascii that is)
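A rough sketch of that idea, using a toy cache rather than the real ustr internals. `ToyCache` and its `case_insensitive` method are hypothetical names; note that this naive version still allocates a temporary String on every call, which a real implementation would want to avoid.

```rust
use std::collections::HashSet;

/// Toy stand-in for a string cache. This is NOT how ustr is implemented.
#[derive(Default)]
struct ToyCache {
    entries: HashSet<String>,
}

impl ToyCache {
    /// Hypothetical constructor: lowercases ASCII input before storage
    /// and panics if the input is not ASCII.
    fn case_insensitive(&mut self, s: &str) -> &str {
        assert!(s.is_ascii(), "case_insensitive only supports ASCII input");
        // Temporary allocation; only kept if the string is new.
        let lowered = s.to_ascii_lowercase();
        if !self.entries.contains(&lowered) {
            self.entries.insert(lowered.clone());
        }
        // The entry is guaranteed to exist at this point.
        self.entries.get(&lowered).unwrap()
    }
}
```

Interning `"FOO"` and then `"foo"` leaves a single entry, `"foo"`, in the cache.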
As I mentioned, ASCII is definitely fine in the case of keywords. Identifiers though can in theory include various unicode characters (I did not look up the exact rules yet, I must say). Think of "pi" for instance. I know of some code using French accented letters, or code using Russian names for variables. Mostly people are encouraged to use consistent casing but of course nothing forces them, and the compiler is pretty happy with that. So it would be nice if ustr supported that too.
One thing we do not need is to preserve the original casing. So if I create a Ustr from "FOO" it is definitely ok if it is printed as "foo".
In practice, I think there should be a way to build the hash by iterating over each character of the &str and converting it to lowercase on the fly (so no memory allocation, but computing the hash is slower :-( ).
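A minimal sketch of hashing with on-the-fly lowercasing. `DefaultHasher` is only a stand-in for whatever hasher the cache actually uses, and `case_insensitive_hash` is a hypothetical name. `char::to_lowercase` returns an iterator, so nothing is allocated. Per-character lowercasing is also not full Unicode case folding, so treat this as an approximation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical helper: hashes `s` as if it had been lowercased,
/// without building the lowercased string.
fn case_insensitive_hash(s: &str) -> u64 {
    // Stand-in hasher; an assumption, not the hasher ustr uses.
    let mut hasher = DefaultHasher::new();
    // One input char may expand to several lowercase chars,
    // hence `flat_map` rather than `map`.
    for c in s.chars().flat_map(char::to_lowercase) {
        hasher.write_u32(c as u32);
    }
    hasher.finish()
}
```

With this, `"Put_Line"` and `"PUT_LINE"` produce the same hash value.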
And then presumably replacing the "==" at line 110 of stringcache.rs so that it iterates over characters. Also slower than std::eq of course.
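The character-by-character comparison could look roughly like this (the function name is hypothetical, and the sketch says nothing about what stringcache.rs actually contains). It compares the two lowercased character streams lazily, so it allocates nothing and stops at the first mismatch.

```rust
/// Hypothetical replacement for `==`: compares two strings by their
/// lowercased characters, without allocating.
fn eq_ignore_case(a: &str, b: &str) -> bool {
    a.chars()
        .flat_map(char::to_lowercase)
        .eq(b.chars().flat_map(char::to_lowercase))
}
```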
Because of the reduced performance, we need a way for users to opt in to case-insensitivity, and that likely should not be the default...
Thank you for your interest in the subject! :-)
Hello, I am implementing a tool that deals with case-insensitive programming languages (Ada in particular, but also a custom DSL from another company). I wonder whether you have given any thought to supporting such a use case?
Given a &str as read from the source code, with any casing, we should get the same ustr, preferably without requiring memory allocations except of course when this is a new string.
Thanks Emmanuel