Refactor to fully support UTF-8

andylizi commented 1 year ago

This PR is partially based on #8. When I was testing the UTF-8 fix introduced in #8, I accidently found two entirely new ways for the code to slice inside char boundaries and panic. Feeling the current approach was too brittle, I wanted to try my hand at addressing the issue in a more fundamental way, and this is the result. I know it's a big change and you might prefer something less, er, overhaul-ish :sweat_smile:.

Unfortunately I forgot how to reproduce the aforementioned bugs, since it's been a few days……But I have added unicode tests for all situations I can think of, so it should be alright…

The only breaking change introduced is making write_source_highlighting() accept &str, since logically we can't really format non-UTF8 error messages without significantly more effort anyway.

This is currently marked draft, because I think I can find a way to get rid of the bstr dependency. I'm also a bit unhappy with how error positions are tracked with this approach, and would love to hear any suggestions.

de-vri-es commented 1 year ago

Heh, certainly a big overhaul. I need to look at the details more closely when I have more time.

de-vri-es commented 1 year ago

I've copied your test cases to #8 and fixed the resulting panics with a much smaller diff. I hope you understand why I prefer that approach over a big rewrite of the algorithm itself.

I'm not opposed to changes to the algorithm if there is a clear need for it, but in this case I don't think big changes are needed to fix UTF-8 bugs.

I'd still like to add support for multibyte UTF-8 sequences in variable names, but I think that can be a separate change.

I made sure to preserve you as author for the copied test cases and the #[inline] attributes.

Do you think I missed something important?

andylizi commented 1 year ago

Totally understandable! I left the comments on the other PR.

fizyr / subst

Refactor to fully support UTF-8 #9