cdepillabout / highlight

command line tool for highlighting parts of files that match a regex
http://hackage.haskell.org/package/highlight
BSD 3-Clause "New" or "Revised" License

support utf8 regexes #5

Open cdepillabout opened 7 years ago

cdepillabout commented 7 years ago

It would be nice to support UTF-8 regexes.

Here is an example of a UTF-8 regex with grep:

utf-grep-example

Here is what happens when using highlight:

highlight-bad-utf8

Note that highlight works on a byte-by-byte basis, so it is possible to match Japanese text with a regex if you account for most Japanese characters being 3 bytes in UTF-8:

highlight-good-utf8
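
The difference between matching one byte at a time and matching one character at a time can be sketched outside of highlight. The snippet below is only an illustration in Python using its standard `re` module, not highlight's actual code; the sample string and variable names are made up for the example.

```python
import re

text = "日本語"               # three characters
data = text.encode("utf-8")   # the same text as nine bytes, three per character

print(len(text), len(data))

# Character-oriented matching: "." consumes one whole code point.
char_match = re.match(r".", text)

# Byte-oriented matching: "." consumes a single byte, which is only
# a fragment of the first character.
byte_match = re.match(rb".", data)

# Writing three dots accounts for the 3-byte encoding, so the match
# covers the complete first character again.
three_byte_match = re.match(rb"...", data)
first_char = three_byte_match.group().decode("utf-8")
```

Here `char_match` covers the whole first character, `byte_match` covers only its first byte, and `three_byte_match` decodes back to the first character, which mirrors the workaround of counting 3 bytes per Japanese character.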

When running the three previous examples, my locale settings are as follows:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Setting LC_ALL=en_US.ASCII makes grep ignore UTF-8 and produce the same output as highlight.

cdepillabout commented 7 years ago

Implementing this would require figuring out what options to pass to compileRegexWith.

It would be nice to make highlight work the same way as grep. grep seems to assume ASCII regexes when not running in a UTF-8 locale. However, I haven't thoroughly tested this.
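
As a starting point for that investigation, here is a rough Python sketch of how a tool could decide between byte-oriented and UTF-8 matching from the environment. It is not taken from grep's source. It only encodes the POSIX lookup order for the character-type locale (`LC_ALL`, then `LC_CTYPE`, then `LANG`, with empty values treated as unset), and the helper names `effective_ctype` and `is_utf8_locale` are hypothetical.

```python
def effective_ctype(env):
    """Return the locale name that governs character handling.

    POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
    An empty value (such as `LC_ALL=`) counts as unset.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return "C"  # default locale when nothing is set


def is_utf8_locale(locale_name):
    """Guess from the name alone whether the locale's codeset is UTF-8."""
    # The codeset follows the first "."; an optional "@modifier" may trail it.
    _, _, rest = locale_name.partition(".")
    codeset = rest.split("@", 1)[0]
    return codeset.lower().replace("-", "") == "utf8"


# Environment matching the `locale` output shown earlier in this issue.
utf8_env = {"LANG": "en_US.UTF-8", "LC_ALL": ""}
# Environment with the override mentioned earlier in this issue.
ascii_env = {"LANG": "en_US.UTF-8", "LC_ALL": "en_US.ASCII"}

print(is_utf8_locale(effective_ctype(utf8_env)))
print(is_utf8_locale(effective_ctype(ascii_env)))
```

A real program would more likely call `setlocale` and ask the C library for the codeset instead of parsing the name, so treat the string parsing here as a simplification.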

If someone wanted to write up how grep handles ASCII/UTF-8 regexes based on the locale, that would be a big help. This task should be beginner-friendly.