Closed: hoelzro closed this issue 5 years ago
This definitely has an appeal should we decide to start detecting Unicode encodings, but I don't think we need to bother ignoring it otherwise. If I remember correctly, the BOM is a zero-width, non-printing character, so it shouldn't screw anything up.
I want this very much. My use case is searching a code base for $work, developed on Windows, in which files are either UTF-8 or UTF-16. I'm currently resorting to running iconv on all the UTF-16 files. I'm able and willing to submit a pull request, but I could use some guidance as to what the right solution would look like. I'm thinking something like "ack --encoding=BOM,utf8,UTF-16LE", meaning: if the file has a BOM, use that; otherwise try UTF-8. Since I've specified one more possible encoding, first silently scan the entire file (assuming it's seekable; otherwise warn) for invalid UTF-8 byte sequences, and if there are any, assume UTF-16LE.
@kstarsinic Thanks for volunteering, but there are two things we need to address:

1. I'm very wary of going down this road, not least because I don't know anything about Unicode, and I would be left to maintain code that I don't understand.
2. How does grep handle this situation?
grep looks at the current locale and evaluates every file it visits as if it were encoded according to LC_CTYPE. However, grep requires that the system have a localedef(1) for the given locale (after browsing the source code, I believe that GNU "grep -P", when compiled with the 16-bit version of libpcre, may not need the localedef); I am unaware of any system on the planet that has a system definition for UTF-16.
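A quick illustration of why no system ships a UTF-16 locale (my framing, not from the thread): POSIX locale encodings must keep NUL as the string terminator and encode ASCII bytes as themselves, but UTF-16 represents every ASCII character with an embedded NUL byte, so it cannot serve as an LC_CTYPE charmap.

```python
# UTF-8 is ASCII-compatible and NUL-free for ordinary text;
# UTF-16LE embeds a NUL byte in every ASCII character.
text = "grep"
utf8_bytes = text.encode("utf-8")       # b'grep'
utf16_bytes = text.encode("utf-16-le")  # b'g\x00r\x00e\x00p\x00'
```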
I'm more than happy to provide more detail (up to and including succinct-yet-thorough documentation to include in the build tree). Let me know.
https://github.com/petdance/ack/issues/161