Closed: hoelzro closed this issue 5 years ago
This definitely has an appeal should we decide to start detecting Unicode encodings, but I don't think we need to bother ignoring it otherwise. If I remember correctly, the BOM is a zero-width, non-printing character, so it shouldn't screw anything up.
I want this very much. My use case is searching a code base for $work, developed on Windows, in which files are either UTF-8 or UTF-16. I'm currently resorting to running iconv on all the UTF-16 files. I'm able and willing to submit a pull request, but I could use some guidance as to what the right solution would look like. I'm thinking something like "ack --encoding=BOM,utf8,UTF-16LE", meaning: if the file has a BOM, use that; otherwise try UTF-8. Since I've specified one more possible encoding, first silently scan the entire file (assuming it's seekable; otherwise warn) for invalid UTF-8 byte sequences, and if there are any, assume UTF-16LE.
@kstarsinic Thanks for volunteering, but there are two things we need to address:

1. I'm very wary of going down this road, not least because I don't know anything about Unicode, and I would be left to maintain code that I don't understand.
2. How does grep handle this situation?
grep looks at the current locale and evaluates every file it visits as if it were encoded according to LC_CTYPE. However, grep requires that the system have a localedef(1) for the given locale (after browsing the source code, I believe that GNU "grep -P", when compiled with the 16-bit version of libpcre, may not need the localedef); I am unaware of any system on the planet that has a system definition for UTF-16.
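A quick illustration of why no system ships a UTF-16 locale (my framing, not from the thread): POSIX locale encodings must keep NUL as the string terminator and encode ASCII bytes as themselves, but UTF-16 represents every ASCII character with an embedded NUL byte, so it cannot serve as an LC_CTYPE charmap.

```python
# UTF-8 is ASCII-compatible and NUL-free for ordinary text;
# UTF-16LE embeds a NUL byte in every ASCII character.
text = "grep"
utf8_bytes = text.encode("utf-8")       # b'grep'
utf16_bytes = text.encode("utf-16-le")  # b'g\x00r\x00e\x00p\x00'
```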
I'm more than happy to provide more detail (up to and including succinct-yet-thorough documentation to include in the build tree). Let me know.
https://github.com/petdance/ack/issues/161