Text encoding issue - Githubissues

GoogleCodeExporter commented 8 years ago

There is a comment on MacUpdate that QLCC only works properly for latin1 encoded text.
Highlight has the -u option, described as "define output encoding which matches input file encoding; omit encoding info if enc=NONE".  I could just set that to UTF-8, but then we would have problems with other encodings.  I think there's a reasonable default policy of "try UTF-8, fall back to latin1 if that fails."  This is not really my area of expertise, so feedback would be appreciated.
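
For illustration only, here is a minimal Python sketch of the "try UTF-8, fall back to latin1" policy. It is not QLCC code; the function name `decode_with_fallback` is made up for this example. It relies on the fact that strict UTF-8 decoding rejects most latin1 text containing non-ASCII bytes, while latin1 decoding accepts any byte sequence.

```python
from typing import Tuple


def decode_with_fallback(data: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used), trying strict UTF-8 before Latin-1.

    Hypothetical helper for illustration; not part of QLCC or highlight.
    """
    try:
        # Strict decoding raises on byte sequences that are not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return data.decode("latin-1"), "latin-1"
```

The fallback never raises, so the caller always gets some text back, though it may be wrong for files that are in neither encoding.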

Original issue reported on code.google.com by n8gray@gmail.com on 7 Jan 2008 at 10:28

GoogleCodeExporter commented 8 years ago

I think the lack of UTF-8 interpretation is also why the UTF-8 BOM is visible (shown as "ï»¿" in the beginning of the UTF-8 files).
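
A small Python sketch of why the BOM shows up that way, assuming the bytes are being read as latin1: the three BOM bytes (EF BB BF) each map to a printable latin1 character. The sample content "hello" is made up for the example.

```python
import codecs

# A UTF-8 file that starts with a byte order mark (bytes EF BB BF).
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Misread as Latin-1, the three BOM bytes become visible characters.
as_latin1 = raw.decode("latin-1")

# The "utf-8-sig" codec decodes UTF-8 and drops a leading BOM if present.
as_utf8 = raw.decode("utf-8-sig")
```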

Original comment by tik...@gmail.com on 25 Jan 2008 at 1:15

GoogleCodeExporter commented 8 years ago

In SVN I've added a textEncoding option with default UTF-8.  This is less than perfect, I realize, but it'll at least allow people to set QLCC to handle the encoding they view most often.

Original comment by n8gray@gmail.com on 2 Apr 2008 at 9:40

Changed state: Started

GoogleCodeExporter commented 8 years ago

n8gray, great, but WebKit also needs to be told which encoding to use.
Here's the patch to make it work.

Instead of emptydict, we pass a dictionary with kQLPreviewPropertyTextEncodingNameKey set to the default encoding (or UTF-8 if none).

Works great for me.

Original comment by dch...@gmail.com on 31 Jul 2008 at 11:02

Attachments:

patch-GeneratePreviewForURL.diff

GoogleCodeExporter commented 8 years ago

I committed a fix for this similar to what dchest suggested.  I used a different config variable "webkitTextEncoding" because I'm not sure that WebKit and highlight recognize the same text encoding strings.
Let me know if you're happy with the result (once I release it in a day or so).

Original comment by n8gray@gmail.com on 7 Jan 2009 at 10:32

GoogleCodeExporter commented 8 years ago

Perhaps some code from the file(1) command would be helpful?

http://www.opensource.apple.com/darwinsource/10.5.6/file-23/file/src/ascmagic.c

File(1) makes a good attempt to identify text as ASCII, UTF-8, UTF-16,
ISO-8859/latin1, extended ASCII, and (International) EBCDIC.
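
As a rough idea of what such sniffing involves, here is a much-simplified Python sketch. It is not the algorithm from ascmagic.c, it ignores EBCDIC and BOM-less UTF-16, and the function name `sniff_encoding` is invented for this example.

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Guess a text encoding from raw bytes (simplified, illustrative only)."""
    # A byte order mark is the strongest hint available.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Note: a UTF-32 LE BOM also begins with FF FE; this sketch ignores UTF-32.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Any byte sequence decodes as Latin-1, so treat it as the last resort.
        return "iso-8859-1"
```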

Another solution might be to exploit functionality in CoreServices's Text Encoding Manager, which apparently includes an encoding sniffer:

http://developer.apple.com/documentation/Carbon/reference/Text_Encodin_sion_Manager/Reference/reference.html

Original comment by adfergu...@gmail.com on 7 May 2009 at 2:05

GoogleCodeExporter commented 8 years ago

To remove the UTF-8 BOM you can invoke highlight using the --validate-input switch.
This will also disable parsing of binary stuff.

Original comment by andre.si...@gmail.com on 26 Oct 2009 at 8:41

GoogleCodeExporter commented 8 years ago

I've added the --validate-input switch in git.  Thanks Andre!

Original comment by n8gray@gmail.com on 28 Oct 2009 at 6:09

GoogleCodeExporter commented 8 years ago

<code>/usr/bin/file</code> ships with Mac OS X; no need to rip out anything. It's trivial to use it to detect the file encoding: the output from <code>file --mime-encoding -b $FILENAME</code> is the encoding we're after. Here is a little highlight-to-utf8 shell script I put together that pipes the file through GNU recode to turn any text file into highlighted UTF-8:

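<code>#! /bin/zsh
# Usage: highlight-to-utf8 FILE [highlight options...]
file="$1"
shift
ext="${file:e}"                         # file extension (zsh :e modifier)
enc=$(file --mime-encoding -b "$file")  # detected encoding of the input
# Transcode to UTF-8, then highlight using the extension as the syntax name
recode "$enc"..utf8 < "$file" | highlight -S "$ext" "$@"
</code>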

(You can pass through options like -A to make ANSI instead of HTML output, if 
you're running it from a shell window.)
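
For anyone without GNU recode, the detect-then-transcode step can be sketched in Python. This is only an illustration under a few assumptions: `file` is on the PATH, the helper names `detect_encoding` and `to_utf8` are made up, and falling back to latin1 when `file` reports something Python has no codec for (such as "binary") is my own choice, not something the script above does.

```python
import subprocess


def detect_encoding(path: str) -> str:
    """Ask file(1) for the encoding, as in `file --mime-encoding -b PATH`."""
    result = subprocess.run(
        ["file", "--mime-encoding", "-b", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def to_utf8(data: bytes, encoding: str) -> bytes:
    """Transcode data from the given encoding to UTF-8.

    Falls back to Latin-1 if the encoding name is unknown to Python or the
    bytes do not decode cleanly (an assumption made for this sketch).
    """
    try:
        text = data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        text = data.decode("latin-1")
    return text.encode("utf-8")
```

The result of `to_utf8(data, detect_encoding(path))` could then be piped to highlight the same way the shell script does.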

Original comment by oyas...@gmail.com on 7 Aug 2010 at 7:56

col / qlcolorcode

Text encoding issue #12