Encoding should be independent of locale

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?

1. Create a file that contains non-ASCII characters. A single character is
sufficient. You can use the attached ae.txt, which has these contents:
$ hexdump -C ae.txt
00000000  c3 a4                                             |..|
00000002 

2. Set a non-UTF-8 locale:
$ LANG=

3. Run pandoc:
$ pandoc -o ae.html ae.txt 

What is the expected output? What do you see instead?

This should work according to the
[User's Guide](http://johnmacfarlane.net/pandoc/README.html#character-encodings),
which says "All input is assumed to be in the UTF-8 encoding".
It does not mention any dependency on the locale.

But it only works when the locale is set like this:
$ LANG=en_US.utf8 

What version of the product are you using? On what operating system?

$ uname -r
2.6.33-1.slh.5-sidux-amd64
$ pandoc --version
pandoc 1.5.1.1
Compiled with syntax highlighting support for: ... 
$ locale
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL= 
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.12.1
$ ghc-pkg list
/var/lib/ghc-6.12.1/package.conf.d
   Cabal-1.8.0.2
   GLUT-2.1.1.2
   HTTP-4000.0.6
   HUnit-1.2.2.1
   OpenGL-2.2.3.0
   QuickCheck-1.2.0.0
   array-0.3.0.0
   base-3.0.3.2
   base-4.2.0.0
   bin-package-db-0.0.0.0
   bytestring-0.9.1.5
   cairo-0.10.1
   cgi-3001.1.7.1
   containers-0.3.0.0
   directory-1.0.1.0
   dph-base-0.4.0
   dph-par-0.4.0
   dph-prim-interface-0.4.0
   dph-prim-par-0.4.0
   dph-prim-seq-0.4.0
   dph-seq-0.4.0
   editline-0.2.1.0
   extensible-exceptions-0.1.1.1
   fgl-5.4.2.2
   filepath-1.1.0.3
   ghc-6.12.1
   ghc-binary-0.5.0.2
   ghc-prim-0.2.0.0
   glade-0.10.1
   glib-0.10.1
   gtk-0.10.1
   haskell-src-1.0.1.3
   haskell98-1.0.1.1
   hpc-0.5.0.4
   html-1.0.1.2
   integer-gmp-0.2.0.0
   mtl-1.1.0.2
   network-2.2.1.7
   old-locale-1.0.0.2
   old-time-1.0.0.3
   parallel-1.1.0.1
   parsec-2.1.0.1
   pretty-1.0.1.1
   process-1.0.1.2
   random-1.0.0.2
   regex-base-0.93.1
   regex-compat-0.92
   regex-posix-0.93.2
   rts-1.0
   stm-2.1.1.2
   syb-0.1.0.2
   template-haskell-2.4.0.0
   time-1.1.4
   unix-2.4.0.0
   utf8-string-0.3.4
   xhtml-3000.2.0.1
   zlib-0.5.2.0
/home/sebastian/.ghc/x86_64-linux-6.12.1/package.conf.d
   binary-0.5.0.2
   digest-0.0.0.8
   highlighting-kate-0.2.6.2
   json-0.4.3
   pandoc-1.5.1.1
   regex-pcre-builtin-0.94.2.1.7.7
   texmath-0.2.0.3
   xml-1.3.5
   zip-archive-0.1.1.6 

Please provide any additional information below.

See also:
http://groups.google.com/group/pandoc-discuss/browse_thread/thread/8bfb53fb1b59bd1b

Original issue reported on code.google.com by Sebastia...@googlemail.com on 18 Apr 2010 at 9:55

Attachments:

ae.txt

GoogleCodeExporter commented 8 years ago

I forgot to include the error message. Here it is:

$ pandoc -o ae.html ae.txt
pandoc: ae.txt: hGetContents: invalid argument (Invalid or incomplete
multibyte or wide character)

Original comment by Sebastia...@googlemail.com on 18 Apr 2010 at 9:57
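
For illustration only (this is not pandoc's code): the error message above is consistent with the two bytes of ae.txt being decoded under an encoding that cannot represent them. The Python sketch below decodes the same bytes once as UTF-8 and once as ASCII. Treating ASCII as a stand-in for the encoding implied by the POSIX locale is an assumption.

```python
# The two bytes shown in the hexdump of ae.txt.
data = bytes([0xC3, 0xA4])

# Decoded as UTF-8, the two bytes form a single character, U+00E4.
as_utf8 = data.decode("utf-8")

# Decoded as ASCII (assumed here to approximate the POSIX locale's encoding),
# the same bytes are rejected, because both values are above 0x7F.
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(repr(as_utf8), ascii_ok)
```

The file is the same in both cases; only the encoding assumed by the reader changes, which is why the result depends on the locale when the reader takes its encoding from there.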

GoogleCodeExporter commented 8 years ago

See the following from the pandoc man page (and README):

       Pandoc uses the UTF-8 character encoding  for  both  input  and  output
       (unless  compiled  with  GHC  6.12 or higher, in which case it uses the
       local encoding). 

I'm assuming your pandoc was compiled with GHC 6.12. We're in a transitional phase;
once GHC 6.12 is well established, we should be able to get rid of the statement that
pandoc uses UTF-8 for input and output.

Of course, an alternative would be to keep this behavior, even when compiled with GHC
6.12. I'm not sure which is better.

Original comment by fiddloso...@gmail.com on 19 Apr 2010 at 3:34

GoogleCodeExporter commented 8 years ago

Please keep the behaviour of always using UTF-8. That way, you can read files from
other users, whatever locale they have.
Please remove the locale dependence as soon as possible, so that users don't start
creating markdown documents with non-UTF-8 encodings.

Original comment by Sebastia...@googlemail.com on 19 Apr 2010 at 9:50

GoogleCodeExporter commented 8 years ago

+1 to UTF-8. People are, in general, uninformed about encodings. The only sane solution
is to use a fixed encoding everywhere, and UTF-8 seems to be the de facto choice. It is
used by default by the majority of modern text editors. There's absolutely no advantage
in not using UTF-8.

Original comment by joonas.p...@gmail.com on 20 Apr 2010 at 6:34

GoogleCodeExporter commented 8 years ago

We're using pandoc to generate the documents for an open source software project. Our
documents are UTF-8 encoded so that's how they should be interpreted, regardless of
the locale setting of the user who is building our software (they didn't write the
document, we did). So, at the least, I would like to have an option to force the
input encoding to UTF-8.

Original comment by noval...@gmail.com on 20 Apr 2010 at 7:19

GoogleCodeExporter commented 8 years ago

Resolved in fb201a5b46bb49aa57a8462d7ded8ea2ff76be81
Pandoc now assumes UTF-8 in input, and produces UTF-8 in output, no matter what the
locale -- just as it did before GHC 6.12 came around.

Original comment by fiddloso...@gmail.com on 7 May 2010 at 6:07

Changed state: Fixed
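
A minimal sketch of the behaviour described in the resolution, written in Python rather than pandoc's Haskell and not taken from the referenced commit: input is decoded and output is encoded as UTF-8 explicitly, so the locale plays no part. The helper names `read_utf8` and `write_utf8` are hypothetical.

```python
import os
import tempfile


def read_utf8(path):
    """Read a text file as UTF-8, regardless of the process locale."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def write_utf8(path, text):
    """Write text as UTF-8, regardless of the process locale."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Round trip using the two bytes from the attached ae.txt.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "ae.txt")
    with open(source, "wb") as raw:
        raw.write(b"\xc3\xa4")

    text = read_utf8(source)

    target = os.path.join(tmp, "ae.out")
    write_utf8(target, text)
    with open(target, "rb") as raw:
        round_trip = raw.read()
```

Passing the encoding explicitly at every read and write is the design choice the thread argues for: a document's bytes then mean the same thing on every machine that processes it.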
