Automatically exported from code.google.com/p/pandoc

Umlauts in File > conversion into wrong html entities or not possible #84

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Generate a text file (2do.txt) containing the words "für Qualitätsplot" (among other things in markdown syntax)
2. issue "pandoc -s 2do.txt -o 2do.html" in terminal
3. everything works fine, except the html-file now contains "fŸr QualitŠtsplot"
4. issue "markdown2pdf 2do.txt -o 2do.pdf"
5. markdown2pdf doesn't want to convert anything
6. issue "pandoc -s 2do.txt -o 2do.tex"
7. TeX-file contains "fŸr QualitŠtsplot"

What is the expected output? What do you see instead?
3. correct HTML entities
4: working conversion
7: f\"{u}r Qualit\"{a}tsplot

What version of the product are you using? On what operating system?
pandoc-0.46, downloaded today on OS X 10.5.4

Please provide any additional information below.

Original issue reported on code.google.com by david.haberthuer on 21 Aug 2008 at 11:19

GoogleCodeExporter commented 8 years ago
The output you're getting is definitely not correct, but I can't reproduce the problem with pandoc 0.46 on my mac.  All of the conversions above work fine for me.
(Note that pandoc will use UTF-8 characters, not entities, in HTML and LaTeX output when possible.  This is to make things more readable for languages containing lots of accented characters.  Pandoc's default HTML header specifies charset="UTF-8", and the default LaTeX header includes the ucs and inputenc packages.)

How did you install pandoc-0.46?  Through macports?  Self-compiled?

Also, note that pandoc assumes a UTF-8 encoding.  That's the default in
OS X, but you can check by doing 'locale'.  LANG should be a string ending
in "UTF-8".

Original comment by fiddloso...@gmail.com on 21 Aug 2008 at 8:12
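
Since pandoc assumes UTF-8 input, it can help to check the input file itself and not only the locale. A minimal sketch, assuming Python 3 is available; the function names are made up for illustration and are not part of pandoc:

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8(path: str) -> bool:
    """Read a file as raw bytes and check whether it is valid UTF-8."""
    return is_valid_utf8(Path(path).read_bytes())
```

A failing check is conclusive: the file is not UTF-8. A passing check is weaker evidence, because plain ASCII text also passes. For the file from the report, the call would be file_is_utf8("2do.txt").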

GoogleCodeExporter commented 8 years ago
i'm having some problems with the dependencies (haddock) while installing pandoc through macports, so i've compiled it myself.

altogether it's not too big of a problem, at least nothing a simple search&replace can't solve. my locale LANG is "e_CH.UTF-8", so I guess it should work.

an update on the html output. in the source it's actually "fŸr QualitŠtsplot"... the charset of the html-file is also set to "charset=UTF-8", so I don't know what's the problem exactly...

Original comment by david.haberthuer on 21 Aug 2008 at 9:59

GoogleCodeExporter commented 8 years ago
Hm.  Did you compile from the released tarball, or from SVN?

Original comment by fiddloso...@gmail.com on 22 Aug 2008 at 12:40

GoogleCodeExporter commented 8 years ago
i've downloaded it from here, the released tarball.

Original comment by david.haberthuer on 22 Aug 2008 at 7:19

GoogleCodeExporter commented 8 years ago
Unfortunately, I can't reproduce the problem compiling the same tarball on my mac.
Can you try doing everything on the command line, to rule out the possibility that your editor uses a nonstandard encoding?

% ./pandoc
für Qualitätsplot
[Ctrl-D]

Original comment by fiddloso...@gmail.com on 22 Aug 2008 at 7:41

GoogleCodeExporter commented 8 years ago
I've been using TextEdit to edit the text-files before.
the output from the console is
---
loligo:~ habi$ pandoc
für Qualitätsplot gegenüber
<p
>für Qualitätsplot gegenüber</p
>
loligo:~ habi$ 
---
hope that helps

Original comment by david.haberthuer on 22 Aug 2008 at 9:54

GoogleCodeExporter commented 8 years ago
That output seems correct and renders correctly in my browser.
ü is u with an umlaut
ä is a with an umlaut

So it could be that the problem is with your editor?  Perhaps it's set up for an
encoding other than UTF-8?

Original comment by fiddloso...@gmail.com on 22 Aug 2008 at 2:54
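
To illustrate how an encoding mismatch produces this kind of output, here is a small sketch that writes text in one encoding and reads the bytes back in another. The pairing of MacRoman and Windows-1252 is a guess that happens to reproduce the reported string; the thread does not confirm which encodings were actually involved. The helper name misread is made up for illustration.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one codec, then decode the same bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


sample = "für Qualitätsplot"

# MacRoman bytes displayed as Windows-1252 (assumed pairing, see lead-in).
print(misread(sample, "mac_roman", "cp1252"))

# UTF-8 bytes displayed as MacRoman, the situation when an editor
# opens a UTF-8 file while assuming the old Mac encoding.
print(misread(sample, "utf-8", "mac_roman"))
```

The first call yields "fŸr QualitŠtsplot", matching the string in the original report. When both encodings agree, the text comes back unchanged.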

GoogleCodeExporter commented 8 years ago
it's the standard TextEdit on OS X which - at least for the HTML saving options - is set to UTF-8...

Original comment by david.haberthuer on 23 Aug 2008 at 10:12

GoogleCodeExporter commented 8 years ago
The output of your command on the console is correct, so I don't see a problem there.  Can you try the same thing with latex, just to make sure?

pandoc -w latex
für Qualitätsplot gegenüber

Then try:

pandoc -w latex -o test1.tex
für Qualitätsplot gegenüber
[Ctrl-D]
cat test1.tex

Finally try creating test2.txt with the text "für Qualitätsplot gegenüber"
and run it through pandoc, letting the output go to the terminal:

pandoc -w latex test2.txt

This should cover all the bases.

Original comment by fiddloso...@gmail.com on 23 Aug 2008 at 5:45

GoogleCodeExporter commented 8 years ago
did as you asked and it gets weirder and weirder:
---
loligo:~ habi$ pandoc -w latex
für Qualitätsplot gegenüber
[entered Ctrl-D]
für Qualitätsplot gegenüber

loligo:~ habi$ pandoc -w latex -o test1.tex
für Qualitätsplot gegenüber
[entered Ctrl-D]
loligo:~ habi$ cat test1.tex 
für Qualitätsplot gegenüber
loligo:~ habi$ pandoc -w latex test2.txt
für Qualitätsplot gegenüber

loligo:~ habi$ 
---
sooo, everything looks fine, but if i use
"pandoc -s untitled.txt -o untitled.tex"
i get "für qualitätsplot gegenüber" in the TeX-File.
this looks ugly, but at least converts to the correct pdf.
i've doublechecked the encoding in two different editors (smultron and TextEdit), but it doesn't seem to make a difference.
you can find all the files on http://habi.gna.ch/tmp/pandoc/ (text, tex and compiled pdf (compiled with TeXShop))

Original comment by david.haberthuer on 24 Aug 2008 at 12:07

GoogleCodeExporter commented 8 years ago
I think this problem has to do with the editors.  I've tried it with vim, emacs, and TextMate on my mac -- they all show the UTF8 characters correctly.  TextEdit does not, as you point out.  That is probably because TextEdit looks at mac-specific metadata to figure out the encoding of the file.  (See
http://vnoel.wordpress.com/2008/06/18/weird-utf-8-bug-in-quicklook-its-the-ea/.)

I'm not, at this point, going to try to modify pandoc to set the mac-specific metadata when it saves a file (I don't even think the Haskell file libraries provide for this).  So, two solutions for you:

1.  Use another editor.  vim, emacs, TextMate all work fine.
2.  Set TextEdit (or Smultron) to use UTF8 by default.  Preferences -> Open and Save -> Plain text file encoding.

I think that solves the puzzle, so I'll close this bug.

Original comment by fiddloso...@gmail.com on 24 Aug 2008 at 6:44
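
For files that were already saved in a non-UTF-8 encoding, re-encoding them before running pandoc is another option. A minimal sketch, assuming the source file is MacRoman (the thread does not confirm this); the function names and the default encoding are illustrative choices, not anything pandoc provides:

```python
from pathlib import Path


def to_utf8(data: bytes, source_encoding: str = "mac_roman") -> bytes:
    """Decode bytes from the assumed source encoding and re-encode as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src: str, dst: str, source_encoding: str = "mac_roman") -> None:
    """Write a UTF-8 copy of src to dst, leaving the original untouched."""
    Path(dst).write_bytes(to_utf8(Path(src).read_bytes(), source_encoding))
```

Writing to a separate destination file keeps the original intact in case the guessed source encoding turns out to be wrong.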