andrewheiss / SublimeKnitr

Plugin that adds knitr Markdown and LaTeX support in Sublime Text 2 and 3

Foreign characters in R figures don't work #17

Closed andrewheiss closed 10 years ago

andrewheiss commented 10 years ago

Knitr is apparently really picky about the encoding of the files it builds. If you try to build a file with Unicode characters in a plot using this plugin, R will choke on the characters and return them either as `..` or as their Unicode code points.

Here's a minimal working example (.Rmd):


---
title: "Test"
output: html_document

---

Testing

```{r, echo=FALSE}
plot(cars, main="pučina")
```

Running this with this plugin in ST3 results in the following warning:

Warning message: In native_encode(text) : some characters may not work under the current locale



According to [this](http://stackoverflow.com/questions/15926890/non-english-special-characters-in-knitr), knitr can have the file encoding passed through the `knit()` command, but it has to match the encoding of either the file itself or the system default. Hardcoding `knit(…, encoding='UTF8')` into this plugin's build system isn't recommended, since Windows doesn't play well with UTF8 (apparently) and since it's supposed to match the encoding of the file. Or something.

RStudio gets it right, but that's in part because they've hired Yihui :)

Any ideas on how to run the correct `knit()` command from ST?

ghost commented 10 years ago

Try adding the env variable to the .sublime-build. @randy3k suggested it to me, along with other possible solutions, in response to my own closely related encoding issues involving knitr and Sublime builds, and it has worked like a charm. My own .sublime-build variant now looks like this:

  "variants":
  [
    {
      "name": "Run",
      "working_dir": "$file_path",
      "env": { "LANG": "en_US.UTF-8" },
      "shell_cmd": "Rscript -e \"rmarkdown::render(input = '$file')\""
    }
  ]

With this, I am able to successfully rmarkdown::render() your example, although I do get a few warnings in the rendered document:

[screenshot: 2014-06-30 at 22:09:03]

Trying a simple knit() after having added the same env variable to SublimeKnitr's default .sublime-build also seems to sort of work, printing the same warnings in the resulting document:

---
title: "Test"
output: html_document
---

Testing

Warning: conversion failure on 'pučina' in 'mbcsToSbcs': dot substituted for <c4>

Warning: conversion failure on 'pučina' in 'mbcsToSbcs': dot substituted for <8d>


![plot of chunk unnamed-chunk-1](figure/unnamed-chunk-1.png)

ghost commented 10 years ago

I guess I should mention that I'm on a Mac; I'm not really sure if this is relevant for folk on Windows.

andrewheiss commented 10 years ago

Ooh, this looks promising. I've been toying around with it for the past hour, trying to get rid of the conversion failure warnings, but to no avail. It's apparently a common problem for R graphics and knitr (see the "Encoding of multibyte characters" section in the knitr manual). It looks like you can take care of the problem by manually specifying an encoding, but there's no UTF-8 encoding (apparently), so I don't know how to best generalize it. I'd love to know how RStudio does it.

ghost commented 10 years ago

Try adding "env": { "LANG": "en_US.UTF-8" } to the default .sublime-build and adding the following chunk before the chunk included in your test document:

```{r, echo = FALSE}
pdf.options(encoding = 'CP1250')
```

How does that work? It seems to have gotten rid of the conversion warnings for me. Cf. [this question](http://stackoverflow.com/questions/13251665/rhtml-warning-conversion-failure-on-var-in-mbcstosbcs-dot-substituted-f) on Stack Overflow.

ghost commented 10 years ago

Using encoding = 'ISOLatin2', instead of encoding = 'CP1250', also seems to work for me.

andrewheiss commented 10 years ago

Fantastic - that works!

The only downside to this is that the user has to select an encoding that fits all the characters they're using in their document. If they use Chinese, Arabic, or Cyrillic characters, they'll need to change it accordingly.
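A rough way to see which single-byte encoding covers a given plot title is simply to try encoding it. The sketch below is in Python rather than R; the mapping from `pdf.options()` encoding names to Python codec names is an assumption, and the helper name `covering_encodings` is hypothetical. It only checks whether every character exists in the code page, not whether the PDF device's fonts will actually render it.

```python
# Assumed mapping from R pdf.options(encoding = ...) names to Python codec names.
R_TO_PYTHON_CODEC = {
    "ISOLatin1": "latin_1",
    "ISOLatin2": "iso8859_2",
    "CP1250": "cp1250",
    "CP1251": "cp1251",
    "KOI8-R": "koi8_r",
}


def covering_encodings(text):
    """Return the R encoding names whose code page contains every character in text."""
    matches = []
    for r_name, codec in R_TO_PYTHON_CODEC.items():
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            # At least one character is missing from this code page.
            continue
        matches.append(r_name)
    return matches


print(covering_encodings("pučina"))
print(covering_encodings("¡Qué tranza o qué!"))
```

For the two strings used in this thread, it reports ISOLatin2 and CP1250 for "pučina" but only ISOLatin1 for the Spanish string, so none of the listed encodings covers both at once.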

andrewheiss commented 10 years ago

However, I just tested it in RStudio and it has the same problem (and same solution; setting pdf.options() in a chunk). So RStudio doesn't have a magic way to make this work—it's subject to the same encoding wonkiness in PDF images.

andrewheiss commented 10 years ago

So, for future reference, adding a separate block with pdf.options() will work. Here's a minimal working example:

---
title: "Test"
output: html_document
---

Testing

```{r, echo=FALSE}
pdf.options(encoding='ISOLatin2')
plot(cars, main="pučina")
```

ghost commented 10 years ago

Maybe this should be a separate issue, or maybe it even falls more within @LaTeXing's jurisdiction, but it is directly related to the foregoing discussion, so I'll just add it here for the moment.

The solution above for Rmd documents does not seem to work for Rtex/Rnw/etc., where "č" and other non-English characters are rendered as ".." or as Unicode; admittedly, I have not yet managed to incorporate the env variable into the .sublime-build.

Input:

\documentclass{article}

\title{Test}
\date{}

%% begin.rcode, 'set-up', include = FALSE
% pdf.options(encoding = 'ISOLatin2')
%% end.rcode

\begin{document}

\maketitle

Testing

%% begin.rcode, 'test_1', echo = FALSE
% plot(cars, main = "pučina")
%% end.rcode

%% begin.rcode, 'test_2'
% print('¡Qué tranza o qué!')
%% end.rcode

\end{document}

Output:

[screenshot: 2014-07-01 at 21:20:40]

andrewheiss commented 10 years ago

Yes, this.

andrewheiss commented 10 years ago

I've been working with another person (not on GitHub) with this exact issue (..s in .Rnw files). He asked a SO question and got an answer that said he should use Cairo, but it's a clunky solution and renders PDFs differently.

However, I don't know if this is a knitr issue. When he runs knitr from the Terminal, everything works great and all characters show up as expected. Building the .Rnw file from ST is where encoding messes up. Perhaps adding "env": { "LANG": "en_US.UTF-8" }, to the LaTeXTools or LaTeXing build systems will make it work right?

ghost commented 10 years ago

I think you may be right about the issue being due to ST rather than to knitr, although I don't know much at all. In my encoding-related question on SO, @randy3k in a comment suggests I run:

import subprocess; print(subprocess.check_output("R -q -e 'Sys.getlocale()'", shell=True).decode('utf8'))

in ST's console and compare the result with what I get from running, in the terminal:

R -q -e 'Sys.getlocale()'

It seems that, for me at least, there is some sort of disconnect (but, again, I don't know much on the subject): ST yields "C", while my terminal gives me "C/UTF-8/C/C/C/C".

Adding "env": { "LANG": "en_US.UTF-8" }, to my .sublime-build variant for .Rmd subsequently fixed that issue, which is why I have repeatedly tried to add it to @LaTeXing's .sublime-build. However, probably due to my own ineptitude, doing so has only resulted in a broken .sublime-build, i.e., one that does nothing but save the open file (no compile, no knit, etc.).

andrewheiss commented 10 years ago

My terminal gives me [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8", while ST gives just [1] "C".

But after creating ~/.Renviron and adding LANG=en_US.UTF-8, ST gives [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

Try doing that and see if the .. problem persists.

ghost commented 10 years ago

Adding LANG=en_US.UTF-8 to ~/.Renviron seems to have mixed results for me¹: "č" is rendered nicely without any warnings in the output .pdf, while "¡" and "é" are simply omitted, i.e., the Unicode code is no longer printed.

Input:

\documentclass{article}

\title{Test}
\date{}

%% begin.rcode, 'set-up', include = FALSE
% pdf.options(encoding = 'ISOLatin2')
%% end.rcode

\begin{document}

\maketitle

Testing. !` \'e

%% begin.rcode, 'test_1', echo = FALSE
% plot(cars, main = "pučina")
%% end.rcode

%% begin.rcode, 'test_2'
% print('¡Qué tranza o qué!')
%% end.rcode

\end{document}

Output:

[screenshot: 2014-07-02 at 13:39:09]

¹ Running the bit of Python in ST gives me [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8" as well.

ghost commented 10 years ago

Compare with the rendered .html from .Rmd:

---
title: "Test"
output: html_document
---

Testing. ¡ é

```{r, 'set-up', include = FALSE}
pdf.options(encoding='ISOLatin2')
plot(cars, main="pučina")
print('¡Qué tranza o qué!')
```

![screenshot 2014-07-02 at 13:42:06](https://cloud.githubusercontent.com/assets/6853773/3462066/95a2ab4c-0229-11e4-8723-3afb6cdee769.png)

andrewheiss commented 10 years ago

Oh, we're so close :)

The missing characters in the actual body of the PDF are probably due to LaTeX. Add this to the preamble: \usepackage[utf8]{inputenc}
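To spot this situation before compiling, a small check along the following lines could help. It is only a sketch in Python: the helper name `missing_utf8_inputenc` is hypothetical, and it assumes the source is already decoded as UTF-8. It does not skip commented-out lines, so a `%`-prefixed `\usepackage` line would still count as present.

```python
import re

# Matches \usepackage[...utf8...]{inputenc}, allowing other options in the brackets.
INPUTENC_UTF8 = re.compile(r"\\usepackage\[[^\]]*\butf8\b[^\]]*\]\{inputenc\}")


def missing_utf8_inputenc(source):
    """True if source has non-ASCII characters but no utf8 inputenc in its preamble."""
    # Everything before \begin{document} is treated as the preamble.
    preamble, _, _ = source.partition(r"\begin{document}")
    has_non_ascii = any(ord(ch) > 127 for ch in source)
    return has_non_ascii and not INPUTENC_UTF8.search(preamble)


without_pkg = "\\documentclass{article}\n\\begin{document}\nTesting. ¡ é\n\\end{document}\n"
with_pkg = without_pkg.replace(
    "\\begin{document}", "\\usepackage[utf8]{inputenc}\n\\begin{document}"
)
print(missing_utf8_inputenc(without_pkg), missing_utf8_inputenc(with_pkg))
```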

ghost commented 10 years ago

That did it! Thanks very much.

\documentclass{article}
\usepackage[utf8]{inputenc} 

\title{Test}
\date{}

%% begin.rcode, 'set-up', include = FALSE
% pdf.options(encoding = 'ISOLatin2')
%% end.rcode

\begin{document}

\maketitle

Testing. !`¡\'eé

%% begin.rcode, 'test_1', echo = FALSE
% plot(cars, main = "pučina")
%% end.rcode

%% begin.rcode, 'test_2'
% print('¡Qué tranza o qué!')
%% end.rcode

\end{document}

[screenshot: 2014-07-02 at 13:58:59]

ghost commented 10 years ago

Summary

andrewheiss commented 10 years ago

Thanks so much for your help!

randy3k commented 10 years ago

very interesting discussion.

randy3k commented 10 years ago

Another possible way to suppress the warnings is to use another graphic device, e.g.,

<<include = FALSE>>=
options(device = "cairo_pdf")
@

andrewheiss commented 10 years ago

Yes, though I had someone else complain that the Cairo output wasn't as clear or nice looking as whatever R's default is.

randy3k commented 10 years ago

@mmarascio it is strange that "env": { "LANG": "en_US.UTF-8" } in the sublime-build doesn't work for you but adding LANG=en_US.UTF-8 to ~/.Renviron does. I believe they should be equivalent, at least in the Sublime environment. Maybe I am wrong.
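The way a build system's "env" entry reaches the process it launches can be illustrated without R at all. This is a minimal Python sketch, assuming the build system simply passes its "env" values into the child's environment; the helper name `lang_seen_by_child` is hypothetical. It starts a child Python process and reports the LANG value that child sees.

```python
import os
import subprocess
import sys


def lang_seen_by_child(lang=None):
    """Start a child process and return the LANG value it sees (the string 'None' if unset)."""
    # Copy the parent environment so the child can still start normally on every platform.
    env = dict(os.environ)
    if lang is None:
        env.pop("LANG", None)  # mimic a GUI launch where LANG is not set
    else:
        env["LANG"] = lang  # mimic "env": { "LANG": ... } in a build file
    out = subprocess.check_output(
        [sys.executable, "-c", "import os; print(os.environ.get('LANG'))"],
        env=env,
    )
    return out.decode("utf-8").strip()


print(lang_seen_by_child())
print(lang_seen_by_child("en_US.UTF-8"))
```

A ~/.Renviron entry works differently, since R itself reads that file at startup, but in both cases the R session ends up with LANG set.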

ghost commented 10 years ago

@randy3k: I'm not sure I understand; both alternatives do seem to work for me (see this relevant comment). Only, in addition, for non-ASCII characters in R plots, I need the preliminary chunk that sets pdf.options and, for non-ASCII characters in knitr output in .Rnw - as should've been evident to me - I need \usepackage[utf8]{inputenc} in the document's preamble.

randy3k commented 10 years ago

I see. Thx for the clarification.