cboettig / knitcitations

:package: Generate citations for knitr markdown and html files
http://carlboettiger.info

Error inserting reference with special characters #74

Open Pakillo opened 9 years ago

Pakillo commented 9 years ago

First, thanks a lot for the package - very useful!

Today I found a problem when inserting a reference by DOI. This is my Rmd:


output: pdf_document

bibliography: references.bib

library(knitcitations)
cleanbib()   
cite_options(citation_format = "pandoc")

Test: r citet("10.1111/j.1461-0248.2007.01060.x").

write.bibtex(file="references.bib")

which gives this error:

Error in utf8ToInt(x) : invalid UTF-8 string
Calls: ... encoded_text_to_latex -> as.vector -> sapply -> lapply -> FUN -> utf8ToInt
Execution halted

I think it may have something to do with the special characters in author names (Müller-Schärer)... Any clue how I could fix this? I couldn't find any help yet.

Many thanks in advance

Paco

My session info:

R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] knitcitations_1.0.5

loaded via a namespace (and not attached):
 [1] bibtex_0.4.0      bitops_1.0-6      digest_0.6.8      htmltools_0.2.6   httr_0.6.1
 [6] lubridate_1.3.3   memoise_0.2.1     plyr_1.8.1        Rcpp_0.11.5       RCurl_1.95-4.5
[11] RefManageR_0.8.45 RJSONIO_1.3-0     rmarkdown_0.5.1   stringr_0.6.2     tools_3.1.3
[16] XML_3.98-1.1      yaml_2.1.13

cboettig commented 9 years ago

I cannot reproduce this error; everything works fine with this citation on my end (see my sessionInfo() below). It looks like it is probably due to your locale settings, which I don't recognize. I'm not familiar with Windows locales, but see ?Sys.setlocale; locales are responsible for how such special characters are parsed.
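As a rough sketch of how the locale could be inspected from within R (base functions only; the variable names are just for illustration), something like the following shows the active character-type locale, whether R considers it a UTF-8 locale, and how a string with accented characters is marked:

```r
# Which locale governs character handling in this session?
ctype_locale <- Sys.getlocale("LC_CTYPE")

# TRUE on a UTF-8 locale (e.g. en_US.UTF-8), FALSE on e.g. a Windows-1252 locale
is_utf8_locale <- l10n_info()[["UTF-8"]]

# A string built from \u escapes is always marked as UTF-8, whatever the locale
author <- "M\u00fcller-Sch\u00e4rer"
author_encoding <- Encoding(author)

print(ctype_locale)
print(is_utf8_locale)
print(author_encoding)
```

On the Debian session below this would report a UTF-8 locale; on a `English_United Kingdom.1252` locale the second value would be expected to be FALSE.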

Here's my whole session:

> library(knitcitations)
> cleanbib()   
> cite_options(citation_format = "pandoc")
> citet("10.1111/j.1461-0248.2007.01060.x")
[1] "@Broennimann_2007"
> write.bibtex(file="references.bib")
Writing 1 Bibtex entries ... OK
Results written to file 'references.bib'
> sessionInfo()
R version 3.1.3 RC (2015-03-06 r67947)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitcitations_1.0.5 bibtex_0.4.0        RefManageR_0.8.45  

loaded via a namespace (and not attached):
 [1] bitops_1.0-6    digest_0.6.8    httr_0.6.1      lubridate_1.3.3
 [5] memoise_0.2.1   plyr_1.8.1      Rcpp_0.11.5     RCurl_1.95-4.5 
 [9] RJSONIO_1.3-0   stringr_0.6.2   tools_3.1.3     XML_3.98-1.1   

Pakillo commented 9 years ago

Thanks for the quick reply!

The problem occurs specifically when calling write.bibtex (which in turn calls RefManageR by @mwmclean ):

> library(knitcitations)
> cleanbib()
> cite_options(citation_format = "pandoc")
> citet("10.1111/j.1461-0248.2007.01060.x")
[1] "@Broennimann_2007"
> write.bibtex(file="references.bib")
Writing 1 Bibtex entries ... Error in utf8ToInt(x) : invalid UTF-8 string
> traceback()
14: utf8ToInt(x)
13: FUN("O. Broennimann and U. A. Treier and H. Müller-Schärer and W. Thuiller and A. T. Peterson and A. Guisan"[[1L]], 
        ...)
12: lapply(X = X, FUN = FUN, ...)
11: sapply(x, do_utf8)
10: as.vector(switch(encoding, latin1 = sapply(x, do_latin1), latin2 = sapply(x, 
        do_latin2), latin9 = sapply(x, do_latin9), `UTF-8` = sapply(x, 
        do_utf8), utf8 = sapply(x, do_utf8), stop("unimplemented encoding")))
9: encoded_text_to_latex(format_author(object[[i]]), "UTF-8")
8: FUN(X[[1L]], ...)
7: lapply(object, format_bibentry1)
6: unlist(lapply(object, format_bibentry1))
5: head(unlist(lapply(object, format_bibentry1)), -1L)
4: toBiblatex(bib, ...)
3: writeLines(toBiblatex(bib, ...), fh)
2: WriteBib(entry, file = file, append = append, ...)
1: write.bibtex(file = "references.bib")

I will investigate with locales. You're probably right that's the root of the problem (e.g. see http://stackoverflow.com/questions/5205159/how-can-i-find-out-the-internal-code-representation-of-a-windows-1252-character).
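For what it's worth, the mismatch can be illustrated without knitcitations at all. This is only a sketch of the general mechanism, not a trace of what RefManageR does internally, and it assumes R >= 3.3.0 for validUTF8(): the byte that encodes "ü" in Windows-1252/Latin-1 is not a valid UTF-8 sequence on its own, so anything that treats those bytes as UTF-8 will reject them.

```r
# "Mül" as Windows-1252/Latin-1 bytes: 0xFC is u-umlaut in that code page,
# but a lone 0xFC byte is not valid UTF-8.
cp1252_bytes <- rawToChar(as.raw(c(0x4d, 0xfc, 0x6c)))
validUTF8(cp1252_bytes)   # FALSE

# Declaring the source encoding lets iconv() re-encode the bytes properly.
as_utf8 <- iconv(cp1252_bytes, from = "latin1", to = "UTF-8")
validUTF8(as_utf8)        # TRUE
charToRaw(as_utf8)        # 4d c3 bc 6c
```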

I'll let you know if I manage to fix it.

Thanks!

Pakillo commented 9 years ago

Well, it seems Windows doesn't help to get this sorted... But at least I managed to make it work following your suggestion of changing locales: Sys.setlocale("LC_ALL", locale = "C"). Although the special characters are still not parsed correctly in the final reference, at least the pdf is produced now.

I paste here the code in case other Windows users find it useful, or someone finds a better solution:

> library(knitcitations)
> cleanbib()
> cite_options(citation_format = "pandoc")
> Sys.setlocale("LC_ALL", locale = "C")
[1] "C"
> citet("10.1111/j.1461-0248.2007.01060.x")
[1] "@Broennimann_2007"
> write.bibtex(file="references.bib")
Writing 1 Bibtex entries ... OK
Results written to file 'references.bib'

The special characters (ü) are not parsed correctly:

Broennimann, O., U. A. Treier, H. **M<U+00FC>ller-Sch<U+00E4>rer**, W. Thuiller, A. T. Peterson, and A. Guisan. 2007. “Evidence of Climatic Niche Shift During Biological Invasion.” Ecol Letters 10 (8). Wiley-Blackwell: 701–9.

So it's not perfect, but at least works. Thanks again for your help
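One alternative that might avoid the `<U+00FC>` escapes, sketched here with base R only and not tested against write.bibtex itself: convert the text to UTF-8 and write the bytes as they are, so the connection does not re-encode them into the native locale. The file name and the author string are just illustrative.

```r
# Example text containing non-ASCII characters (UTF-8 via \u escapes)
authors <- "H. M\u00fcller-Sch\u00e4rer"

# Write the UTF-8 bytes unchanged, independent of the session locale
bib_file <- tempfile(fileext = ".bib")
writeLines(enc2utf8(authors), bib_file, useBytes = TRUE)

# Read back, telling R that the file content is UTF-8
read_back <- readLines(bib_file, encoding = "UTF-8")
read_back == authors   # TRUE
```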

cboettig commented 9 years ago

No problem. Windows really should have some locale that supports UTF-8 -- have you tried asking on stackoverflow on this?

Pakillo commented 9 years ago

Hi Carl,

An update on this issue. I have tried with many references and am not getting an error in write.bibtex anymore (maybe after upgrading to R 3.2.0?). So that's good :)

Errors still happen when pandoc attempts to produce final pdf with bibliography, in cases when some of the references produced by knitcitations contain 'strange' characters. An example:


output: pdf_document

bibliography: references.bib

library(knitcitations)
cleanbib()   
cite_options(citation_format = "pandoc")

This is a test r citet("10.1111/nph.12929").

References

write.bibtex(file="references.bib")

This Rmd is knitted to md successfully, but then RStudio gives the following error:

! Undefined control sequence.
l.116 Francisco Rodr\iguez

pandoc.exe: Error producing PDF from TeX source
Error: pandoc document conversion failed with error 43

When you look at references.bib you can see that some author names include strange characters:

@Article{Gavin_2014,
  doi = {10.1111/nph.12929},
  url = {http://dx.doi.org/10.1111/nph.12929},
  year = {2014},
  month = {jul},
  publisher = {Wiley-Blackwell},
  volume = {204},
  number = {1},
  pages = {37--54},
  author = {Daniel G. Gavin and Matthew C. Fitzpatrick and Paul F. Gugger and Katy D. Heath and Francisco Rodr\'\iguez-S{\a'a}nchez and Solomon Z. Dobrowski and Arndt Hampe and Feng Sheng Hu and Michael B. Ashcroft and Patrick J. Bartlein and Jessica L. Blois and Bryan C. Carstens and Edward B. Davis and Guillaume {de Lafontaine} and Mary E. Edwards and Matias Fernandez and Paul D. Henne and Erin M. Herring and Zachary A. Holden and Woo-seok Kong and Jianquan Liu and Donatella Magri and Nicholas J. Matzke and Matt S. McGlone and Fr{\a'e}d{\a'e}rik Saltr{\a'e} and Alycia L. Stigall and Yi-Hsin Erica Tsai and John W. Williams},
  title = {Climate refugia: joint inference from fossil records, species distribution models and phylogeography},
  journal = {New Phytologist},
}

which are causing these errors.

Anyway, I just wanted to update you and let you know that write.bibtex works fine now, even though I'm still getting errors later (with pandoc). But at least now it's not that difficult to correct these weird characters in the references.bib file manually before calling pandoc.
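The manual correction could probably be scripted. The sketch below is only an illustration: `latex_to_utf8` and `replace_latex_accents` are hypothetical names, the lookup table covers just the escapes that appear in the entry above, and whether pandoc then accepts the resulting file has not been verified here.

```r
# Hypothetical lookup table: LaTeX accent escapes seen in the entry above,
# mapped to the UTF-8 characters they stand for. Extend as needed.
latex_to_utf8 <- c(
  "{\\a'a}" = "\u00e1",  # a with acute accent
  "{\\a'e}" = "\u00e9",  # e with acute accent
  "\\'\\i"  = "\u00ed"   # i with acute accent
)

# Replace each escape literally (fixed = TRUE, so the patterns are not regexes).
replace_latex_accents <- function(lines, table = latex_to_utf8) {
  for (escape in names(table)) {
    lines <- gsub(escape, table[[escape]], lines, fixed = TRUE)
  }
  lines
}

author <- "Francisco Rodr\\'\\iguez-S{\\a'a}nchez and Fr{\\a'e}d{\\a'e}rik Saltr{\\a'e}"
cleaned <- replace_latex_accents(author)
print(cleaned)
```

To apply this to the whole file, the lines could be read with readLines("references.bib", encoding = "UTF-8"), passed through the function, and written back with writeLines(..., useBytes = TRUE) before pandoc runs.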

I'll come back if I find a solution to this. Feel free to close this issue if you think it's not related to knitcitations anymore.

Thanks!

elbamos commented 8 years ago

Guys, was there ever a resolution to this? As @rudolfli's link shows, this is causing an issue downstream. If there's a workaround, I can implement it in that package?

Pakillo commented 8 years ago

Hi,

I recall it was a Windows-specific issue, hard to solve because of Windows' intrinsic limitations (there are lots of threads on Stack Overflow about UTF-8 and Windows). What I tried was post-processing the BibTeX references downloaded by knitcitations to deal with the special characters before they are processed by pandoc.

I paste below the function I made to go over all references and convert problematic fields to UTF-8 (using iconv); but I don't remember if it worked fine in all cases:

#' Encode author names in UTF-8.
#'
#' Encode author names in BibEntry objects as UTF-8. Especially useful when working on Windows systems that do not support UTF-8.
#'
#' @import RefManageR
#' @param refs A BibEntry object.
#' @export
#' @return A BibEntry object.
#' @examples \dontrun{
#' library(knitcitations)
#' cleanbib()
#' cite_options(citation_format = "pandoc")
#' #citet("10.1111/nph.12929") # doesn't work
#' citep("10.1016/j.tree.2006.09.010")
#' citet("10.1111/j.1461-0248.2007.01060.x")
#' ref <- knitcitations:::get_bib()
#' ref.utf8 <- BibEntry_to_UTF8(ref)
#'
#'}

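BibEntry_to_UTF8 <- function(refs){

  # seq_along() is safe when refs is empty, unlike 1:length(refs)
  for (i in seq_along(refs)){
    # Collapse the author list into a single "A and B" string, then re-encode
    # it from the native encoding to UTF-8
    authors <- paste(refs[[i]]$author, collapse = " and ")
    refs[[i]]$author <- iconv(authors, to = "UTF-8")

    # Title and journal are plain character fields; re-encode them the same way
    refs[[i]]$title <- iconv(refs[[i]]$title, to = "UTF-8")
    refs[[i]]$journal <- iconv(refs[[i]]$journal, to = "UTF-8")
  }

  refs

}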

Hope this helps somehow. I'd be grateful if you find a solution to this!

billdenney commented 4 years ago

I hit the same issue here today, but I don't quite follow why it's an issue that cannot be resolved. When I run readLines() on Windows without specifying the encoding, I have problems with the Unicode characters, but when I run readLines() with the encoding, I get the expected characters. Unfortunately, I don't see a way to give the text output from readLines() to citep().

Rename this from .txt to .bib: Janssen_2013.txt

# Bad
readLines("Janssen_2013.bib")
# Good
readLines("Janssen_2013.bib", encoding="UTF-8")
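A small base R sketch of what differs between the two calls (the file and its content are made up for illustration): the `encoding` argument of readLines() does not convert anything, it only marks the strings as UTF-8, so without it R treats the same bytes as being in the native encoding, which on a Windows-1252 locale shows up as garbled characters.

```r
# Create a small UTF-8 file to read back (content is illustrative)
bib_file <- tempfile(fileext = ".bib")
writeLines(enc2utf8("author = {Fr\u00e9d\u00e9rik Saltr\u00e9},"), bib_file, useBytes = TRUE)

# Without `encoding`, the strings carry no encoding mark and are
# interpreted in the session's native encoding
unmarked <- readLines(bib_file)
Encoding(unmarked)   # "unknown"

# With `encoding`, the same bytes are marked as UTF-8
marked <- readLines(bib_file, encoding = "UTF-8")
Encoding(marked)     # "UTF-8"

# The underlying bytes are identical in both cases
identical(charToRaw(unmarked), charToRaw(marked))   # TRUE
```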