Closed by maurolepore 4 years ago
These are the characters that seem to get messed up (non-breaking space and curly quotes)...

Unicode code point | character | UTF-8 (hex) | name |
---|---|---|---|
U+00A0 |  | c2 a0 | NO-BREAK SPACE |
U+2018 | ‘ | e2 80 98 | LEFT SINGLE QUOTATION MARK |
U+2019 | ’ | e2 80 99 | RIGHT SINGLE QUOTATION MARK |
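
The byte sequences in the table can be reproduced in base R. This is only a small sketch for checking the values; the variable names `chars` and `utf8_hex` are made up for illustration, and the code points are written as `\u` escapes so the script does not depend on how the source file itself is encoded.

```r
# Code points from the table, written as escapes.
chars <- c(
  "NO-BREAK SPACE"              = "\u00a0",
  "LEFT SINGLE QUOTATION MARK"  = "\u2018",
  "RIGHT SINGLE QUOTATION MARK" = "\u2019"
)

# enc2utf8() makes sure the string is stored as UTF-8;
# charToRaw() then returns its raw bytes.
utf8_hex <- vapply(
  chars,
  function(x) paste(as.character(charToRaw(enc2utf8(x))), collapse = " "),
  character(1)
)

print(utf8_hex)
```

Comparing these bytes with what ends up in a generated file is one way to tell whether the file was written correctly and only read back incorrectly.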
Seems like when README.Rmd gets processed into README.md, those characters are converted into something appropriate. But when converted to index.html, they get converted improperly.
So that points to the gh-pages process, where a virtual server is spun up and the pkgdown package does its magic. Will take some time to dig into that.
Locally, I can run pkgdown::build_site() on my macOS system, and those characters come out properly in the index.html. So maybe something in the workflow file pkgdown.yaml needs to be adjusted to make sure the R environment there is in UTF-8?
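
One way to test that hypothesis is to print the encoding settings of the R session on each machine (local macOS, Ubuntu, the CI runner) and compare them. The sketch below uses only base R; where to run it in the workflow is left open, and it is a diagnostic aid rather than a fix.

```r
# Report the character-encoding settings of the current R session.
ctype <- Sys.getlocale("LC_CTYPE")  # e.g. a locale name ending in "UTF-8"
info <- l10n_info()                 # list with MBCS, UTF-8 and Latin-1 flags

cat("LC_CTYPE:        ", ctype, "\n")
cat("UTF-8 session:   ", info[["UTF-8"]], "\n")
cat("Latin-1 session: ", info[["Latin-1"]], "\n")
```

If both machines report a UTF-8 session and the output still differs, the locale is probably not the cause.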
Thanks a lot! That brings me some ideas to fix it. You did enough.
Right now I build the site on GitHub Actions. I think the action used macOS and I changed it to Ubuntu. I see the problem locally on my Ubuntu machine. So I may fix it quickly by building on macOS, then think about a solution for Ubuntu.
We discussed on Slack and these are some notes:
I did a bunch of hunting around and... I'm fairly certain this is caused by a bug in xml2::read_html, which reads a string literal in UTF-8 correctly, but reads a file incorrectly...
# pkgdown::deploy_to_branch
# ↳ pkgdown::build_site
# ↳ pkgdown:::build_site_local
# ↳ pkgdown::build_home
# ↳ pkgdown:::build_home_md
# ↳ pkgdown:::render_md
# ↳ pkgdown:::markdown
# ↳ xml2:::read_html
# ↳ xml2:::read_html.default
text <- "<body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C €</body>"
f <- tempfile()
utf8 <- enc2utf8(text)
con <- file(f, open = "w+", encoding = "native.enc")
writeLines(utf8, con = con, useBytes = TRUE)
close(con)
readLines(f, encoding = "UTF-8")
#> [1] "<body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C €</body>"
xml2::read_html(text)
#> {html_document}
#> <html>
#> [1] <body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C €</body>
xml2::read_html(f)
#> {html_document}
#> <html>
#> [1] <body>brûlée 鬼 test 'stuff' and â\u0080\u0098PACTAâ\u0080\u0099 2°C ...
unlink(f)
text <- "<body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C €</body>"
xml2::write_html(xml2::read_html(text), 'test.html')
xml2::read_html('test.html')
#> {html_document}
#> <html>
#> [1] <body>brûlée 鬼 test 'stuff' and â\u0080\u0098PACTAâ\u0080\u0099 2°C ...
readLines('test.html', encoding = "UTF-8")
#> [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">"
#> [2] "<html><body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C €</body></html>"
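
The garbled `â\u0080\u0098` in the output above is consistent with the three UTF-8 bytes of ‘ (e2 80 98) each being decoded as a separate Latin-1 character. The sketch below reproduces that pattern in base R. It illustrates the symptom only, under the assumption that a Latin-1 misreading is what happens; it is not a trace of xml2's actual code path.

```r
left_quote <- enc2utf8("\u2018")   # LEFT SINGLE QUOTATION MARK
bytes <- charToRaw(left_quote)     # its UTF-8 bytes: e2 80 98

# Treat the same bytes as ISO-8859-1 (Latin-1) and convert them to UTF-8.
misread <- iconv(rawToChar(bytes), from = "ISO-8859-1", to = "UTF-8")

# Each original byte is now its own code point.
code_points <- sprintf("U+%04X", utf8ToInt(misread))
print(code_points)
```

U+00E2 is â, and U+0080 and U+0098 are control characters, which R prints as `\u0080` and `\u0098`.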
and this is my pretty minimal reprex of this issue...
install.packages('usethis')
install.packages('pkgdown')
usethis::create_package(getwd(), fields = NULL, rstudio = FALSE, open = FALSE)
usethis::use_readme_md(open = FALSE)
write("brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C €", file = "README.md", append = TRUE)
usethis::use_pkgdown()
pkgdown::build_home(preview = TRUE)
On my macOS, the resulting index.html looks ok.
On a brand new RStudio Cloud instance, I get the same encoding garbling.
On Windows 10, I get this error...
> pkgdown::build_home(preview = TRUE)
-- Building home ---------------------------------------------------------------
Writing 'authors.html'
UTF-8 decoding error in C:/Users/cjyetman/Documents/test4/README.md at byte offset 390 (fb).
The input must be a UTF-8 encoded text.
Error: pandoc document conversion failed with error 92
Error: [ENOENT] Failed to remove 'C:/Users/cjyetman/AppData/Local/Temp/RtmpSmRuLn/file71036f51e07.html': no such file or directory
🤷
here's an even more minimal reprex that still mangles the '...' in the example README that it creates when run on RStudio Cloud...
usethis::create_package(getwd(), fields = NULL, rstudio = FALSE, open = FALSE)
usethis::use_readme_md(open = FALSE)
pkgdown::build_home(preview = TRUE)
That's awesome! Thanks CJ! Do you plan to open an issue in xml2?
(I reopen because I closed unintentionally via a sloppy use of the word "fix" in a commit message.)
If you have the bandwidth to do it, please feel free to use this reprex. Not sure if/when I'll get around to it. Feel like I maxed out my time to screw around with this today for the next week or so. 😉
actually looks like it's a regression in xml2 v1.3.0 and it's already been reported... https://github.com/r-lib/xml2/issues/293 https://github.com/r-lib/pkgdown/issues/1284 https://github.com/r-lib/pkgdown/issues/1287
Best case scenario ;)
Looks like this is probably fixed in https://github.com/r-lib/xml2/commit/654385789de3ac2ee08d05555a32147eb8a12457
and the fix is already on CRAN as xml2 v1.3.1
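
To see whether a given machine still has the affected release, the installed xml2 version can be compared against 1.3.0. This is a rough sketch: `xml2_has_regression()` is a hypothetical helper, and it assumes, based only on this thread, that 1.3.0 is the one affected version.

```r
# Hypothetical helper: TRUE when `version` is the release this thread
# identifies as having the regression (assumed to be 1.3.0 only).
xml2_has_regression <- function(version) {
  as.package_version(version) == "1.3.0"
}

if (requireNamespace("xml2", quietly = TRUE)) {
  installed <- utils::packageVersion("xml2")
  if (xml2_has_regression(installed)) {
    message("xml2 ", as.character(installed),
            " is installed; updating with install.packages(\"xml2\") may help")
  } else {
    message("xml2 ", as.character(installed), " is installed")
  }
} else {
  message("xml2 is not installed")
}
```

The same check could be run in the CI job before building the site, since the runner may have a different xml2 version than the local machine.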
https://2degreesinvesting.github.io/r2dii.match/
Relates to https://github.com/2DegreesInvesting/r2dii.data/issues/36