Fix encoding in website's home

maurolepore commented 4 years ago

https://2degreesinvesting.github.io/r2dii.match/

Relates to https://github.com/2DegreesInvesting/r2dii.data/issues/36

cjyetman commented 4 years ago

These are the characters that seem to get messed up (non-breaking space and curly quotes)...

Unicode code point	character	UTF-8 (hex.)	name
U+00A0		c2 a0	NO-BREAK SPACE
U+2018	‘	e2 80 98	LEFT SINGLE QUOTATION MARK
U+2019	’	e2 80 99	RIGHT SINGLE QUOTATION MARK
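
For anyone who wants to double-check the byte sequences in the table, here is a small sketch in base R. The names `chars` and `utf8_hex` are made up for illustration; the code points are written as `\u` escapes so the script itself stays ASCII.

```r
# Code points from the table, written as escapes so this script is pure ASCII.
chars <- c(nbsp = "\u00a0", left_quote = "\u2018", right_quote = "\u2019")

# enc2utf8() makes sure each string is stored as UTF-8 before its bytes are inspected.
utf8_hex <- vapply(
  chars,
  function(ch) paste(as.character(charToRaw(enc2utf8(ch))), collapse = " "),
  character(1)
)
print(utf8_hex)
```

Each element of `utf8_hex` should match the "UTF-8 (hex.)" column above.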

Seems like when README.Rmd gets processed into README.md, those characters are converted into something appropriate. But when README.md is converted to index.html, they get converted improperly.

So that points to the gh-pages process, where a virtual server is spun-up and the pkgdown package does its magic. Will take some time to dig into that.

cjyetman commented 4 years ago

Locally, I can run pkgdown::build_site() on my macOS system, and those characters come out properly in the index.html. So maybe something in the workflow file pkgdown.yaml needs to be adjusted to make sure the R environment there is in utf-8?
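
One quick way to test the locale hypothesis is to print what the R session reports about its encoding, both locally and in the CI job. This is only a diagnostic sketch using base R; `locale_info` is an illustrative name, and where to call it in the workflow is left open.

```r
# Report what the current R session believes about its locale and encoding.
locale_info <- l10n_info()
cat("LC_CTYPE:", Sys.getlocale("LC_CTYPE"), "\n")
cat("UTF-8 session:", locale_info[["UTF-8"]], "\n")
```

If both environments report a UTF-8 session, the locale is probably not the culprit.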

maurolepore commented 4 years ago

Thanks a lot! That brings me some ideas to fix it. You did enough.

Right now I build the site on github actions. I think the action used macos and I changed it to ubuntu. I see the problem locally on my ubuntu. So I may fix it quickly by building on macos, then think about a solution for ubuntu.

maurolepore commented 4 years ago

We discussed on slack and these are some notes:

My comment above is wrong. The CI uses macos so the problem is elsewhere.
A quick fix might be to edit the .md. Still left with the ultimate problem that causes the .Rmd to render oddly.

cjyetman commented 4 years ago

I did a bunch of hunting around and... I'm fairly certain this is caused by a bug in xml2::read_html which reads a string literal in UTF-8 correctly, but reads a file incorrectly...

# pkgdown::deploy_to_branch
#   ↳ pkgdown::build_site
#     ↳ pkgdown:::build_site_local
#       ↳ pkgdown::build_home
#         ↳ pkgdown:::build_home_md
#           ↳ pkgdown:::render_md
#             ↳ pkgdown:::markdown
#               ↳ xml2:::read_html
#                 ↳ xml2:::read_html.default

text <- "<body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C &euro;</body>"
f <- tempfile()
utf8 <- enc2utf8(text)
con <- file(f, open = "w+", encoding = "native.enc")
writeLines(utf8, con = con, useBytes = TRUE)
close(con)
readLines(f, encoding = "UTF-8")
#> [1] "<body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C &euro;</body>"
xml2::read_html(text)
#> {html_document}
#> <html>
#> [1] <body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C €</body>
xml2::read_html(f)
#> {html_document}
#> <html>
#> [1] <body>brÃ»lÃ©e é¬¼ test 'stuff' and â\u0080\u0098PACTAâ\u0080\u0099 2Â°C  ...
unlink(f)

text <- "<body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C &euro;</body>"
xml2::write_html(xml2::read_html(text), 'test.html')
xml2::read_html('test.html')
#> {html_document}
#> <html>
#> [1] <body>brÃ»lÃ©e é¬¼ test 'stuff' and â\u0080\u0098PACTAâ\u0080\u0099 2Â°C  ...
readLines('test.html', encoding = "UTF-8")
#> [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">"
#> [2] "<html><body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C €</body></html>"
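
The garbled output has the typical shape of UTF-8 bytes being interpreted as Latin-1. The sketch below reproduces that pattern with base R only, without xml2. It illustrates the suspected mechanism and is not a trace of what xml2 does internally; the variable names are made up.

```r
original <- "br\u00fbl\u00e9e \u2018PACTA\u2019 2\u00b0C"

# Take the raw UTF-8 bytes and relabel them as Latin-1 without converting them.
misread <- rawToChar(charToRaw(enc2utf8(original)))
Encoding(misread) <- "latin1"

# Converting the mislabelled string to UTF-8 makes the garbling visible.
garbled <- enc2utf8(misread)
print(garbled)
```

Each two-byte character such as `û` turns into two characters, which is where the `Ã»` and `Ã©` sequences come from.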

cjyetman commented 4 years ago

and this is my pretty minimal reprex of this issue...

install.packages('usethis')
install.packages('pkgdown')
usethis::create_package(getwd(), fields = NULL, rstudio = FALSE, open = FALSE)

usethis::use_readme_md(open = FALSE)
write("brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C &euro;", file = "README.md", append = TRUE)
usethis::use_pkgdown()
pkgdown::build_home(preview = TRUE)

On my macOS, the resulting index.html looks ok. On a brand new RStudio Cloud instance, I get the same encoding garbling. On Windows 10, I get this error...

> pkgdown::build_home(preview = TRUE)
-- Building home ---------------------------------------------------------------
  Writing 'authors.html'
UTF-8 decoding error in C:/Users/cjyetman/Documents/test4/README.md at byte offset 390 (fb).
The input must be a UTF-8 encoded text.
Error: pandoc document conversion failed with error 92
Error: [ENOENT] Failed to remove 'C:/Users/cjyetman/AppData/Local/Temp/RtmpSmRuLn/file71036f51e07.html': no such file or directory
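
The byte `fb` in the pandoc error is how `û` is encoded in Latin-1/Windows-1252, which suggests (as an assumption, not verified here) that `write()` appended the line in the native Windows encoding rather than UTF-8. A sketch of writing the bytes as UTF-8 regardless of platform, using a temporary file as a stand-in for README.md:

```r
line <- "br\u00fbl\u00e9e \u2018PACTA\u2019 2\u00b0C"
readme <- tempfile(fileext = ".md")  # stand-in path; the reprex appends to README.md

# Open the connection in binary append mode and write the UTF-8 bytes unchanged.
con <- file(readme, open = "ab")
writeLines(enc2utf8(line), con = con, useBytes = TRUE)
close(con)

# Read the raw bytes back and confirm they form valid UTF-8.
bytes <- readBin(readme, what = "raw", n = file.size(readme))
written <- rawToChar(bytes)
print(validUTF8(written))
```

With `useBytes = TRUE` the string is not re-encoded on the way out, so the file content does not depend on the session locale.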

🤷‍♂

cjyetman commented 4 years ago

here's an even more minimal reprex that still mangles the '...' in the example README that it creates when run on RStudio Cloud...

usethis::create_package(getwd(), fields = NULL, rstudio = FALSE, open = FALSE)
usethis::use_readme_md(open = FALSE)
pkgdown::build_home(preview = TRUE)

maurolepore commented 4 years ago

That's awesome! Thanks CJ! Do you plan to open an issue in xml2?

(I reopen because I closed unintentionally via a sloppy use of the word "fix" in a commit message.)

cjyetman commented 4 years ago

If you have the bandwidth to do it, please feel free to use this reprex. Not sure if/when I'll get around to it. Feel like I maxed out my time to screw around with this today for the next week or so. 😉

cjyetman commented 4 years ago

actually looks like it's a regression in xml2 v1.3.0 and it's already been reported... https://github.com/r-lib/xml2/issues/293 https://github.com/r-lib/pkgdown/issues/1284 https://github.com/r-lib/pkgdown/issues/1287

maurolepore commented 4 years ago

Best case scenario ;)

cjyetman commented 4 years ago

Looks like this is probably fixed in https://github.com/r-lib/xml2/commit/654385789de3ac2ee08d05555a32147eb8a12457

and xml2 v1.3.1 is already on CRAN
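
To see whether a given machine still has the affected release, something like the following could be run. It assumes the regression was introduced in 1.3.0 and that the fixed release is 1.3.1; `is_affected()` is a hypothetical helper, not part of xml2 or pkgdown.

```r
# Assumed version boundaries, taken from the discussion in this thread.
regression_in <- package_version("1.3.0")
fixed_in <- package_version("1.3.1")

# Hypothetical helper: TRUE when the given version falls in the affected range.
is_affected <- function(version) {
  v <- package_version(as.character(version))
  v >= regression_in && v < fixed_in
}

# Only inspect the installed version if xml2 is actually available.
if (requireNamespace("xml2", quietly = TRUE)) {
  installed <- utils::packageVersion("xml2")
  cat("xml2", as.character(installed), "affected:", is_affected(installed), "\n")
}
```

If it reports an affected version, updating xml2 from CRAN and rebuilding the site would be the next thing to try.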

maurolepore commented 4 years ago

https://github.com/2DegreesInvesting/r2dii.data/issues/36

RMI-PACTA / r2dii.match

Fix encoding in website's home #185