Rapporter / pander

An R Pandoc Writer: Convert arbitrary R objects into markdown
http://rapporter.github.io/pander/
Open Software License 3.0

Pander can not encode UTF-8 in rows or columns #280

Closed NaserMonsefi closed 5 years ago

NaserMonsefi commented 7 years ago

Hi,

I was using pander with a matrix containing UTF-8 column names and realized that pander cannot recognise them. I dug a little deeper and noticed that pander actually has no problem with UTF-8 characters anywhere except in row or column names. Further, I noticed that pander converts them from UTF-8 to latin1, but for some reason this doesn't happen for row or column names. I made a small matrix to test this, and it looks like this:

image

The encoding for this data shows that the first two are UTF-8 (β), with the longer tail on the beta, and the other two are latin1 (ß), with the chopped beta tail. This is true for the rownames and colnames as well.

image

Now if it is passed to pander, it looks as follows:

image

First, pander converted all the UTF-8 (β) in the matrix to latin1 (ß) and printed them. But for some reason this doesn't happen for row and column names: pander was only able to print the latin1 (ß) correctly there. My question is, first, how can I make sure that pander actually prints UTF-8 in the row and column names as well? Also, it would be preferable if it passed them through as UTF-8 rather than latin1, both in the matrix and in the row and column names.

Thanks, Naser
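
For readers without access to the screenshots, a test matrix along the lines described above might be built as follows. This is only a sketch: the cell values and dimnames are hypothetical stand-ins (the original object is only shown in an image), and the pander call is skipped when the package is not installed.

```r
# Hypothetical reconstruction; the original matrix is only visible in a screenshot.
beta_utf8 <- "\u03b2"                            # Greek beta, marked as UTF-8
sharp_s   <- iconv("\u00df", "UTF-8", "latin1")  # sharp s, marked as latin1

# Column-major fill: the first two cells are UTF-8, the last two latin1
m <- matrix(
  c(beta_utf8, beta_utf8, sharp_s, sharp_s),
  nrow = 2,
  dimnames = list(c(beta_utf8, sharp_s), c(beta_utf8, sharp_s))
)

# Declared encodings of the cells and of the dimnames
cell_enc <- Encoding(m)
col_enc  <- Encoding(colnames(m))
row_enc  <- Encoding(rownames(m))
print(list(cells = cell_enc, cols = col_enc, rows = row_enc))

# Render with pander if it is available
if (requireNamespace("pander", quietly = TRUE)) {
  pander::pander(m)
}
```

How the table is displayed depends on the platform and the locale of the R session, which is what the rest of this thread is about.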

daroczig commented 7 years ago

This report seems to be similar to #228 -- are you on Windows? Can you please share your devtools::session_info()? And also the data object eg via dput.

NaserMonsefi commented 7 years ago

Thanks a lot for coming back to me so quickly. Here is the session info:

image

I am afraid that dput will mess up the Unicode characters, so I uploaded the RDS file here: https://www.dropbox.com/s/t1u20gybxirrmt1/data_utf8.RDS?dl=0

Hopefully this works,

Yours, Naser

daroczig commented 7 years ago

Thanks for the details! Running it here works OK:

pander 280

Although I'm on Linux and using a UTF-8 locale. Can you pls also try to set the locale to UTF-8? pander doesn't do any specific character encoding updates, so I suspect this issue is rather due to the local config. E.g. what if you update the Encoding of the object? Any help is highly appreciated here, as I don't have access to Windows on a regular basis.
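
The checks suggested here could look roughly like the sketch below. The matrix is a hypothetical stand-in for the reporter's object, and the locale name in the commented-out `Sys.setlocale` call is an assumption: which locale names are accepted depends on the platform and the R version.

```r
# Which character set does this R session treat as native?
loc  <- Sys.getlocale("LC_CTYPE")
info <- l10n_info()   # list with flags such as `UTF-8` and `Latin-1`
print(loc)
print(info[c("MBCS", "UTF-8", "Latin-1")])

# Hypothetical stand-in for the reporter's object
m <- matrix(
  c("\u03b2_1", "\u03b2_2"),
  nrow = 1,
  dimnames = list("row_\u03b2", c("\u03b2_a", "\u03b2_b"))
)

# Compare the declared encodings of the table body and of the headers
enc <- list(
  body = Encoding(m),
  cols = Encoding(colnames(m)),
  rows = Encoding(rownames(m))
)
print(enc)

# Switching the locale is platform dependent; the name below is an assumption.
# Sys.setlocale("LC_CTYPE", "en_US.UTF-8")
```

If the body and the headers report different encodings, that would point to the data rather than to pander.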

NaserMonsefi commented 7 years ago

You are absolutely correct, it seems to be a Windows problem. It worked on my Linux vbox. None of the English locales worked either (although they are supposed to be UTF-8). I guess for Windows I might convert the data to the native encoding (latin1) before using pander.

Yours, Naser

NaserMonsefi commented 7 years ago

I think I found the cause of the problem. If I change the Encoding like this, it gives the same wrong output for UTF-8 (β) (forcing the encoding to latin1, which is native):

image

But if I use the enc2native function instead, it doesn't produce the weird character, and all characters are in the latin1 (ß) form.

image

My guess would be that pander somehow uses enc2native for the data in the matrix but uses Encoding for the row and column names to convert to native, creating the incorrect characters. The workaround sort of works: it seems you cannot get UTF-8 characters out of pander on Windows, but you can still convert them to native and then use pander.
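
The difference between the two approaches can be sketched in base R: assigning to `Encoding()` only changes the declared encoding of the existing bytes, while `enc2native()` converts the string to the session's native encoding. The variable names are illustrative, and the result of `enc2native()` depends on the locale, so no output is shown for it.

```r
x <- "\u03b2"   # Greek beta, stored as the UTF-8 bytes ce b2

# Relabelling: the bytes stay the same, only the declared encoding changes
relabelled <- x
Encoding(relabelled) <- "latin1"
same_bytes <- identical(charToRaw(relabelled), charToRaw(x))

# Read as latin1, those two bytes are two separate characters,
# which is where the "weird" output comes from
mojibake <- enc2utf8(relabelled)
n_chars  <- nchar(mojibake, type = "chars")

# Converting: the string is re-encoded for the current locale
converted <- enc2native(x)
print(c(relabelled = Encoding(relabelled), converted = Encoding(converted)))
```

Whether pander itself takes different paths for headers and body is not established in this thread; the sketch only shows why relabelling and converting give different results.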

Yours, Naser

daroczig commented 7 years ago

Might be related to some internal Rcpp stuff, but AFAIK we pass all headers + table body to the same functions. cc @RomanTsegelskyi for confirmation

BTW can you please let me know, @NaserMonsefi, how you created this data.frame? This Windows behaviour (like in #228) to have different encoding for table header and content really freaks me out.

NaserMonsefi commented 7 years ago

I originally noticed the problem when importing a data set using read.delim:

read.delim('..data.csv', sep = ',', stringsAsFactors = FALSE, encoding = 'UTF-8', check.names = F)

The file is encoded in UTF-8 and has header names with the UTF-8 beta in them. Of course, if I use check.names = T, it will encode to "unknown" with more wrong characters. I think I found a solution for my case as mentioned above, but I don't know what is causing it at the OS level.

Yours, Naser
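
Put together, the import and the enc2native workaround from this thread might look like the sketch below. The file contents and column names are hypothetical, a temporary file stands in for the real data, and the pander call is skipped when the package is not installed.

```r
# Write a small UTF-8 file to stand in for the real data (contents are hypothetical)
tmp <- tempfile(fileext = ".csv")
writeLines(enc2utf8(c("\u03b2_1,\u03b2_2", "1,2", "3,4")), tmp, useBytes = TRUE)

# Same arguments as the read.delim call quoted above
d <- read.delim(tmp, sep = ",", stringsAsFactors = FALSE,
                encoding = "UTF-8", check.names = FALSE)

# Workaround from this thread: convert the header to the native encoding
names(d) <- enc2native(names(d))
print(Encoding(names(d)))

if (requireNamespace("pander", quietly = TRUE)) {
  pander::pander(d)
}
unlink(tmp)
```

On a session whose native encoding is already UTF-8 the conversion should leave the names unchanged, so the step matters mainly on setups like the Windows one described here.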

nbarrowman commented 7 years ago

I have been having the same problem, also on Windows. Thanks Naser, enc2native also worked for me.

awfrankwils commented 6 years ago

https://github.com/Rapporter/pander/issues/296#issuecomment-419220848 #296

daroczig commented 6 years ago

I tested #326 in a Windows VM and it seems to do the trick, but please confirm.

daroczig commented 5 years ago

Should be fixed with the above commit.

billdenney commented 4 years ago

@daroczig, I just had the same issue. Is there a way that I could help in some way to release a new version of pander with this fix (and all others that have been made)?

daroczig commented 4 years ago

@billdenney you mean a CRAN release? I will need to look into the CI builder, as it seems to be failing, and do a general check-up on the package ... I have not really touched it for a while. I can do that in a few weeks hopefully, but would appreciate any help: someone running all the tests and R CMD check using the dev version of R etc. and creating a PR for a CRAN release.