Use code to generate Unicode-LaTeX character mapping table

nanxstats commented 4 months ago

Fixes #218

This PR adds an internal function in R/utils.R that generates the mapping table and writes it to R/unicode_latex.R.

This eliminates the need for the binary file sysdata.rda and is friendlier to version control.

The new, code-generated data frame is bitwise identical to the version saved in sysdata.rda, except that the int column is of class integer, not numeric.
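For anyone who wants to verify this kind of claim locally, here is a minimal sketch of the comparison. The two data frames are toy stand-ins, and the `latex` column name and row values are assumptions for illustration, not the real table.

```r
# Toy stand-ins for the two tables; the `latex` column name and the rows
# are assumptions for illustration, not the real mapping data.
old <- data.frame(
  int = c(160, 161, 162),  # numeric (double), as described for sysdata.rda
  latex = c("~", "\\textexclamdown", "\\textcent"),
  stringsAsFactors = FALSE
)
new <- old
new$int <- as.integer(new$int)  # integer, as in the code-generated table

# The column class differs, so a strict comparison fails.
strict_match <- identical(old, new)

# After coercing `int` back to numeric, the two tables match exactly.
new_as_numeric <- new
new_as_numeric$int <- as.numeric(new_as_numeric$int)
coerced_match <- identical(old, new_as_numeric)
```

Here `strict_match` is FALSE and `coerced_match` is TRUE, which is the pattern described above: the only difference is the class of the int column.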

Data ingestion issue worth following up

You might want to check the data ingestion logic. I found no record of how the previous version was constructed. I used some ad hoc logic to reproduce an identical table, but it would be good to check whether the data included in the previous version is reasonable, and what specific filters were applied. For example, calling read.table() without quote = "" gives:

Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string

This yields only 1740 rows, versus 2757 rows with quote = "", which also avoids the warning.
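A small sketch of why quote = "" matters, using made-up input rather than the real source file. The three tab-separated rows and the column names are assumptions for illustration. Two of the rows contain a literal quote character, which read.table() would otherwise treat as the start of a quoted string.

```r
# Hypothetical sample input: code point, character, LaTeX command.
# Two rows hold a literal quote character as data.
lines <- c(
  "34\t\"\t\\textquotedbl",
  "39\t'\t\\textquotesingle",
  "65\tA\tA"
)

# quote = "" disables quote processing, so every line becomes one row.
# comment.char = "" likewise keeps a literal "#" from truncating a line.
tbl <- read.table(
  text = lines,
  sep = "\t",
  quote = "",
  comment.char = "",
  col.names = c("int", "chr", "latex"),
  stringsAsFactors = FALSE
)

n_rows <- nrow(tbl)
```

With quote processing disabled, `n_rows` equals the number of input lines, and the int column is read as integer.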

nanxstats commented 4 months ago

@yihui in case you got a minute to review

nanxstats commented 4 months ago

First, I'd prefer using a matrix to write the data, which is a little more compact than the data frame.

Second, I wonder if it's worth the effort to make the file R/unicode_latex.R human-readable. If not, we could consider just calling dump() on the data frame in update_unicode_latex().

I don't have a strong opinion on either point. It's fine to merge the current PR as is.
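For reference, a rough sketch of what the dump() route could look like. The object name `tbl`, the toy rows, and the temp file are assumptions for illustration, not what update_unicode_latex() actually does.

```r
# Toy table; the object name and rows are assumptions for illustration.
tbl <- data.frame(
  int = c(160L, 161L, 162L),
  latex = c("~", "\\textexclamdown", "\\textcent"),
  stringsAsFactors = FALSE
)

# dump() writes an R expression that recreates the object when sourced.
out <- tempfile(fileext = ".R")
dump("tbl", file = out)

# Source the file into a fresh environment and compare with the original.
env <- new.env()
source(out, local = env)
round_trip <- identical(tbl, env$tbl)
```

The dumped file is a single structure() call, so it round-trips the data but is harder to read and review than a hand-laid-out table.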

Great! Thanks. I've applied the changes and updated the table. The matrix version is exactly what we needed to make this less tedious. How I wish there were a row-wise data frame constructor in base R. 😂

Making it human-readable seems to be manageable in this case, so let's just keep it that way.
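A minimal sketch of the matrix approach, assuming a two-column layout. The object name `unicode_latex`, the `latex` column name, and the three rows are illustrative assumptions, not the generated file's actual content.

```r
# Write the data row by row in a character matrix (byrow = TRUE),
# so each source line reads as one record.
m <- matrix(
  c(
    "160", "~",
    "161", "\\textexclamdown",
    "162", "\\textcent"
  ),
  ncol = 2,
  byrow = TRUE,
  dimnames = list(NULL, c("int", "latex"))
)

# Convert to a data frame, restoring the integer type of `int`.
unicode_latex <- data.frame(
  int = as.integer(m[, "int"]),
  latex = m[, "latex"],
  stringsAsFactors = FALSE
)
```

Because a matrix holds a single type, the code points are written as strings and converted with as.integer() afterwards. That is the cost of the more compact row-wise layout.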

Merck / r2rtf

Use code to generate Unicode-LaTeX character mapping table #223
