AllanCameron / geomtextpath

Create curved text paths in ggplot2
https://allancameron.github.io/geomtextpath

[Unicode] RTL scripts are rendered backwards #32

Closed byteit101 closed 2 years ago

byteit101 commented 2 years ago

Code

ggplot(iris, aes(x = Sepal.Length)) +
  geom_textpath(aes(label = "Sarah is \u05e9\u05e8\u05d4 with ש on R"),
                stat = "density", vjust = -0.2, size = 8, fontface = 1) +
  ylim(0.1, 0.5) +
  annotate("text", label = "expect: Sarah is \u05e9\u05e8\u05d4 with ש on R",
           y = 0.2, x = 5.75)

Expected

Text should show "Sarah is שרה with ש on R" on a line, exactly as shown in the normal text geom at the bottom

Actual

(screenshot: the Hebrew letters are rendered in reverse order along the path)

Notes

Hebrew is used here because it's the example at the top of the BiDi Wikipedia page, but all RTL languages are affected, as are override marks

teunbrand commented 2 years ago

Yes, this is indeed an issue. We can use the textshaping library to do bidirectional text, which uses fribidi for that, but we'd need to figure out how to reliably translate glyphs back into strings.

In the example below, glyphs 9-16 are flipped because it is a run of RTL text.

test <- "Sarah is \u05e9\u05e8\u05d4, \u05e9\u05e8\u05d4 with \u05e9 on R"
x <- textshaping::shape_text(test)$shape
x[9:18, ]
#> # A tibble: 10 x 7
#>    glyph index metric_id string_id x_offset y_offset x_midpoint
#>    <int> <int>     <int>     <int>    <dbl>    <dbl>      <dbl>
#>  1     8     3         1         1     44.0        0       1.66
#>  2    16   676         1         1     47.3        0       3.61
#>  3    15   696         1         1     54.6        0       3.05
#>  4    14   697         1         1     60.7        0       4.16
#>  5    13     3         1         1     69          0       1.66
#>  6    12    15         1         1     72.3        0       1.66
#>  7    11   676         1         1     75.7        0       3.61
#>  8    10   696         1         1     82.9        0       3.05
#>  9     9   697         1         1     89.0        0       4.16
#> 10    17     3         1         1     97.3        0       1.66

Created on 2021-12-08 by the reprex package (v2.0.1)

Alternatively, we could homebrew some simple flipping mechanism for stretches of RTL text, for which you could get some class information like below:

ints <- utf8ToInt(enc2utf8(string))
bidi <- Unicode::u_char_properties(ints, "Bidi_Class")$Bidi_Class
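
For illustration, a naive flip of right-to-left runs based on that class information might look something like the sketch below (flip_rtl_runs is a hypothetical name; it only handles simple runs of R/AL characters and is nowhere near a full BiDi implementation):

flip_rtl_runs <- function(string) {
  ints <- utf8ToInt(enc2utf8(string))
  bidi <- Unicode::u_char_properties(ints, "Bidi_Class")$Bidi_Class
  # reverse each maximal run of right-to-left characters in place
  rtl  <- bidi %in% c("R", "AL")
  runs <- rle(rtl)
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  for (i in which(runs$values)) {
    idx <- starts[i]:ends[i]
    ints[idx] <- rev(ints[idx])
  }
  intToUtf8(ints)
}

# reorders the Hebrew run into visual order
flip_rtl_runs("Sarah is \u05e9\u05e8\u05d4 with \u05e9 on R")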

Either way, we'd end up with an extra dependency if we go this route. Any thoughts about this, Alan?

AllanCameron commented 2 years ago

Yes, this is down to the mechanism by which we calculate glyph positions via systemfonts::shape_string, which seems not to handle RtL text. If the string is passed as a whole block to grid (e.g. by setting keep_straight = TRUE), it renders correctly:

ggplot(iris, aes(x = Sepal.Length)) +
  geom_textpath(aes(label = "Sarah is \u05e9\u05e8\u05d4 with ש on R"),
                stat = "density", vjust = -0.2, size = 8, fontface = 1,
                keep_straight = TRUE) +
  ylim(0.1, 0.5) +
  annotate("text", label = "expect: Sarah is \u05e9\u05e8\u05d4 with ש on R",
           y = 0.2, x = 5.75)

(screenshot: the label renders correctly with keep_straight = TRUE)

The easiest way to do this would be to add a dependency on textshaping, which will automatically order characters correctly (the glyphs can be recovered by matching the index column in the output of shape_string), but I prefer the homebrew method using the Unicode package, since it is more flexible for future development.

byteit101 commented 2 years ago

Alternatively, we could homebrew some simple flipping mechanism for stretches of RTL text, for which you could get some class information like below:

I would like to caution you against re-implementing the BiDi algorithm, as there are some hairy cases (see some examples and more examples), though there are some tests that exist.

teunbrand commented 2 years ago

the glyphs can be recovered by matching the index column in the output of shape_string

The glyphs cannot be that easily recovered when you have combined characters as in #31. I accidentally figured that out when I was investigating whether I could use {textshaping} to solve both that issue and this one (and in theory it could), but reverse engineering the glyphs isn't as straightforward.

I'm more in favour of letting {textshaping} handle shaping text (if we can) than using a homebrew method, because there are all kinds of exceptions that somebody else has already figured out, it is less of a hassle to maintain, and we can defer responsibility in that department.

The state of that line of investigation is currently as follows. Let's say we have some string that we want to pull through textshaping::shape_text(), and reconstruct (in the real case we'll be extracting position parameters too). This works fine for simple strings.

test <- "Simple string"

reconstruct <- function(string) {
  shape <- textshaping::shape_text(string)$shape
  chars <- strsplit(string, "")[[1]]
  paste0(chars[shape$glyph + 1], collapse = "")
}

reconstruct(test)
#> [1] "Simple string"

This also works fine for RtL text:

test <- "Sarah is \u05e9\u05e8\u05d4 with \u05e9 on R"

reconstruct(test)
#> [1] "Sarah is <U+05D4><U+05E8><U+05E9> with <U+05E9> on R"

However, it seems to break whenever there are compound characters:

test <- "Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa"

reconstruct(test) # prints to console as ê, doesn't reprex well
#> [1] "Composed: ê, DeC: e,  <U+05D0><U+05D0>"

Created on 2021-12-08 by the reprex package (v2.0.1)

It works slightly better* if we use stringi::stri_split_boundaries(string, type = "character") instead of a regular strsplit(), but it isn't perfect either.

* The first two cases work, the third one also has compound characters now, but also some NAs.
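
For reference, that variant is a small tweak to the reconstruct() sketch above (reconstruct2 is just a hypothetical name):

reconstruct2 <- function(string) {
  shape <- textshaping::shape_text(string)$shape
  # split on grapheme boundaries instead of individual code points
  chars <- stringi::stri_split_boundaries(string, type = "character")[[1]]
  paste0(chars[shape$glyph + 1], collapse = "")
}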

AllanCameron commented 2 years ago

I think this is definitely something to explore further, and hopefully we can encapsulate a simple solution nicely inside the measure_text function.

I'm a little apprehensive about introducing a lot of extra overhead (either in terms of processing, extra dependencies or coding effort) for what might be very rare use cases.

there are all kinds of exceptions that somebody else has already figured out

I don't have a great feel for how often a plot really needs a curved label comprising mixed Hebrew and Latin characters, but my guess is we would struggle to find many examples in the wild.

teunbrand commented 2 years ago

but my guess is we would struggle to find many examples in the wild.

Well, I totally agree with you there; it's been a while since I even saw a plot annotated in my own native tongue.

encapsulate a simple solution nicely inside the measure_text function

Apart from some discrepancies (the retrieval of glyphs being the major one), it should theoretically be a sort of drop-in replacement for systemfonts::shape_string().

I'm a little apprehensive about introducing a lot of extra overhead (either in terms of processing, extra dependencies or coding effort) for what might be very rare use cases.

Me too, and that is why it would be great if we can just get this to work with textshaping (meaning we can trade in systemfonts as a dependency).

Do you think it is a good idea if I post a feature request at textshaping to ask whether they can add the character glyphs?

AllanCameron commented 2 years ago

Interestingly, the docs for textshaping::shape_text say that glyph should contain actual glyphs (though it also looks as if the authors have copy/pasted their docs from systemfonts::shape_string). If you look at the code, the line that looks as if it should convert the glyph numbers to characters is commented out. That line certainly doesn't convert the glyph numbers in the output of the C function to glyphs, so the commenting out is obviously purposeful.

Strangely enough, the index seems to be the actual Unicode code point minus 29, so:

intToUtf8(textshaping::shape_text(string)$shape$index + 29, multiple = TRUE)

might work.

AllanCameron commented 2 years ago

It looks like perhaps something like this could work:

index_to_utf8 <- function(ints)
{
  nums <- lapply(ints + 29, function(codepoint) {
    # one-byte (ASCII) range: index + 29 appears to be the code point itself
    if (codepoint <= 0x7f) return(codepoint)

    # two-byte range: an extra offset of 787 appears to be needed before
    # encoding the code point as a two-byte UTF-8 sequence
    if (codepoint <= 0x7ff) 
      return(c(bitwOr(bitwShiftR(codepoint + 787, 6), 0xc0),
               bitwOr(bitwAnd(codepoint + 787, 0x3f), 0x80)))})

  # convert the raw bytes back into a UTF-8 encoded string
  `Encoding<-`(rawToChar(as.raw(unlist(nums))), "UTF-8")
}

Which seems to do the trick of harvesting the glyphs from the index column in shape_text in the right order (on this example at least):

string <- "Sarah is \u05e9\u05e8\u05d4 with ש on R"

string
#> [1] "Sarah is שרה with ש on R"

index_to_utf8(textshaping::shape_text(string)$shape$index)
#> [1] "Sarah is הרש with ש on R"

teunbrand commented 2 years ago

This looks quite promising! I think the docs mention that the index is an index into the font file character table, so I tested all fonts on my machine:

index_to_utf8 <- function(ints)
{
  nums <- lapply(ints + 29, function(codepoint) {
    if (codepoint <= 0x7f) return(codepoint)

    if (codepoint <= 0x7ff) 
      return(c(bitwOr(bitwShiftR(codepoint + 787, 6), 0xc0),
               bitwOr(bitwAnd(codepoint + 787, 0x3f), 0x80)))})

  `Encoding<-`(rawToChar(as.raw(unlist(nums))), "UTF-8")
}

string <- "Sarah is \u05e9\u05e8\u05d4 with ? on R"

# 549 of them
all_fonts <- systemfonts::system_fonts()$path

all_indices <- lapply(all_fonts, function(path) {
  textshaping::shape_text(string, path = path)$shape$index
})

all_strings <- vapply(all_indices, index_to_utf8, character(1))

print(all_strings[[1]]) # Arial (correctly displayed, but not in reprex)
#> [1] "Sarah is <U+05D4><U+05E8><U+05E9> with ? on R"

# Most common cases
head(sort(table(all_strings), decreasing = TRUE))
#> all_strings
#>                                                                Sarah is \035\035\035 with ? on R 
#>                                                                                              296 
#> \035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035 
#>                                                                                               53 
#>                                                                R`q`g hr \035\035\035 vhsg > nm Q 
#>                                                                                               26 
#> 3<U+03E2><U+03F3><U+03E2><U+03E9> <U+03EA><U+03F4> \035\035\035 <U+03F8><U+03EA><U+03F5><U+03E9> <U+05BE> <U+03F0><U+03EF> 2 
#>                                                                                               24 
#> 3<U+03B3>t<U+03B3><U+03BA> <U+03BB><U+03C5> \035\035\035 <U+03C9><U+03BB>f<U+03BA> <U+051B> <U+03C1>p 2 
#>                                                                                               17 
#>                                                    Sarah is <U+05D4><U+05E8><U+05E9> with ? on R 
#>                                                                                               13

Created on 2021-12-09 by the reprex package (v2.0.0)

So the assumptions don't appear to hold for every font. However, we don't need them to hold for every font: we can run the shaping once on a font for which we know they hold, and then we have the glyphs we need to do the shaping on the true font. This is essentially shaping the text twice, but the trade-off seems good, as we no longer measure the text during drawing/window resizing.
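
A rough sketch of that two-pass idea, assuming a reference font for which the index/code-point relationship above holds (Arial here, purely as an assumption) and reusing index_to_utf8() from the previous comment:

reference_path <- systemfonts::match_font("Arial")$path

shape_twice <- function(string, family = "") {
  # pass 1: shape with the reference font only to recover the reordered characters
  idx <- textshaping::shape_text(string, path = reference_path)$shape$index
  reordered <- index_to_utf8(idx)
  # pass 2: shape the reordered string with the font actually requested
  textshaping::shape_text(reordered, family = family)
}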

AllanCameron commented 2 years ago

A more robust method would be to get the font table for, say, Arial, and create a lookup table. I found a parseable JSON version online.

library(jsonlite)
library(textshaping)

url <- paste0("http://zuga.net/articles/",
              "unicode-all-characters-supported-by-the-font-arial/",
              "d/p/1000-x.json?v=41")

codes <- read_json(url)
codes <- codes[-(1:2)]
codes <- lapply(codes, function(x) x$I)
glyphs <- unlist(sapply(codes, function(y) sapply(y, function(x) x$P)))
index <- unlist(sapply(codes, function(y) sapply(y, function(x) x$G)))
lookup <- numeric(max(index))
lookup[index] <- intToUtf8(glyphs, multiple = TRUE)

string <- "Sarah is \u05e9\u05e8\u05d4 with \u05e9 on R"

string
#> [1] "Sarah is שרה with ש on R"

paste0(lookup[shape_text(string)$shape$index], collapse = "")
#> [1] "Sarah is הרש with ש on R"

We could have the lookup data frame hard-coded as an R file in the package.
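
For example (a sketch, assuming the lookup vector built above), it could be written out during development and shipped with the package:

# write the lookup to an R source file...
dump("lookup", file = "R/lookup-table.R")
# ...or store it as internal package data (R/sysdata.rda) via usethis
usethis::use_data(lookup, internal = TRUE, overwrite = TRUE)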

teunbrand commented 2 years ago

Yes, that does make sense to me. The only caveat I can spot is that not every operating system has the Arial font; for example, a unit test in {systemfonts} also tests different fonts per system. A possible remedy is to include a font ourselves with this package that can act as a reference font. We'd have to carefully read the licence of any font we consider for that purpose, because not all are free to distribute for personal and commercial use.

One candidate is the Roboto font, which is relatively popular and has an Apache v2.0 license that (this is not legal advice, I'm no lawyer) appears to be pretty permissive. A precedent for including fonts is the hrbrthemes package.

AllanCameron commented 2 years ago

Actually, we may be trying to reinvent the wheel here. systemfonts::glyph_info(string)$index gives you the index of each glyph in string, so doing something like systemfonts::glyph_info(intToUtf8(1:64000, multiple = TRUE)) gives you the full Unicode lookup table for glyphs in whichever font you choose.
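
A sketch of that approach (assuming the indices returned by systemfonts::glyph_info() and textshaping::shape_text() agree for the same font; the surrogate range is skipped because intToUtf8() cannot encode it):

codepoints <- c(32:55295, 57344:64000)
chars <- intToUtf8(codepoints, multiple = TRUE)
info  <- systemfonts::glyph_info(chars)

# build a glyph-index -> character lookup; index 0 means the glyph is absent
lookup <- character(max(info$index, na.rm = TRUE))
found  <- info$index > 0
lookup[info$index[found]] <- chars[found]

string <- "Sarah is \u05e9\u05e8\u05d4 with \u05e9 on R"
paste0(lookup[textshaping::shape_text(string)$shape$index], collapse = "")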

teunbrand commented 2 years ago

That seems pretty handy! We might simply cache the result of that until a new font is requested. But it does mean that we can't trade in systemfonts for textshaping (which I'm ok with).
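
A minimal sketch of such a cache, with hypothetical names and building the lookup as in the previous sketch:

.lookup_cache <- new.env(parent = emptyenv())

cached_glyph_lookup <- function(family = "") {
  key <- if (nzchar(family)) family else ".default"
  if (!exists(key, envir = .lookup_cache, inherits = FALSE)) {
    chars <- intToUtf8(c(32:55295, 57344:64000), multiple = TRUE)
    info  <- systemfonts::glyph_info(chars, family = family)
    lookup <- character(max(info$index, na.rm = TRUE))
    found  <- info$index > 0
    lookup[info$index[found]] <- chars[found]
    assign(key, lookup, envir = .lookup_cache)
  }
  get(key, envir = .lookup_cache, inherits = FALSE)
}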

AllanCameron commented 2 years ago

We can't trade in systemfonts for textshaping anyway: systemfonts is a dependency of textshaping, so textshaping would be an addition rather than a substitution. If we accept that we want bidi enough to need the overhead of an extra package, then textshaping is fine, since it has a minimal dependency chain (just cpp11, which doesn't have any strict dependencies of its own).

teunbrand commented 2 years ago

I'm in favour of adopting textshaping because not only does it do bidi text, it also handles composite characters and perhaps other finicky things that I am keen not to have to worry about myself. Also, your investigations make it more feasible than I realized at first.

teunbrand commented 2 years ago

Alright, this should be fixed now, so I'll be closing this.

AllanCameron commented 2 years ago

It looks like the design decisions here were vindicated, @teunbrand: see this Twitter link