Closed: byteit101 closed this issue 2 years ago
Yes, this is indeed an issue. We can use the textshaping library to do bidirectional text (it uses fribidi for that), but we'd need to figure out how to reliably translate glyphs back into strings.
In the example below, glyphs 9-16 are flipped because they form a run of RTL text.
test <- "Sarah is \u05e9\u05e8\u05d4, \u05e9\u05e8\u05d4 with \u05e9 on R"
x <- textshaping::shape_text(test)$shape
x[9:18, ]
#> # A tibble: 10 x 7
#>    glyph index metric_id string_id x_offset y_offset x_midpoint
#>    <int> <int>     <int>     <int>    <dbl>    <dbl>      <dbl>
#>  1     8     3         1         1     44.0        0       1.66
#>  2    16   676         1         1     47.3        0       3.61
#>  3    15   696         1         1     54.6        0       3.05
#>  4    14   697         1         1     60.7        0       4.16
#>  5    13     3         1         1     69          0       1.66
#>  6    12    15         1         1     72.3        0       1.66
#>  7    11   676         1         1     75.7        0       3.61
#>  8    10   696         1         1     82.9        0       3.05
#>  9     9   697         1         1     89.0        0       4.16
#> 10    17     3         1         1     97.3        0       1.66
Created on 2021-12-08 by the reprex package (v2.0.1)
Alternatively, we could homebrew some simple flipping mechanism for stretches of RTL text, for which you could get some class information like below:
ints <- utf8ToInt(enc2utf8(string))
bidi <- Unicode::u_char_properties(ints, "Bidi_Class")$Bidi_Class
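A minimal sketch of what such a homebrew flip could look like. The `flip_runs()` helper is hypothetical, and the hand-made RTL mask below stands in for the real `Bidi_Class` lookup (which would use the Unicode package as shown above):

```r
# Reverse each maximal run of RTL codepoints. `rtl` is a logical vector
# marking which codepoints are right-to-left. This deliberately ignores
# the full BiDi algorithm (neutrals, numbers, embeddings, ...).
flip_runs <- function(ints, rtl) {
  runs   <- rle(rtl)
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  for (i in which(runs$values)) {
    ints[starts[i]:ends[i]] <- rev(ints[starts[i]:ends[i]])
  }
  ints
}

string <- "ab \u05e9\u05e8\u05d4 cd"
ints   <- utf8ToInt(enc2utf8(string))
# In the real version the mask would come from
# Unicode::u_char_properties(ints, "Bidi_Class")$Bidi_Class %in% c("R", "AL");
# here we mark the Hebrew block by hand for illustration.
rtl <- ints >= 0x0590 & ints <= 0x05FF
intToUtf8(flip_runs(ints, rtl))
```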
Either way, we'd end up with an extra dependency if we go this route. Any thoughts about this, Alan?
Yes, this is down to the mechanism of how we calculate glyph positions via systemfonts::shape_string, which doesn't seem to handle RTL text. If the string is passed as a whole block to grid (e.g. by setting keep_straight = TRUE), it renders correctly:
ggplot(iris, aes(x = Sepal.Length)) +
  geom_textpath(aes(label = "Sarah is \u05e9\u05e8\u05d4 with ש on R"),
                stat = "density", vjust = -0.2, size = 8, fontface = 1,
                keep_straight = TRUE) +
  ylim(0.1, 0.5) +
  annotate("text", label = "expect: Sarah is \u05e9\u05e8\u05d4 with ש on R",
           y = 0.2, x = 5.75)
The easiest way to do this would be to add a dependency on textshaping, which will automatically order characters correctly (the glyphs can be recovered by matching the index column in the output of shape_string), but I prefer the homebrew method using the Unicode package, since it is more flexible for future development.
Alternatively, we could homebrew some simple flipping mechanism for stretches of RTL text, for which you could get some class information like below:
I would like to caution you against re-implementing the BiDi algorithm, as there are some hairy cases (see some examples and more examples), though some conformance tests do exist.
the glyphs can be recovered by matching the index column in the output of shape_string
The glyphs cannot be that easily recovered when you have combined characters as in #31. I accidentally figured that out when I was investigating whether I could use {textshaping} to solve both that issue and this one -- and in theory it could -- but reverse engineering the glyphs isn't as straightforward.
I'm more in favour of letting {textshaping} handle shaping text (if we can) than using a homebrew method because there are all kinds of exceptions that somebody else has already figured out, it is less of a hassle to maintain and we can defer responsibility in that department.
The state of that line of investigation is currently as follows. Let's say we have some string that we want to pull through textshaping::shape_text() and reconstruct (in the real case we'll be extracting position parameters too). This works fine for simple strings.
test <- "Simple string"
reconstruct <- function(string) {
  shape <- textshaping::shape_text(string)$shape
  chars <- strsplit(string, "")[[1]]
  paste0(chars[shape$glyph + 1], collapse = "")
}
reconstruct(test)
#> [1] "Simple string"
This also works fine for RTL text:
test <- "Sarah is \u05e9\u05e8\u05d4 with \u05e9 on R"
reconstruct(test)
#> [1] "Sarah is <U+05D4><U+05E8><U+05E9> with <U+05E9> on R"
However, it seems to break whenever there are compound characters:
test <- "Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa"
reconstruct(test) # prints to console as ê, doesn't reprex well
#> [1] "Composed: ê, DeC: e, <U+05D0><U+05D0>"
Created on 2021-12-08 by the reprex package (v2.0.1)
It works slightly better* if we use stringi::stri_split_boundaries(string, type = "character") instead of a regular strsplit(), but it isn't perfect either.
* The first two cases work; the third one now also has the compound characters, but also some NAs.
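For reference, the grapheme-based variant mentioned above might look like the sketch below. It needs {textshaping} and {stringi}, and as noted it still isn't perfect:

```r
# Sketch: reconstruct(), but splitting on grapheme clusters rather than
# single characters, so composed sequences like "e\u0302" stay together.
reconstruct2 <- function(string) {
  shape <- textshaping::shape_text(string)$shape
  chars <- stringi::stri_split_boundaries(string, type = "character")[[1]]
  paste0(chars[shape$glyph + 1], collapse = "")
}
```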
I think this is definitely something to explore further, and hopefully we can encapsulate a simple solution nicely inside the measure_text function.
I'm a little apprehensive about introducing a lot of extra overhead (either in terms of processing, extra dependencies or coding effort) for what might be very rare use cases.
there are all kinds of exceptions that somebody else has already figured out
I don't have a great feel for how often a plot really needs a curved label comprising mixed Hebrew and Latin characters, but my guess is we would struggle to find many examples in the wild.
but my guess is we would struggle to find many examples in the wild.
Well, I totally agree with you there; it's been a while since I even saw a plot annotated in my own native tongue.
encapsulate a simple solution nicely inside the measure_text function
Apart from some discrepancies (the retrieval of glyphs being the major one), it should theoretically be a sort of drop-in replacement for systemfonts::shape_string().
I'm a little apprehensive about introducing a lot of extra overhead (either in terms of processing, extra dependencies or coding effort) for what might be very rare use cases.
Me too, and that is why it would be great if we can just get this to work with textshaping (meaning we can trade in systemfonts as a dependency).
Do you think it is a good idea if I post a feature request at textshaping to ask whether they can add the character glyphs?
Interestingly, the docs for textshaping::shape_text say that glyph should contain actual glyphs (though it also looks as if the authors have copy/pasted their docs from systemfonts::shape_string). If you look at the code, the line that looks as if it should convert the glyph numbers to characters is commented out. That line certainly doesn't convert the glyph number in the output of the C function to glyphs, so the commenting out is obviously purposeful.
Strangely enough, the index seems to be the actual Unicode code point minus 29, so:
intToUtf8(textshaping::shape_text(string)$shape$index + 29, multiple = TRUE)
might work.
It looks like perhaps something like this could work:
index_to_utf8 <- function(ints) {
  nums <- lapply(ints + 29, function(codepoint) {
    if (codepoint <= 0x7f) return(codepoint)
    if (codepoint <= 0x7ff) {
      return(c(bitwOr(bitwShiftR(codepoint + 787, 6), 0xc0),
               bitwOr(bitwAnd(codepoint + 787, 0x3f), 0x80)))
    }
  })
  `Encoding<-`(rawToChar(as.raw(unlist(nums))), "UTF-8")
}
Which seems to do the trick of harvesting the glyphs from the index column in shape_text in the right order (on this example at least):
string <- "Sarah is \u05e9\u05e8\u05d4 with ש on R"
string
#> [1] "Sarah is שרה with ש on R"
index_to_utf8(textshaping::shape_text(string)$shape$index)
#> [1] "Sarah is הרש with ש on R"
This looks quite promising! I think the docs mention that the index is an index into the font file character table, so I tested all fonts on my machine:
index_to_utf8 <- function(ints) {
  nums <- lapply(ints + 29, function(codepoint) {
    if (codepoint <= 0x7f) return(codepoint)
    if (codepoint <= 0x7ff) {
      return(c(bitwOr(bitwShiftR(codepoint + 787, 6), 0xc0),
               bitwOr(bitwAnd(codepoint + 787, 0x3f), 0x80)))
    }
  })
  `Encoding<-`(rawToChar(as.raw(unlist(nums))), "UTF-8")
}
string <- "Sarah is \u05e9\u05e8\u05d4 with ? on R"
# 549 of them
all_fonts <- systemfonts::system_fonts()$path
all_indices <- lapply(all_fonts, function(path) {
  textshaping::shape_text(string, path = path)$shape$index
})
all_strings <- vapply(all_indices, index_to_utf8, character(1))
print(all_strings[[1]]) # Arial (correctly displayed, but not in reprex)
#> [1] "Sarah is <U+05D4><U+05E8><U+05E9> with ? on R"
# Most common cases
head(sort(table(all_strings), decreasing = TRUE))
#> all_strings
#> Sarah is \035\035\035 with ? on R
#> 296
#> \035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035
#> 53
#> R`q`g hr \035\035\035 vhsg > nm Q
#> 26
#> 3<U+03E2><U+03F3><U+03E2><U+03E9> <U+03EA><U+03F4> \035\035\035 <U+03F8><U+03EA><U+03F5><U+03E9> <U+05BE> <U+03F0><U+03EF> 2
#> 24
#> 3<U+03B3>t<U+03B3><U+03BA> <U+03BB><U+03C5> \035\035\035 <U+03C9><U+03BB>f<U+03BA> <U+051B> <U+03C1>p 2
#> 17
#> Sarah is <U+05D4><U+05E8><U+05E9> with ? on R
#> 13
Created on 2021-12-09 by the reprex package (v2.0.0)
So the assumptions don't appear to hold for every font. However, we don't need these assumptions to hold for every font, because we can use the shaping once on a font for which we know it works, and then have the glyphs we need to do shaping on the true font. This is essentially shaping the text twice, but I think the trade-off seems good as we no longer measure the text during drawing/window resizing time.
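The two-pass idea could be sketched roughly as below. `shape_bidi()` is a hypothetical helper, and it leans on the `index_to_utf8()` hack above holding for the chosen reference font:

```r
# Sketch: shape once with a reference font whose index -> codepoint mapping
# we understand, recover the reordered string, then shape that string with
# the font the user actually asked for.
shape_bidi <- function(string, ref_path, user_path) {
  ref_idx   <- textshaping::shape_text(string, path = ref_path)$shape$index
  reordered <- index_to_utf8(ref_idx)  # index_to_utf8() as defined earlier
  textshaping::shape_text(reordered, path = user_path)$shape
}
```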
A more robust method would be to get the font table for, say, Arial, and create a lookup table. I found a parseable json version online.
library(jsonlite)
library(textshaping)
url <- paste0("http://zuga.net/articles/",
              "unicode-all-characters-supported-by-the-font-arial/",
              "d/p/1000-x.json?v=41")
codes <- read_json(url)
codes <- codes[-(1:2)]
codes <- lapply(codes, function(x) x$I)
glyphs <- unlist(sapply(codes, function(y) sapply(y, function(x) x$P)))
index <- unlist(sapply(codes, function(y) sapply(y, function(x) x$G)))
lookup <- numeric(max(index))
lookup[index] <- intToUtf8(glyphs, multiple = TRUE)
string <- "Sarah is \u05e9\u05e8\u05d4 with \u05e9 on R"
string
#> [1] "Sarah is שרה with ש on R"
paste0(lookup[shape_text(string)$shape$index], collapse = "")
#> [1] "Sarah is הרש with ש on R"
We could have the lookup table hard-coded as an R file in the package.
Yes, that does make sense to me; the only caveat I can spot is that not every operating system has the Arial font. For example, a unit test in {systemfonts} also tests different fonts per system. A possible remedy is to include a font ourselves with this package, which can act as a reference font. We'd have to carefully read the licence of any font we consider for that purpose, because not all are free to distribute for personal and commercial use.
One candidate is the Roboto font which is relatively popular, has an Apache v2.0 license that (this is no legal advice, I'm no lawyer) appears to be pretty permissive. A precedent for including fonts is in the hrbrthemes package.
Actually, we may be trying to reinvent the wheel here. systemfonts::glyph_info(string)$index gives you the index of each glyph in string, so doing something like systemfonts::glyph_info(intToUtf8(1:64000, multiple = TRUE)) gives you the full Unicode lookup table for glyphs in whichever font you choose.
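A per-font cache of that lookup might look roughly like this. This is an untested sketch: `.glyph_cache` and `glyph_lookup()` are hypothetical names, and I'm assuming `glyph_info()` returns an `index` column as documented:

```r
.glyph_cache <- new.env(parent = emptyenv())

# Build (once per font family) a vector mapping glyph index -> character.
glyph_lookup <- function(family = "") {
  key <- if (nzchar(family)) family else "<default>"
  if (is.null(.glyph_cache[[key]])) {
    chars <- intToUtf8(1:64000, multiple = TRUE)
    chars <- chars[!is.na(chars)]  # drop unrepresentable codepoints (surrogates)
    info  <- systemfonts::glyph_info(chars, family = family)
    lookup <- character(max(info$index) + 1)
    lookup[info$index + 1] <- chars  # +1 because glyph indices start at 0
    .glyph_cache[[key]] <- lookup
  }
  .glyph_cache[[key]]
}
```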
That seems pretty handy! We might simply cache the result of that until a new font is requested. But it does mean that we can't trade in systemfonts for textshaping (which I'm OK with).
We can't trade in systemfonts for textshaping anyway - systemfonts is a dependency of textshaping, so textshaping would be an addition rather than a substitution. If we accept that we want bidi enough that we need the overhead of an extra package, then textshaping is fine, since it has a minimal dependency chain (just cpp11, which doesn't have any strict dependencies of its own).
I'm in favour of adopting textshaping because not only does it do bidi text, it also handles composite characters and perhaps other finicky things that I am keen not to have to worry about myself. Also, your investigations make it more feasible than I realized at first.
Alright this should be fixed now, so I'll be closing this.
It looks like the design decisions here were vindicated @teunbrand - see this twitter link
Code
Expected
Text should show "Sarah is שרה with ש on R" on a line, exactly as shown in the normal text geom at the bottom
Actual
Notes
Hebrew is used here because it's the example at the top of the BiDi Wikipedia page, but all RTL languages are affected, including override marks.