danbernier / WordCram

open-source word clouds for Processing
http://wordcram.org
Apache License 2.0

Arabic text is not displayed properly. #37

Closed faridoon closed 11 years ago

faridoon commented 11 years ago

WordCram works well with many languages, but when it comes to Arabic, it displays the text from left to right, whereas it should run from right to left. I hope someone looks into this issue.

danbernier commented 11 years ago

Hebrew is the other big RTL language, right?

Wikipedia's RTL article currently lists these scripts as RTL:

Arabic, Avestan, Cypriot, Hebrew, Imperial Aramaic, Kharosthi, Lydian, Mandaic, N'Ko, Old South Arabian, Old Turkic, Pahlavi, Phoenician, Samaritan, Syriac, Thaana, Umbrian

cue.lang supports these languages:

Arabic, Catalan, Croatian, Czech, Dutch, Danish, English, Esperanto, Farsi, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Italian, Latin, Norwegian, Polish, Portuguese, Romanian, Russian, Slovenian, Slovak, Spanish, Swedish, Turkish

The only overlap is Arabic & Hebrew. I think I could put in a check to see if cue detects either of those, and render RTL. Does that sound about right?
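A minimal sketch of that check, assuming the detected language is available as a plain name string. The class `RtlLanguages` and its method are hypothetical, not part of WordCram or cue.language, and the actual call into cue.language is left out. Farsi is included as an assumption, since it is written in the Arabic script:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of WordCram or cue.language.
final class RtlLanguages {
    // Names taken from the cue.lang language list quoted above.
    // Farsi is an assumption here: it is written in the Arabic script.
    private static final Set<String> RTL =
            new HashSet<>(Arrays.asList("arabic", "hebrew", "farsi"));

    // Returns true if the detected language name should be rendered right to left.
    static boolean isRightToLeft(String detectedLanguage) {
        if (detectedLanguage == null) {
            return false; // no guess available: fall back to LTR
        }
        return RTL.contains(detectedLanguage.trim().toLowerCase(Locale.ROOT));
    }
}
```

The renderer would call `RtlLanguages.isRightToLeft(...)` once per body of text and pick the drawing direction from the result.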

danbernier commented 11 years ago

Oh, here's the cue.lang readme: https://github.com/vcl/cue.language

faridoon commented 11 years ago

In one of the WordCram examples (fromWebPage), I tested the Arabic Wikipedia's homepage. Unfortunately, the words are still printed from left to right. Below is a screenshot:

[screenshot, 2013-07-15 at 10:40:58 PM]

And the same problem persists with Farsi:

[screenshot, 2013-07-15 at 10:42:10 PM]

danbernier commented 11 years ago

@faridoon, take a look at these PNGs I generated - the first is without the fix, the second is with. Let me know if everything looks good, and I'll merge this into rel060, so it'll be in the next release.

(Actually, I just noticed the LTR words in there are backwards, hah! Like "hsilgnE". I think that's ok.)

Buggy:

[image: old arabic wordcram]

Fixed:

[image: new arabic wordcram]

faridoon commented 11 years ago

Thank you, Dan. It's great. Isn't it possible to render RTL and LTR scripts separately in the sketch? I think one can ignore the LTR words inside a big wordcram of an RTL language (e.g. Arabic or Farsi). But if there are two languages of equal weight in the wordcram, they wouldn't look good.

When can we expect the next release?

danbernier commented 11 years ago

Awesome! I'll merge this into the 060 release branch now.

WordCram treats the text as one whole body of text, and uses cue.lang to guess which language it's in (the method is literally named guess). It's probabilistic, and works better on larger sets, so, on individual words, it'd probably guess wrong much of the time.

(But you've got me thinking now - maybe RTL can be a property of the word, so you can set them individually. I really didn't like the way I fixed this RTL bug - this suggests a cleaner path, that gives users more control. Great suggestion!)
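One way a per-word RTL property could be derived, as a rough sketch rather than how WordCram does or will do it: instead of guessing a language, look at the Unicode directionality of the word's first strongly directional character. The class `WordDirection` is hypothetical; `Character.getDirectionality` and the `DIRECTIONALITY_*` constants are standard Java:

```java
// Hypothetical helper, not part of WordCram.
final class WordDirection {
    // Returns true if the first strongly directional character in the word
    // is right-to-left (e.g. Hebrew or Arabic script).
    static boolean isRightToLeft(String word) {
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                return true;
            }
            if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                return false;
            }
            // Digits, punctuation and other neutral characters: keep looking.
            i += Character.charCount(cp);
        }
        return false; // no strong character found: default to LTR
    }
}
```

With a flag like this set per word, an English word inside an Arabic word cloud would stay LTR instead of coming out reversed, and users could still override the flag by hand.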

We're working on the next release now, and I'd love to have it done this summer. You can track our progress on this pull request: https://github.com/danbernier/WordCram/pull/38