foliojs / pdfkit

A JavaScript PDF generation library for Node and the browser
http://pdfkit.org/
MIT License

Right-to-Left (RTL) support for Hebrew and Arabic #219

Open boustanihani opened 10 years ago

boustanihani commented 10 years ago

Please add Right-to-Left (RTL) support for languages like Hebrew and Arabic...

Something like:

doc.rtl(true);

doc.text('...', {rtl: true});

devongovett commented 10 years ago

I don't know much about RTL languages, but it seems to me that you could reverse the string and align to the right to get this working (I'm probably wrong here). However, if the text contains a combination of LTR and RTL text, then we'll need an implementation of the Unicode Bidi Algorithm. Those who know more than I do, please fill me in. I'd love to see this implemented so PDFKit is more widely usable.

devongovett commented 10 years ago

A separate issue is vertical text support (e.g. Japanese), which I'd also like to see and which has its own challenges.

boustanihani commented 10 years ago

Arabic also has its own challenges because letters get a different shape depending on their position in a word (beginning, middle, end) so this is anything but easy :)

devongovett commented 10 years ago

Interesting, I assume there is some sort of algorithm out there to determine this? Starting to sound like a lot of work.

etodanik commented 10 years ago

I'm just parachuting in, but isn't this something like what you need: https://github.com/mathiasbynens/node-unicode-data

Why re-implement unicode algorithms?

EDIT: Wait a moment, this seems to be quite far from what's needed, my bad. But isn't there a ready implementation?

devongovett commented 10 years ago

No, that's just unicode character metadata, not any actual algorithms. RTL support will require an implementation of the Unicode Bidi Algorithm. Shaping of Arabic text with contextual substitutions is a separate problem to solve.

devongovett commented 10 years ago

Yeah, this library from Twitter might work but I haven't tried it.

etodanik commented 10 years ago

I'll go ahead and fork pdfkit, and see what I can come up with. Any pointers for where to start and how you'd approach it?

devongovett commented 10 years ago

I'd try Twitter's library and see if it produces the results you expect. Sorry for being so ignorant on this, but does it work to run the text through that library, then send the result to the PDFKit doc.text method?

etodanik commented 10 years ago

I found something that might be even more to the point: https://github.com/cscott/node-icu-bidi

devongovett commented 10 years ago

Yeah, the problem is that node-icu-bidi is a node C++ module, but PDFKit also works in the browser, so everything must be pure JavaScript. If it works for your needs, feel free to use it, but PDFKit won't take on a non-JS dependency.

etodanik commented 10 years ago

I understand. So an acceptable solution would be to extract the bidi algorithm from the Twitter library, correct?

yelouafi commented 9 years ago

Maybe I'm late to the party, but I just wanted to mention that I implemented a similar solution in DOS a long time ago (with old-fashioned 16x16 bitmap fonts), and I think the same approach can be applied here:

1- Reorder the input string using the bidi algorithm.
2- Reshape by applying single-glyph substitution depending on the context (beginning, middle, or end of the word, or standalone glyph).
3- Apply ligatures.
4- Invert the alignment (possibly using an RTL flag). If this is supported, a more appropriate naming for the alignment options would be leading/trailing instead of right/left.

Steps 1 and 4 are the 'easy parts'; 2 and 3 are another story. For OpenType fonts, I think there is a GSUB table that can be used for this, but for other font types I think the only option is to implement the specific algorithm for each script (as you said, this is a lot of work).

yelouafi commented 9 years ago

It seems another solution to Arabic shaping is 'text-based shaping', which transforms characters at the string level rather than at the glyph level (further details are there). It also seems there is already an implementation of this kind in JavaScript by the ibm-js team. From the sources, it appears that the text engine performs a series of operations at the character level:

1- Bidi reordering
2- Text shaping (AFAIK applies only to Arabic scripts)
3- Symmetrical swapping (replace characters such as [ and ( with their mirrored counterparts ] and ) in RTL text)
4- Number shaping (replace Western Arabic digits 0, 1, 2 ... with their Eastern Arabic counterparts ٠, ١, ٢ ...)

This could also serve as a fallback for non-OpenType fonts, which don't have a GSUB table.
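To illustrate the text-based shaping idea, here is a minimal sketch that picks a positional form for each letter from the Unicode Arabic Presentation Forms-B block. It is not the ibm-js implementation: the four-letter table, the `shapeArabic` name, and the joining test are assumptions made for illustration, and ligatures such as lam-alef are ignored. It works on logical-order text, so it would run before any reordering.

```javascript
// Demo table (an assumption, far from complete):
// base letter -> [isolated, final, initial, medial] presentation forms.
// null means the letter has no such form because it never joins to the next letter.
const FORMS = new Map([
  ['\u0627', ['\uFE8D', '\uFE8E', null, null]],         // ALEF
  ['\u0628', ['\uFE8F', '\uFE90', '\uFE91', '\uFE92']], // BEH
  ['\u0644', ['\uFEDD', '\uFEDE', '\uFEDF', '\uFEE0']], // LAM
  ['\u0645', ['\uFEE1', '\uFEE2', '\uFEE3', '\uFEE4']], // MEEM
]);

// True if the letter can connect to the letter that follows it.
const joinsForward = (ch) => FORMS.has(ch) && FORMS.get(ch)[2] !== null;

function shapeArabic(logicalText) {
  const chars = Array.from(logicalText);
  return chars
    .map((ch, i) => {
      const forms = FORMS.get(ch);
      if (!forms) return ch; // not in the demo table: leave untouched
      const joinsPrev = i > 0 && joinsForward(chars[i - 1]);
      const joinsNext = joinsForward(ch) && FORMS.has(chars[i + 1]);
      if (joinsPrev && joinsNext) return forms[3]; // medial
      if (joinsPrev) return forms[1];              // final
      if (joinsNext) return forms[2];              // initial
      return forms[0];                             // isolated
    })
    .join('');
}
```

For the word beh-alef-beh, the sketch yields initial beh, final alef, and isolated beh, because alef does not connect to the letter after it. A real shaper needs the full joining-type data for every letter.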

devongovett commented 8 years ago

Getting closer. With v0.8.0 the font engine changed to fontkit, which supports an Arabic shaper (e.g. @yelouafi's steps 2 and 3). Still need to implement the bidi algorithm for mixed script text.

soryy708 commented 7 years ago

If you prioritize bidi reordering and symmetrical swapping, that is enough for Hebrew support. While Hebrew technically has characters that look different when they're at the end of a word, you shouldn't care about it because Unicode defines them as separate characters. Text and number shaping can be added later for Arabic support.
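A rough sketch of what symmetrical swapping means in practice: when a purely RTL run is reversed for a renderer that only draws left to right, paired punctuation has to be swapped as well, otherwise "(" and ")" end up facing the wrong way. The `MIRROR` table and the `reverseRtlRun` helper below are hypothetical names, and the table covers only a few ASCII pairs rather than the full list of mirrored characters in Unicode.

```javascript
// A few mirrored pairs (assumption: ASCII brackets only, for illustration).
const MIRROR = new Map([
  ['(', ')'], [')', '('],
  ['[', ']'], [']', '['],
  ['{', '}'], ['}', '{'],
  ['<', '>'], ['>', '<'],
]);

// Reverse a run that is known to be entirely RTL, swapping paired
// punctuation so that it still encloses the same text visually.
function reverseRtlRun(logicalRun) {
  return Array.from(logicalRun)
    .reverse()
    .map((ch) => MIRROR.get(ch) ?? ch)
    .join('');
}
```

Without the swap, a parenthesised Hebrew word would come out as ")...(" after reversal.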

StigP1337 commented 7 years ago

I found the following information related to this topic. Python Arabic Reshaper is a library that can be used when native Arabic support is not available. Its readme contains a good explanation of the issue and the solution. The library has been ported to JavaScript.

On the BIDI topic I found this test program written in Javascript.

sayamqazi commented 7 years ago

There are GSUB (glyph substitution) tables in font files for complex scripts. This link explains those tables with examples. https://www.microsoft.com/typography/otfntdev/arabicot/features.aspx

mohanagy commented 7 years ago

PDFKit still has a problem with RTL. Any updates? @devongovett

setpixel commented 6 years ago

Hi! @devongovett any update on RTL support? Question 2, is this project dead?

setpixel commented 6 years ago

Please don't be dead :(

aminify commented 6 years ago

@setpixel I needed this too, but since it doesn't sound like they have added this feature, I wanted to let you know that I found jsPDF really useful. They support Arabic now.

ninbit commented 5 years ago

pdfkit has more functionality than jsPDF. jsPDF doesn't have full Unicode support, but pdfkit does. The project and its committers deserve the praise. For RTL, right-aligned text works very well. However, things change when we want to use columns: the need is simply to start from the right-most column and proceed to the left-most one. @devongovett I think we don't need anything except this, because RTL text already has its RTL order, so there is no need to reverse the strings (same for LTR inside RTL).
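The right-to-left column order described here comes down to simple arithmetic. The sketch below computes the x origin of each column in filling order. `columnOrigins` and its parameter names are hypothetical, not part of PDFKit; the assumption is that each column would then be drawn separately at the returned position.

```javascript
// Compute the left edge of each column, in the order the columns should be filled.
// left: x of the content box, width: total content width,
// columns: number of columns, gap: space between columns.
function columnOrigins({ left, width, columns, gap }, direction = 'ltr') {
  const columnWidth = (width - gap * (columns - 1)) / columns;
  const xs = Array.from({ length: columns }, (_, i) => left + i * (columnWidth + gap));
  // For RTL, the first column to fill is the right-most one.
  return direction === 'rtl' ? xs.reverse() : xs;
}
```

With a 320-unit-wide box, three columns, and a 10-unit gap, the RTL filling order starts at x = 220 and ends at x = 0.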

etodanik commented 5 years ago

RTL is much more than right aligned text. There’s the issue of comma and dot positions, and what happens when LTR stuff like numbers and English text are mixed in a sentence.

mayassalman commented 5 years ago

[Attached image: "Capture"]

andreialecu commented 5 years ago

I was able to manually handle Hebrew with code like:

npm install twitter_cldr

import * as TwitterCldrLoader from "twitter_cldr";

const TwitterCldr = TwitterCldrLoader.load("en");

class ... {
  private isHebrew(text: string) {
    var position = text.search(/[\u0590-\u05FF]/);
    return position >= 0;
  }

  private maybeRtlize(text: string) {
    if (this.isHebrew(text)) {
      var bidiText = TwitterCldr.Bidi.from_string(text, { direction: "RTL" });
      bidiText.reorder_visually();
      return bidiText.toString();
    } else {
      return text;
    }
  }
}

Just pass all text that may be in Hebrew through the maybeRtlize function.

It's not perfect and I only tested it for Hebrew, but it seems to work pretty well. If you also need right alignment, use something like isHebrew(myText) ? { align: "right" } : null for alignment.

The problem is that if the text wraps onto multiple lines, the first word of the text will be on the last line, which is wrong. There needs to be more logic added to handle line breaks.
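One way to approach the wrapping problem, sketched under assumptions: break the text into lines while it is still in logical order, and only then reorder each line on its own, which is also the order of operations the Unicode bidi algorithm prescribes. `wrapThenReorder` is a hypothetical helper. `measure` stands in for a width function (PDFKit's `doc.widthOfString` could play that role) and `reorderLine` for whatever bidi routine is in use; both are supplied by the caller.

```javascript
// Greedy word wrap on logical-order text, followed by per-line reordering.
function wrapThenReorder(logicalText, maxWidth, measure, reorderLine) {
  const lines = [];
  let current = '';
  for (const word of logicalText.split(/\s+/).filter(Boolean)) {
    const candidate = current ? `${current} ${word}` : word;
    if (current && measure(candidate) > maxWidth) {
      lines.push(current); // line is full: start a new one with this word
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) lines.push(current);
  // Reorder each line independently so the first logical word stays on the first line.
  return lines.map(reorderLine);
}
```

Each returned line would then be drawn separately, right-aligned, with automatic wrapping turned off so the renderer does not re-break it.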

malikfaiq commented 4 years ago

Is there any solution for supporting Urdu and Arabic in pdfkit yet?

FlandersBurger commented 4 years ago

Simply reversing the text before it goes to pdfkit seems to work for both Hebrew and Arabic (I'm just eyeballing the text however since I speak neither)

const isHebrew = (text) => {
  return text.search(/[\u0590-\u05FF]/) >= 0;
};

const isArabic = (text) => {
  return text.search(/[\u0600-\u06FF]/) >= 0;
};

const rightToLeftText = (text) => {
  if (isHebrew(text) || isArabic(text)) {
    return text.split(' ').reverse().join(' ');
  } else {
    return text;
  }
};

rightToLeftText('أنا أتحدث اللغة العربية');
rightToLeftText('אני מדברת עברית');

soryy708 commented 4 years ago

You have to pay attention that Arabic script has symbols that combine with neighboring symbols based on what they are and where they are.

soryy708 commented 4 years ago

return text.split(' ').reverse().join(' ');

What of non-RTL text? Like a mix between Hebrew and numbers. Or English and Arabic. Consider the following example:

יש לי 500 tokenים של globus2000. By the way, looks like GitHub is treating this wrong xD

FlandersBurger commented 4 years ago

Fair point, but it doesn't actually apply in my use case. Perhaps splitting it into RTL and LTR chunks and then only reversing the RTL chunks would work? Worth a shot, especially since none of the other solutions in here worked for me.
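The chunking idea can be sketched as follows, for a paragraph whose base direction is RTL: reverse the whole string, then restore every run of Latin letters and digits to its original order. This is a simplification and not the Unicode bidi algorithm. `naiveVisualOrder` and the `LTR_RUN` pattern are assumptions; multi-word Latin phrases, punctuation, and combining marks are not handled correctly.

```javascript
// Runs that must keep their internal left-to-right order
// (assumption: ASCII letters and digits only).
const LTR_RUN = /[A-Za-z0-9]+/g;

const reverseChars = (s) => Array.from(s).reverse().join('');

// Convert logical order to a left-to-right drawing order for an RTL paragraph.
function naiveVisualOrder(logicalText) {
  // Step 1: flip everything, which puts RTL letters and chunk order right.
  const flipped = reverseChars(logicalText);
  // Step 2: un-flip the LTR runs so numbers and Latin words read normally.
  return flipped.replace(LTR_RUN, (run) => reverseChars(run));
}
```

A Hebrew word followed by "123" and "abc" comes out with the Hebrew letters reversed and placed last, while "123" and "abc" stay readable.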

devongovett commented 4 years ago

That’s pretty much what the Unicode bidi algorithm does: http://www.unicode.org/reports/tr9/

FlandersBurger commented 4 years ago

Will the bidi algorithm be embedded in pdfkit?

devongovett commented 4 years ago

Sure, if someone wants to implement it.

alex-enchi commented 4 years ago

There's a JS implementation of tr9: https://github.com/bbc/unicode-bidirectional. Not sure how accurate it is.

weera-tech commented 4 years ago

@FlandersBurger's word-reversal snippet above is exactly what I was looking for. One small improvement: for RTL languages like Persian (which I use), add a space to the end of the string: text.split(' ').reverse().join(' ') + ' '; This works like a charm! Remember that if your string has a special character (e.g. ":") at the end, put it before the added white space.

andreialecu commented 4 years ago

Just a note for whoever is still stuck on this that reversing the text is not a good idea. It will reverse things like numbers and various other things that should not be reversed. 123456 might result in being reversed to 654321

Use a library meant for this, like TwitterCldr, see https://github.com/foliojs/pdfkit/issues/219#issuecomment-528292446

weera-tech commented 4 years ago

Note: we are reversing an array of words, not an array of characters! I tried TwitterCLDR and the problem still persists. In my case, the problem isn't character ordering; it is white space. If you are using Linux, as I am, just install a suitable language package; this resolves character ordering, so it is no longer a problem. TwitterCLDR is good for white-space ordering, but it changes character ordering at the same time, which is not good. For me, the best manipulation is reverse().

andreialecu commented 4 years ago

@weera-tech the actual letters need to be reversed too. Not just the word order is supposed to be reversed in rtl writing.

weera-tech commented 4 years ago

You are right, but as I said, first install a suitable language package. In the RTL direction you also have to set the alignment to right, so it will conflict with TwitterCLDR's character ordering. Simply put: -1 * -1 = 1 :)

andreialecu commented 4 years ago

I'm not sure what sort of mechanism would actually reverse characters for you, but not words, considering pdfkit has no rtl support whatsoever. Perhaps something weird is happening on Linux. I'm using pdfkit in the browser with webpack.

In my experience, and I have a production app using this approach with TwitterCLDR and pdfkit, simply reversing words resulted in support tickets being issued for exactly this problem. Words were in the correct order, but letters were in the wrong order.

weera-tech commented 4 years ago

Ooops!!! You are using it client-side? I am using it server-side. That is probably the difference.

devongovett commented 4 years ago

The only correct implementation will be the Unicode bidi algorithm. Anything else, especially reverse(), will be incorrect.

andreialecu commented 4 years ago

There is a recent WASM build of the HarfBuzz engine, which is a text shaping engine used by Firefox, Chrome, and others.

https://github.com/harfbuzz/harfbuzzjs

It does support Unicode bidi algorithms among other things. I believe it could be integrated with pdfkit to solve RTL once and for all.

There is a demo here: https://harfbuzz.github.io/harfbuzzjs/

Some discussion about it being used to solve RTL issues for Photopea, which is a very popular online image editor: https://github.com/harfbuzz/harfbuzzjs/issues/10

Unfortunately I'm not familiar at all with pdfkit's text rendering, but perhaps someone could look into it.

AlexeiLevinzon commented 4 years ago

Hey,

Any news with RTL support?

andreialecu commented 4 years ago

@devongovett from my limited understanding of fontkit it seems that it does indeed support rtl.

I found this site and I was able to see rtl text being rendered properly. https://fontkit-demo.now.sh/

Also from what I understand, pdfkit is based on fontkit so what is stopping this from working?

alex-enchi commented 4 years ago

@andreialecu because RTL support is more than glyph rendering

rtl is something weird

The only proper way to render an rtl language is:

  1. determine the flow of the paragraph (rtl or ltr)
  2. run the text through unicode bidi
  3. render the text; the start position is determined by whether the paragraph is rtl or ltr
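Step 1 of the list above can be approximated by looking for the first strong character, which is how the Unicode bidi algorithm picks a paragraph's base direction by default. The sketch below is an approximation: it treats any Hebrew or Arabic letter as RTL and any other letter as LTR, whereas the real rule uses the Bidi_Class property. `paragraphDirection` and `textOptionsFor` are hypothetical helpers; the returned object only mirrors the `align` text option already mentioned in this thread.

```javascript
// Approximation: letters from these scripts count as strong RTL characters.
const RTL_LETTER = /[\p{Script=Hebrew}\p{Script=Arabic}]/u;
const ANY_LETTER = /\p{L}/u;

// Base direction = direction of the first letter; digits, spaces, and
// punctuation are skipped because they are not strongly directional.
function paragraphDirection(text) {
  for (const ch of text) {
    if (ANY_LETTER.test(ch)) return RTL_LETTER.test(ch) ? 'rtl' : 'ltr';
  }
  return 'ltr'; // no letters at all: fall back to LTR
}

// Step 3: choose the starting edge from the paragraph direction.
function textOptionsFor(text) {
  return { align: paragraphDirection(text) === 'rtl' ? 'right' : 'left' };
}
```

Step 2, the reordering itself, still needs a real bidi implementation between these two helpers.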

amitm02 commented 4 years ago

I too would love to have RTL support (Hebrew).

afsheen1 commented 4 years ago

+1 for rtl support

mayassalman commented 4 years ago

Think out of the box
use puppeteer