cage1016 / pdfium

Automatically exported from code.google.com/p/pdfium

Arabic text highlight issue #43

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This issue is from Jeff:

"
Bo, I am responsible for the PDF representation of millions of books 
digitized by Google. I also wrote the PDF code for Tesseract, 
which is a leading open source OCR engine. I have 100% control 
over every aspect of these PDF files. We would be very happy to 
become more compatible with Foxit.

Experimentally, I have found that forcing a character bounding
box in the font works well for Latin scripts, but produces erratic 
results with other languages. As always, any thoughts or suggestions 
appreciated.
"

Original issue reported on code.google.com by bo...@foxitsoftware.com on 17 Aug 2014 at 5:43

GoogleCodeExporter commented 9 years ago

Original comment by bo...@foxitsoftware.com on 17 Aug 2014 at 5:44

Attachments:

GoogleCodeExporter commented 9 years ago
This is a very, very simple PDF file. Are there any complaints?

Original comment by jbrei...@google.com on 21 Aug 2014 at 4:58

Attachments:

GoogleCodeExporter commented 9 years ago
This one only has one text object? Can you make a line of text with a few text 
objects? And I guess this one is still generated the original way?

Original comment by bo...@foxitsoftware.com on 21 Aug 2014 at 10:09

GoogleCodeExporter commented 9 years ago
This example has two words. One word is right-to-left, the other is 
left-to-right.

BT
3 Tr
1 0 0 1 8.2 8.64 Tm /f-0-0 14 Tf 82.358 Tz [ <0061><006C><006F> ] TJ 
-1 0 0 1 56.2 8.64 Tm 78.172 Tz [ <05D1><05D0><05EA><05E8><200E> ] TJ 
ET

I think I am generating correct PDF for both simple.pdf and simple-2.pdf. 
Please let me know if you disagree.
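
To make the two TJ operands above easier to read, here is a minimal Python sketch that decodes them. It assumes each 4-digit hex code is a Unicode code point, which is a guess about the font's encoding and not something confirmed in this thread. The helper name `decode_tj_array` is hypothetical.

```python
import re


def decode_tj_array(tj_operand: str) -> str:
    """Decode the <XXXX> hex strings of a TJ array into a Python string.

    Assumption: every 4 hex digits form one code, and that code equals
    the Unicode code point of the character.
    """
    chars = []
    for hex_string in re.findall(r"<([0-9A-Fa-f]+)>", tj_operand):
        # Walk the hex string in 2-byte (4 hex digit) steps.
        for i in range(0, len(hex_string), 4):
            chars.append(chr(int(hex_string[i:i + 4], 16)))
    return "".join(chars)


latin_word = decode_tj_array("[ <0061><006C><006F> ]")
hebrew_word = decode_tj_array("[ <05D1><05D0><05EA><05E8><200E> ]")

# Print code points instead of raw text so the output does not depend
# on how the terminal renders right-to-left characters.
print(latin_word)
print([f"U+{ord(ch):04X}" for ch in hebrew_word])
```

Under that assumption the first word is three Latin letters, and the second is four Hebrew letters followed by U+200E (LEFT-TO-RIGHT MARK), stored in reading order.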

Original comment by jbrei...@google.com on 22 Aug 2014 at 12:50

Attachments:

GoogleCodeExporter commented 9 years ago
Here is an example with three Hebrew words. Everything is in reading order. Is 
that a problem?

BT
3 Tr
-1 0 0 1 136.8 11.52 Tm /f-0-0 17 Tf 
        67.764 Tz [ <05DC><05DC><05E7><05D5><05D7><05D5><05EA> ] TJ 
46 0 Td 55.906 Tz [ <05D0><05DC><05D5><05E0><05D9> ] TJ 
29 0 Td 76.236 Tz [ <05D7><05D5><05DB><05DE><05EA> ] TJ 
ET
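
As a rough check of where each word starts, the sketch below applies the Td translations to the mirrored text line matrix set by Tm, following the matrix rule for Td in the PDF specification. It ignores Tz, glyph widths, and the current transformation matrix, and the helper name `apply_td` is hypothetical.

```python
def apply_td(line_matrix, tx, ty):
    """Return the text line matrix after `tx ty Td`.

    The matrix is stored as (a, b, c, d, e, f). Td pre-multiplies the
    line matrix by a translation, so only e and f change.
    """
    a, b, c, d, e, f = line_matrix
    return (a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f)


# Matrix from "-1 0 0 1 136.8 11.52 Tm": a = -1 mirrors the x axis.
line_matrix = (-1, 0, 0, 1, 136.8, 11.52)
x_origins = [line_matrix[4]]

# TJ does not change the line matrix, so each Td is relative to the
# start of the previous word.
for tx in (46, 29):
    line_matrix = apply_td(line_matrix, tx, 0)
    x_origins.append(line_matrix[4])

print(x_origins)
```

Because a = -1, each positive Td offset moves the origin to the left in user space, so the three words start at about x = 136.8, 90.8, and 61.8. That matches right-to-left text stored in reading order.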

Original comment by jbrei...@google.com on 22 Aug 2014 at 1:02

Attachments:

GoogleCodeExporter commented 9 years ago
@jbreiden, there are no specific requirements on how the characters are ordered 
in the PDF document. Reading order is fine, but most documents we have seen are 
in display order.
In this case, we should handle the document you uploaded.
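
For readers unfamiliar with the two orderings, here is a minimal Python sketch of the difference for a single Hebrew word. It simply reverses the string, which only holds for one pure right-to-left run without combining marks. It is not the Unicode bidirectional algorithm and not a description of how PDFium handles the file.

```python
# Reading (logical) order: characters stored in the order they are read.
# The code points are the four Hebrew letters from the two-word example.
reading_order = "\u05D1\u05D0\u05EA\u05E8"

# Display (visual) order: characters stored left to right as they appear
# on the page. For a single RTL run this is the reverse of reading order.
display_order = reading_order[::-1]

print([f"U+{ord(ch):04X}" for ch in reading_order])
print([f"U+{ord(ch):04X}" for ch in display_order])
```

A text extractor that assumes display order would reverse a reading-order file a second time, which is one way highlighting and selection could end up on the wrong glyphs.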

Original comment by bo...@foxitsoftware.com on 11 Sep 2014 at 9:35

GoogleCodeExporter commented 9 years ago
Fixed in 
https://pdfium.googlesource.com/pdfium/+/56ef173042d786281edcbbc9f1c38c8f97ef10d5

Original comment by bo...@foxitsoftware.com on 11 Sep 2014 at 9:35