ArtifexSoftware / pdf2docx

Open source Python library for converting PDF to DOCX.
https://pdf2docx.readthedocs.io
GNU Affero General Public License v3.0
2.49k stars 366 forks

There is no hebrew support #145

Open Ravid-Levy opened 2 years ago

Ravid-Levy commented 2 years ago

Hey Author, it supports Hebrew and Arabic letters, but it writes them in reversed order.

Where does the code do the conversion and get the letters? Can you point me to a line? I will push an update with Hebrew and Arabic support.

@dothinking

dothinking commented 2 years ago

Hi, Ravid,

Sorry for the late reply. This is not the first time I have received this issue report, but I have not been able to resolve it because I know nothing about Hebrew, Arabic, or any other right-to-left language. So it would be great if you could contribute a fix.

These are the relevant issues: #73, #106

dothinking commented 2 years ago

Where does the code do the conversion and get the letters?

The text is extracted with PyMuPDF, which seems to recognize RTL languages correctly. It is then written to docx with python-docx. Based on pdf2docx 0.5.4, the relevant code that writes text to docx:

The direct fix seems to be reversing both the spans and span.text, like

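# when writing a line: iterate over its spans in reverse order
for span in self.spans[::-1]:
    span.make_docx(p)

# when writing a span: reverse the characters of its text
docx_run = paragraph.add_run(self.text[::-1])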

But, as mentioned in #106, "the problem is that the syntax is also reversed". I'm not able to proceed from there, since I have no idea about the syntax.
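To make the problem concrete, here is a minimal sketch of one plausible reading of "the syntax is also reversed": a plain `[::-1]` also flips digits and Latin words that sit inside an RTL line. It assumes the extracted string is in visual (left-to-right on the page) order. The helper names `naive_reverse` and `reverse_keep_ltr_runs` are hypothetical and not part of pdf2docx. This is not the full Unicode Bidirectional Algorithm, only an illustration.

```python
import unicodedata


def naive_reverse(text: str) -> str:
    """Reverse every character, as in the direct fix above."""
    return text[::-1]


def reverse_keep_ltr_runs(text: str) -> str:
    """Reverse a visually ordered RTL line, keeping LTR runs readable.

    Hypothetical helper: characters whose bidi class is L (Latin letters),
    EN (European digits) or AN (Arabic-Indic digits) are grouped into runs
    that keep their internal order; everything else is reversed.
    """
    runs = []  # list of (is_ltr, [chars]) pairs, in input order
    for ch in text:
        is_ltr = unicodedata.bidirectional(ch) in ("L", "EN", "AN")
        if runs and runs[-1][0] == is_ltr:
            runs[-1][1].append(ch)
        else:
            runs.append((is_ltr, [ch]))

    out = []
    for is_ltr, chars in reversed(runs):  # the run order is always flipped
        out.append("".join(chars) if is_ltr else "".join(reversed(chars)))
    return "".join(out)


# Hebrew "shalom" followed by the number 123, in logical (typing) order.
logical = "\u05e9\u05dc\u05d5\u05dd 123"
# The same line in visual order, left to right as drawn on the page.
visual = "123 \u05dd\u05d5\u05dc\u05e9"

broken = naive_reverse(visual)         # the number comes out as 321
fixed = reverse_keep_ltr_runs(visual)  # the number stays 123
```

The sketch still reorders consecutive Latin words on a mixed line, so a real fix would more likely rely on a proper bidi implementation than on this run-based shortcut.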

dothinking commented 2 years ago

By the way, about the relation between Line and TextSpan: a Line consists of a list of TextSpans, and the letters are contained in each TextSpan. For example, a line "a brown fox jumps over a lazy dog" might look like: a brown fox jumps over a lazy dog, with each highlighted part being a TextSpan.
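The Line/TextSpan relation can be sketched roughly as follows. `SpanSketch` and `LineSketch` are simplified, hypothetical stand-ins, not the actual pdf2docx classes, and the span boundaries in the example are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SpanSketch:
    """Stand-in for a TextSpan: a piece of text with uniform formatting."""
    text: str


@dataclass
class LineSketch:
    """Stand-in for a Line: an ordered list of spans."""
    spans: List[SpanSketch] = field(default_factory=list)

    @property
    def text(self) -> str:
        # The text of a line is the concatenation of its spans' text.
        return "".join(span.text for span in self.spans)

    def reversed_for_rtl(self) -> "LineSketch":
        # Reverse the span order and the characters inside each span,
        # mirroring the two-step direct fix discussed in this thread.
        return LineSketch([SpanSketch(s.text[::-1]) for s in reversed(self.spans)])


# Assumed span boundaries; the original highlighting is not available here.
line = LineSketch([
    SpanSketch("a brown "),
    SpanSketch("fox"),
    SpanSketch(" jumps over a "),
    SpanSketch("lazy dog"),
])
flipped = line.reversed_for_rtl()
```

Reversing both levels is equivalent to reversing the whole line string, which is why both `self.spans` and `self.text` would need to change together.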

In addition, we might need to adjust the text alignment as well, which is more complicated. Anyway, let's start from the first step for now. Many thanks in advance.

devorel commented 7 months ago

Hi, the fix is very simple. Right now it writes something like "olleh" where it should be "hello".
Just reverse the letters in each word and the order of the words in the sentence. Here is a link to a reversing tool where you can see how it should look: https://www.cables.org.il/hebrew.htm and here is an alternative tool that works well: https://github.com/NaorYael/pdf-convert-hebrew-example