ArtifexSoftware / pdf2docx

Open source Python library for converting PDF to DOCX.
https://pdf2docx.readthedocs.io
GNU Affero General Public License v3.0
2.5k stars, 369 forks

Is there any way to improve the layout restoration? #137

Open liuxunfei opened 2 years ago

liuxunfei commented 2 years ago

1804.10371.pdf 1804.10371.docx

dothinking commented 2 years ago

Hi liuxunfei, it seems that no PDF or docx file was uploaded.

liuxunfei commented 2 years ago

src.pdf dst.docx

Hi dothinking, on Windows I used the pdf2docx convert command to convert the PDF above into a docx file. The pictures, tables, and paragraphs in the docx are out of order, and some paragraphs in the source PDF were turned into tables in the docx. Is this a problem with PDF data parsing, or a problem with data backfilling during layout restoration when the Word file is finally generated? Is there room for optimization?

dothinking commented 2 years ago

Many thanks for providing a good case.

Is this a problem with PDF data parsing, or a problem with data backfilling during layout restoration when the Word file is finally generated?

It's a layout analysis problem. Currently, a very simple layout analysis algorithm is applied. It focuses on converting the floating layout of a PDF into the flowing layout of a docx, aiming to produce a docx that looks similar to the source. As a result, tables are commonly used for layout control, which is why some paragraphs end up inside tables.
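To make the "tables for layout control" idea concrete, here is a minimal sketch, not the actual pdf2docx algorithm. It assumes each text block is just a bounding box in page coordinates (the `Block` class and both function names are hypothetical). Blocks that share a vertical band are grouped into one row. A row with a single block can flow as a normal paragraph, while side-by-side blocks need a table row to stay next to each other in a flowing document.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical text block: its text plus a bounding box (y grows downward)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_rows(blocks: List[Block]) -> List[List[Block]]:
    """Group blocks whose vertical extents overlap into the same row."""
    rows: List[List[Block]] = []
    bottom = 0.0  # lowest edge of the row currently being built
    for block in sorted(blocks, key=lambda b: (b.y0, b.x0)):
        if rows and block.y0 < bottom:
            # Starts above the bottom of the current row: same row.
            rows[-1].append(block)
            bottom = max(bottom, block.y1)
        else:
            rows.append([block])
            bottom = block.y1
    # Order the blocks of each row from left to right.
    return [sorted(row, key=lambda b: b.x0) for row in rows]


def to_flow(blocks: List[Block]) -> List[Tuple[str, List[str]]]:
    """Map rows to flowing elements: one block -> paragraph, several -> table row."""
    flow = []
    for row in group_into_rows(blocks):
        kind = "paragraph" if len(row) == 1 else "table"
        flow.append((kind, [b.text for b in row]))
    return flow


page = [
    Block("Title", 50, 40, 550, 60),
    Block("Left column", 50, 80, 290, 300),
    Block("Right column", 310, 80, 550, 300),
    Block("Footer", 50, 320, 550, 340),
]
for kind, texts in to_flow(page):
    print(kind, texts)
```

With a rule this simple, any two paragraphs that happen to sit side by side (for example, in a two-column paper) become table cells, which matches the symptom reported above.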

Is there room for optimization?

Machine learning is now a powerful technique for layout analysis, but I'm not yet willing to use it because it would make installation and setup harder (e.g., requiring tensorflow or pytorch), especially for novice users. I'm now trying a traditional computer vision approach with python-opencv, but it might need more time before a release.
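For readers wondering what a non-ML approach can look like, one classical technique is the recursive XY-cut: split the page along the widest whitespace gap, then repeat inside each part. The sketch below is only an illustration of that idea, not the method being developed for pdf2docx. It works on bounding boxes in pure Python rather than on pixels with python-opencv, and the names `find_gap`, `xy_cut`, and the `min_gap` threshold are assumptions made for this example.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def find_gap(boxes: List[Box], axis: int, min_gap: float) -> Optional[float]:
    """Return the midpoint of the widest empty gap along an axis (0 = x, 1 = y).

    Returns None when no gap is at least `min_gap` wide.
    """
    intervals = sorted((b[axis], b[axis + 2]) for b in boxes)
    best_width, best_cut = 0.0, None
    reach = intervals[0][1]  # farthest edge covered so far
    for start, end in intervals[1:]:
        gap = start - reach
        if gap >= min_gap and gap > best_width:
            best_width, best_cut = gap, (start + reach) / 2
        reach = max(reach, end)
    return best_cut


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> List[List[Box]]:
    """Recursively split boxes into regions separated by whitespace gaps."""
    if not boxes:
        return []
    if len(boxes) == 1:
        return [list(boxes)]
    # Try a horizontal cut (gap along y) first, then a vertical cut (gap along x).
    for axis in (1, 0):
        cut = find_gap(boxes, axis, min_gap)
        if cut is not None:
            before = [b for b in boxes if b[axis + 2] <= cut]
            after = [b for b in boxes if b[axis] >= cut]
            return xy_cut(before, min_gap) + xy_cut(after, min_gap)
    return [list(boxes)]  # no usable gap: keep these boxes together


title = (50, 40, 550, 60)
left1, left2 = (50, 80, 290, 190), (50, 200, 290, 300)
right1, right2 = (310, 80, 550, 150), (310, 160, 550, 300)
regions = xy_cut([title, right2, left1, right1, left2])
for region in regions:
    print(region)
```

On this two-column example the regions come out in reading order: the title, then the left column top to bottom, then the right column. Raising `min_gap` keeps closely spaced paragraphs of a column together in one region.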