Open liuxunfei opened 2 years ago
Hi liuxunfei, it seems no PDF or docx file was uploaded.
Hi dothinking, on Windows I used the pdf2docx convert command to convert the attached PDF into docx. The pictures, tables, and paragraphs in the docx are out of order, and some paragraphs in the source PDF are turned into tables in the docx. Is this a problem of PDF data parsing, or a problem of data backfilling during layout restoration when the Word file is finally generated? Is there room for optimization?
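For anyone reproducing this, here is a minimal sketch of the conversion step driven from Python. It assumes the `pdf2docx` command-line tool is installed and on the PATH; the helper names `build_convert_command` and `run_convert` are hypothetical and only wrap the `pdf2docx convert <pdf> <docx>` call mentioned above.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, docx_path=None):
    """Build the argument list for `pdf2docx convert <pdf> <docx>`.

    Hypothetical helper: if no output path is given, the .pdf suffix
    is swapped for .docx next to the input file.
    """
    pdf = Path(pdf_path)
    docx = Path(docx_path) if docx_path else pdf.with_suffix(".docx")
    return ["pdf2docx", "convert", str(pdf), str(docx)]


def run_convert(pdf_path, docx_path=None):
    """Run the conversion; raises CalledProcessError if the tool fails.

    Assumes the pdf2docx executable is available on the PATH.
    """
    subprocess.run(build_convert_command(pdf_path, docx_path), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_convert(...) to actually convert.
    print(" ".join(build_convert_command("1804.10371.pdf")))
```

Passing the arguments as a list avoids shell quoting problems with Windows paths that contain spaces.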
Many thanks for providing a good case.
Is this a problem of PDF data parsing, or a problem of data backfilling during layout restoration when the Word file is finally generated?
It's a problem of layout analysis. Currently, a very simple layout analysis algorithm is applied. It focuses on converting the floating layout in PDF to a flowing layout in docx, with the aim of producing a docx that looks similar to the source. As a result, tables are commonly used for layout control, which is why some paragraphs end up inside tables.
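To illustrate why side-by-side paragraphs can end up in a table, here is a toy sketch of a floating-to-flowing conversion. It is not the algorithm pdf2docx actually uses; the `Block` type, the row-grouping rule, and the function names are all assumptions made for illustration. Blocks that overlap vertically are grouped into one row, and any row with more than one block is emitted as a table row.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with its bounding box on the page (hypothetical type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_rows(blocks):
    """Group blocks whose vertical extents overlap into the same row."""
    rows = []
    for block in sorted(blocks, key=lambda b: (b.y0, b.x0)):
        if rows and block.y0 < rows[-1]["bottom"]:
            # Starts above the bottom of the current row: same row.
            rows[-1]["blocks"].append(block)
            rows[-1]["bottom"] = max(rows[-1]["bottom"], block.y1)
        else:
            rows.append({"bottom": block.y1, "blocks": [block]})
    # Order the blocks in each row from left to right.
    return [sorted(r["blocks"], key=lambda b: b.x0) for r in rows]


def to_flow(blocks):
    """Emit a flowing sequence: single blocks become paragraphs,
    side-by-side blocks become one table row."""
    flow = []
    for row in group_into_rows(blocks):
        if len(row) == 1:
            flow.append(("paragraph", row[0].text))
        else:
            flow.append(("table_row", [b.text for b in row]))
    return flow


if __name__ == "__main__":
    page = [
        Block("Title", 0, 0, 500, 20),
        Block("Right column", 260, 30, 500, 200),
        Block("Left column", 0, 30, 240, 200),
        Block("Footer", 0, 210, 500, 230),
    ]
    for item in to_flow(page):
        print(item)
```

In this sketch a two-column page yields a table row for the two columns, even though each column is an ordinary paragraph in the source. That is the kind of result described in the report above.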
Is there room for optimization?
Machine learning is now a powerful technique for layout analysis, but I'm not yet willing to use it because dependencies such as TensorFlow or PyTorch would make installation and setup harder, especially for beginners. I'm now trying a traditional computer vision method with python-opencv, but it might need more time before a release.
1804.10371.pdf 1804.10371.docx