ArtifexSoftware / pdf2docx

Open source Python library for converting PDF to DOCX.
https://pdf2docx.readthedocs.io
GNU Affero General Public License v3.0
2.46k stars, 356 forks

Question about the line-height allocation logic #291

Closed heweisheng closed 1 week ago

heweisheng commented 3 months ago

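I have recently been working on reconstructing scanned documents from OCR (using PaddlePaddle's layout recognition plus ReportLab to generate the reconstructed PDF). Laying out a PDF is currently the easier route, so I plan to produce the PDF first and then run pdf2docx on it. (Implementing OCR-based layout could probably be done as a direct extension of this project, but I don't have time for that yet.)

Looking at the PDF parsing, a single paragraph can end up containing multiple lines. In that case the line height should be divided evenly among the lines.

A concrete case that shows the problem: test_7.pdf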
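The even split described above can be sketched as follows. This is only an illustration of the arithmetic, not pdf2docx's actual code: `split_line_height` and its parameters are hypothetical names, and the coordinates are assumed to be in points with y increasing downward.

```python
def split_line_height(block_top: float, block_bottom: float, line_count: int) -> float:
    """Return the height given to each line when a block's total
    height is shared equally among its lines (hypothetical helper)."""
    if line_count <= 0:
        raise ValueError("line_count must be positive")
    return (block_bottom - block_top) / line_count


# A block spanning y=100..160 pt that holds 3 lines: each line gets 60 / 3 pt.
per_line = split_line_height(100.0, 160.0, 3)
```

The per-line value could then be applied as a fixed line spacing for every line of the paragraph, instead of assigning the whole block height to a single line.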

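[image] [image]

Converted using this logic: [image]

With the line height divided evenly: [image]

Also, would it be possible to insert blank lines in between so that the layout stays as close to the original as possible?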
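One possible reading of the blank-line suggestion is to turn the vertical gap between two consecutive blocks into a whole number of empty lines. The sketch below assumes that reading; `blank_lines_for_gap` is a hypothetical helper, not part of pdf2docx, and the coordinates are assumed to be in points with y increasing downward.

```python
def blank_lines_for_gap(prev_bottom: float, next_top: float, line_height: float) -> int:
    """Return how many full blank lines fit in the vertical gap between
    the bottom of one block and the top of the next (hypothetical helper)."""
    if line_height <= 0:
        raise ValueError("line_height must be positive")
    # Overlapping blocks produce a negative gap; treat that as no gap.
    gap = max(0.0, next_top - prev_bottom)
    return int(gap // line_height)


# A 50 pt gap with 20 pt lines leaves room for two full blank lines.
count = blank_lines_for_gap(160.0, 210.0, 20.0)
```

Rounding down keeps the inserted lines from pushing later content further down than in the original page; the leftover gap could be handled with paragraph spacing.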

greendreamer commented 3 weeks ago

Currently, when I convert the PDF to DOCX, the result matches the PDF. Please check again.

greendreamer commented 1 week ago

Closing this due to a lack of response for an extended period of time.