ArtifexSoftware / pdf2docx

Open source Python library for converting PDF to DOCX.
https://pdf2docx.readthedocs.io
GNU Affero General Public License v3.0
2.46k stars 356 forks

Cannot restore table border lines from the PDF file #279

Open ericosmic opened 5 months ago

ericosmic commented 5 months ago

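While converting PDFs I ran into two problems:
1. When a table in the PDF has hidden borders with only some segments visible, those visible segments are not restored in the docx file. For example, the red line in the sample is one border line of a table.
2. First-line indentation of text paragraphs is not reproduced.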

The sample is shown in the image below: image zf1.pdf

zhangdanfenggg commented 4 months ago
image

Same problem here: when converting to docx, the horizontal lines are not converted, and I also get these errors:

[INFO] Start to convert D:/Download/aab.pdf
[INFO] [1/4] Opening document...
[INFO] [2/4] Analyzing document...
[WARNING] Ignore Line "𝑘𝐿\udc40" due to overlap
[WARNING] Ignore Line "𝑘" due to overlap
[INFO] [3/4] Parsing pages...
[INFO] (1/18) Page 1
[INFO] (2/18) Page 2
[INFO] (3/18) Page 3
[INFO] (4/18) Page 4
[INFO] (5/18) Page 5
[INFO] (6/18) Page 6
[INFO] (7/18) Page 7
[INFO] (8/18) Page 8
[INFO] (9/18) Page 9
[INFO] (10/18) Page 10
[INFO] (11/18) Page 11
[INFO] (12/18) Page 12
[INFO] (13/18) Page 13
[INFO] (14/18) Page 14
[ERROR] Ignore page 14 due to parsing page error: 'utf-8' codec can't encode character '\udc54' in position 0: surrogates not allowed
[INFO] (15/18) Page 15
[ERROR] Ignore page 15 due to parsing page error: 'utf-8' codec can't encode character '\udc59' in position 0: surrogates not allowed
[INFO] (16/18) Page 16
[INFO] (17/18) Page 17
[INFO] (18/18) Page 18
[INFO] [4/4] Creating pages...
[INFO] (1/16) Page 1
[INFO] (2/16) Page 2
[INFO] (3/16) Page 3
[INFO] (4/16) Page 4
[INFO] (5/16) Page 5
[INFO] (6/16) Page 6
[ERROR] Ignore page 6 due to making page error: 'utf-8' codec can't encode character '\udc40' in position 2: surrogates not allowed
[INFO] (7/16) Page 7
[INFO] (8/16) Page 8
[INFO] (9/16) Page 9
[INFO] (10/16) Page 10
[INFO] (11/16) Page 11
[INFO] (12/16) Page 12
[INFO] (13/16) Page 13
[INFO] (14/16) Page 16
[INFO] (15/16) Page 17
[INFO] (16/16) Page 18
[INFO] Terminated in 1.70s.

File Converted Successfully aab.pdf
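For anyone hitting the same `surrogates not allowed` message: the error itself is standard Python behavior whenever a string containing an unpaired surrogate code point (U+D800 to U+DFFF) is encoded as UTF-8. The values in the log (`\udc40`, `\udc54`, `\udc59`) are all low surrogates, which would be consistent with a split surrogate pair from the math-italic glyphs, though that is only a guess. The sketch below reproduces the error with the standard library alone and shows one possible way to sanitize such text. `strip_lone_surrogates` and `encode_error_reason` are hypothetical helper names, not part of pdf2docx, and this is not a fix for the library itself.

```python
def encode_error_reason(text: str) -> str:
    """Return the UTF-8 encode error reason for text, or '' if it encodes cleanly."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        return exc.reason
    return ""


def strip_lone_surrogates(text: str) -> str:
    """Replace every surrogate code point (U+D800..U+DFFF) with U+FFFD.

    Hypothetical helper: inside a Python str, any code point in this
    range is unpaired by definition and cannot be encoded as UTF-8.
    """
    return "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in text
    )


# A string shaped like the ones in the log: normal text plus a lone low surrogate.
sample = "k\udc40"

reason = encode_error_reason(sample)      # same reason text as in the log
cleaned = strip_lone_surrogates(sample)   # surrogate swapped for U+FFFD
encoded = cleaned.encode("utf-8")         # now encodes without raising
```

Replacing with U+FFFD loses the original glyph, so this only keeps the page from being dropped; it does not recover the character.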