Error encountered when using parse() function on PDF file

wwaguai commented 7 months ago

Hello,

I encountered an error while using the parse() function to convert a PDF file. The error message I received is as follows: "[ERROR] in method 'TextWriter_append', argument 3 of type 'char *'". I would appreciate any assistance in resolving this issue.

Thank you!

Attachment: zijie.pdf

dothinking commented 7 months ago

@wwaguai The attachment link doesn't seem to be correct. Please check it and re-upload the file, thanks.

wwaguai commented 7 months ago

Sorry, I have re-uploaded the file.

dothinking commented 7 months ago

I couldn't reproduce this issue with the latest versions, pdf2docx==0.5.7 and pymupdf==1.23.16. It looks like an upstream issue in pymupdf, so please upgrade pymupdf to the latest version and try again.

pip install pymupdf==1.23.16

>>> from pdf2docx import parse
>>> parse('zijie.pdf')
[INFO] Start to convert test.pdf
[INFO] [1/4] Opening document...
[INFO] [2/4] Analyzing document...
[INFO] [3/4] Parsing pages...
[INFO] (1/13) Page 1
[INFO] (2/13) Page 2
[INFO] (3/13) Page 3
[INFO] (4/13) Page 4
[INFO] (5/13) Page 5
[INFO] (6/13) Page 6
[INFO] (7/13) Page 7
[INFO] (8/13) Page 8
[INFO] (9/13) Page 9
[INFO] (10/13) Page 10
[INFO] (11/13) Page 11
[INFO] (12/13) Page 12
[INFO] (13/13) Page 13
[INFO] [4/4] Creating pages...
[INFO] (1/13) Page 1
[INFO] (2/13) Page 2
[INFO] (3/13) Page 3
[INFO] (4/13) Page 4
[INFO] (5/13) Page 5
[INFO] (6/13) Page 6
[INFO] (7/13) Page 7
[INFO] (8/13) Page 8
[INFO] (9/13) Page 9
[INFO] (10/13) Page 10
[INFO] (11/13) Page 11
[INFO] (12/13) Page 12
[INFO] (13/13) Page 13
[INFO] Terminated in 2.35s.
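
Since the result depends on which versions are installed, it can help to confirm them before re-running the conversion. The following is only a minimal sketch using the standard library; the helper names `installed_version` and `report_versions` are made up for illustration and are not part of pdf2docx or pymupdf.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report_versions(packages=("pdf2docx", "PyMuPDF")):
    """Map each distribution name to its installed version (None if missing)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    # Print one line per package so the versions can be pasted into a bug report.
    for name, ver in report_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Including this output in a report makes it easier to tell whether an error comes from an outdated dependency.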

dothinking commented 7 months ago

However, I found three issues in the converted result. Thanks for this good test file. Fortunately, these are minor issues and will be fixed in the upcoming v0.5.8.

- Empty font name.
- Invalid character �: this is the Unicode replacement character (\ufffd). It should be ignored even though pymupdf extracts it from the PDF file.
- Wrong spacing before a paragraph: caused by an invisible shape that is larger than the page.
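
To illustrate the second and third points, here is a rough sketch of the two kinds of filtering described. It is not the actual v0.5.8 fix: the function names are hypothetical, and bounding boxes are assumed to be plain `(x0, y0, x1, y1)` tuples rather than pymupdf objects.

```python
REPLACEMENT_CHAR = "\ufffd"  # Unicode replacement character


def strip_replacement_chars(text):
    """Remove every U+FFFD character from the extracted text."""
    return text.replace(REPLACEMENT_CHAR, "")


def fits_on_page(shape_bbox, page_bbox):
    """True if the shape is no wider and no taller than the page."""
    sx0, sy0, sx1, sy1 = shape_bbox
    px0, py0, px1, py1 = page_bbox
    return (sx1 - sx0) <= (px1 - px0) and (sy1 - sy0) <= (py1 - py0)


def drop_oversized_shapes(shape_bboxes, page_bbox):
    """Keep only the shapes whose bounding box fits within the page size."""
    return [bbox for bbox in shape_bboxes if fits_on_page(bbox, page_bbox)]
```

For example, with an A4-sized page of `(0, 0, 595, 842)`, a shape spanning `(-50, -50, 700, 900)` would be dropped, while `(10, 10, 100, 50)` would be kept.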

wwaguai commented 7 months ago

Thanks, it's useful.

dothinking commented 7 months ago

v0.5.8 was released. You can upgrade to the latest version to get better conversion results.

pip install pdf2docx --upgrade

ArtifexSoftware / pdf2docx

Error encountered when using parse() function on PDF file #256