cbrunet / python-poppler

Python binding to Poppler-cpp pdf library
GNU General Public License v2.0

Segmentation fault (core dumped) #64

Open avirala-eightfold opened 2 years ago

avirala-eightfold commented 2 years ago

Hi, thank you for this amazing work. I have recently been processing some PDFs, and poppler works great for most of them, but for some I see the following error:

Segmentation fault (core dumped)

Since this is a memory error that kills the process, I can't wrap the call in a try/except, so my workers keep restarting and then get stuck on the same file again. This has been a major problem for me. Here is some context and the debugging I have done so far:

  1. The segmentation fault happens when I call page.text_list(page.TextListOption.text_list_include_font).
  2. If I remove the optional enum, the error no longer occurs. pdf_document.create_font_iterator() also works, but requesting font information at the text-box level triggers the crash.
  3. Execution stops with the error as soon as it reaches boxes = self._page.text_list(opt_flag) in page.py.
  4. I initially thought this might be an upstream bug in the C++ code itself, but other libraries based on poppler seem to handle this PDF fine, so I suspect something is going wrong in the Python bindings.

The metadata of the PDFs that show this error is mostly (not always):

{'Producer': 'macOS Version 11.2.3 (Build 20D91) Quartz PDFContext', 'Creator': 'Pages'}

Code to reproduce the error:

from poppler import load_from_file
file_path = "sample_pdf.pdf"
pdf_document = load_from_file(file_path)
no_of_pages = pdf_document.pages
for page_ind in range(no_of_pages):
    page = pdf_document.create_page(page_ind)
    text_list = page.text_list(page.TextListOption.text_list_include_font)

Link to the PDF: https://drive.google.com/file/d/180CDGyiJRfytvuzVsAiYKppHvaBABGkJ/view?usp=sharing Please request access, as I can't share it publicly. (Really sorry for this, but I hope you understand.)

avirala-eightfold commented 1 year ago

Hi @cbrunet @bzamecnik, I have encountered many such files. Would it be possible to skip the font information in these cases, return just the text, and avoid the core dump?
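
A possible workaround while the root cause is open, sketched under assumptions rather than taken from the library: a segmentation fault kills the interpreter and cannot be caught with try/except, but the font-enabled call can be probed in a throwaway child process first, and the parent can fall back to a plain text_list() when the child dies on a signal. The names PROBE, crashes_in_child and safe_text_list are hypothetical helpers; only load_from_file, create_page, text_list and TextListOption.text_list_include_font come from the report above. The return-code check relies on POSIX behaviour, and the price is that each page is parsed twice.

```python
import subprocess
import sys

# Child script (hypothetical): repeats the crashing call for a single page.
PROBE = """
import sys
from poppler import load_from_file
page = load_from_file(sys.argv[1]).create_page(int(sys.argv[2]))
page.text_list(page.TextListOption.text_list_include_font)
"""


def crashes_in_child(code, *args, timeout=60):
    """Run `code` in a fresh interpreter; True if it was killed by a signal."""
    proc = subprocess.run(
        [sys.executable, "-c", code, *map(str, args)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    # On POSIX, a negative return code is the number of the killing signal
    # (-11 for SIGSEGV). Ordinary Python errors exit with a positive code.
    return proc.returncode < 0


def safe_text_list(pdf_document, file_path, page_ind):
    """Return text boxes with font info if the probe survives, else without."""
    page = pdf_document.create_page(page_ind)
    if crashes_in_child(PROBE, file_path, page_ind):
        # Assumption from the report: the call without the enum does not crash.
        return page.text_list()
    return page.text_list(page.TextListOption.text_list_include_font)
```

In the repro loop this would replace the direct call, for example text_list = safe_text_list(pdf_document, file_path, page_ind). Probing once per document instead of once per page would be cheaper, at the cost of losing font info for every page of an affected file.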

avirala-eightfold commented 1 year ago

@cbrunet @bzamecnik sorry to tag you again, but this issue has become more frequent and now shows up on many documents. It would be really helpful if you could take a look.

avirala-eightfold commented 1 year ago

@bzamecnik did you by any chance get to take a look at this? Sorry for tagging again.

bzamecnik commented 1 year ago

@avirala-eightfold Hi, I can possibly check that. I made the request for sharing the file. Have you managed to confirm it yet?

avirala-eightfold commented 1 year ago

@bzamecnik Yes, I somehow missed it, sorry for the delay. I have shared it again; can you please confirm that you can access it?

avirala-eightfold commented 1 year ago

@bzamecnik did you get a chance to look at it?

avirala-eightfold commented 1 year ago

Sorry for bugging you again but @bzamecnik did you get a chance to look into it?

bzamecnik commented 1 year ago

@avirala-eightfold Sorry, no, I didn't have a chance to look at it. Is there anything that prevents you from investigating it yourself?

UPDATE: I can confirm that it crashes on a Segmentation fault. That's all I can see without rebuilding the code. 🤷

Running with gdb gives some hint:

$ gdb
(gdb) file python
Reading symbols from python...
(gdb) run script.py
Starting program: /usr/local/bin/python script.py
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x0000ffff990e54a8 in TextFontInfo::matches(Ref const*) const () from /usr/lib/aarch64-linux-gnu/libpoppler.so.102
(gdb) 

Enabling the faulthandler module gives a similar hint:

import faulthandler

faulthandler.enable()

# ... rest of the code...

Fatal Python error: Segmentation fault

Current thread 0x0000ffffa61f8d30 (most recent call first):
  File "/usr/local/lib/python3.11/site-packages/poppler/page.py", line 128 in text_list
  File "/usr/local/lib/python3.11/site-packages/poppler/utilities.py", line 90 in wrapped
  File "/app/script.py", line 11 in <module>
Segmentation fault
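
For long-running workers whose stderr is not captured, the same traceback can be written to a file instead. This is a minimal sketch: the log file name is a hypothetical choice, and the file object has to stay open for as long as the handler is enabled.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; any writable path works.
crash_log_path = os.path.join(tempfile.gettempdir(), "poppler_crash.log")

# Keep a reference to the file: faulthandler writes to its file descriptor
# at crash time, so it must not be closed or garbage-collected earlier.
crash_log = open(crash_log_path, "a")
faulthandler.enable(file=crash_log, all_threads=True)

# ... rest of the code (load_from_file, create_page, text_list) ...
```

After a crash, the log should contain the same kind of "most recent call first" trace as shown above, which points at the Python frame that entered the native code.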

Some fiddling with the code:

Looking at the gdb output, the crash may come from this place: https://github.com/freedesktop/poppler/blob/master/cpp/poppler-page.cpp#L461

if (cur_text_font_info->matches(&(tb_font_info->font_info_cache[k].d->ref))) {

...which would mean that one of the font info references is invalid (either cur_text_font_info or the one in the cache).

avirala-eightfold commented 1 year ago

Thank you so much for looking into it. Let me take this as a starting point and try to dig further.