Status: Open. avirala-eightfold opened this issue 2 years ago.
Hi @cbrunet @bzamecnik, I have encountered many such files. Would it be possible to skip the font information in these cases, return just the text, and so prevent the core dump?
@cbrunet @bzamecnik sorry to tag you again, but this issue has become more frequent and now shows up on a lot of documents. It would be really helpful if you could take a look at it.
@bzamecnik did you by any chance get to take a look at this? Sorry for tagging again.
@avirala-eightfold Hi, I can possibly check that. I made the request for sharing the file. Have you managed to confirm it yet?
@bzamecnik Yes, I somehow missed it, sorry for the delay. I have shared it again; can you please confirm that you can access it?
@bzamecnik did you get a chance to look at it?
Sorry for bugging you again but @bzamecnik did you get a chance to look into it?
@avirala-eightfold Sorry, no, I didn't have a chance to look at it. Is there anything that prevents you from investigating it yourself?
UPDATE: I can confirm that it crashes with a segmentation fault. That's all I can see without rebuilding the code. 🤷
Running with gdb gives some hint:
$ gdb
(gdb) file python
Reading symbols from python...
(gdb) run script.py
Starting program: /usr/local/bin/python script.py
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x0000ffff990e54a8 in TextFontInfo::matches(Ref const*) const () from /usr/lib/aarch64-linux-gnu/libpoppler.so.102
(gdb)
Enabling the faulthandler module gives a similar hint:
import faulthandler
faulthandler.enable()
# ... rest of the code...
Fatal Python error: Segmentation fault
Current thread 0x0000ffffa61f8d30 (most recent call first):
File "/usr/local/lib/python3.11/site-packages/poppler/page.py", line 128 in text_list
File "/usr/local/lib/python3.11/site-packages/poppler/utilities.py", line 90 in wrapped
File "/app/script.py", line 11 in <module>
Segmentation fault
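For worker processes where stderr is not easy to get at, the same faulthandler dump can be sent to a file. Below is a minimal sketch of that idea, not something taken from the affected code: the `crash_report` helper and the log path are hypothetical, and `os.kill(..., SIGSEGV)` is only a stand-in for the crashing `text_list()` call (POSIX only).

```python
import os
import subprocess
import sys
import tempfile

# Child script: enable faulthandler with a log file, then crash.
# The os.kill() line is a stand-in for the real crashing call.
CHILD = """
import faulthandler, os, signal, sys
log = open(sys.argv[1], "w")  # must stay open for the life of the process
faulthandler.enable(file=log, all_threads=True)
os.kill(os.getpid(), signal.SIGSEGV)
"""


def crash_report(child_code=CHILD):
    """Run the child in a fresh interpreter and return what faulthandler logged."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "crash.log")
        subprocess.run([sys.executable, "-c", child_code, path])
        with open(path) as fh:
            return fh.read()


if __name__ == "__main__":
    print(crash_report())
```

The file object passed to `faulthandler.enable()` has to stay open until the process exits, because the handler writes straight to its file descriptor.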
Some fiddling with the code:
qpdf --check sample_pdf.pdf does not see any errors.
Looking at the gdb output, the crash may come from this place: https://github.com/freedesktop/poppler/blob/master/cpp/poppler-page.cpp#L461
if (cur_text_font_info->matches(&(tb_font_info->font_info_cache[k].d->ref))) {
...which would mean some reference to the font info is wrong (either cur_text_font_info or the one in the cache).
Thank you so much for looking into it. Let me take this as the base and move forward to see if I can find anything else.
Hi, thank you for this amazing work. Recently I was working with some PDFs, and poppler worked great for most of them, but for some of them I am seeing the following error:-
Since this is a memory issue, I also can't put it in a try & catch to stop my code from restarting the workers again and again, only to get stuck at the same place. This has been a major problem for me. To give you some context, here is the debugging I have done so far:-
page.text_list(page.TextListOption.text_list_include_font)
pdf_document.create_font_iterator(), this also works, but while getting this at the text_box level I face this error.
At boxes = self._page.text_list(opt_flag) in page.py the code is stopped with the error.
The metadata for the PDF that I see such errors with is mostly (not always):-
The code to repro the error:-
The link to the PDF:- https://drive.google.com/file/d/180CDGyiJRfytvuzVsAiYKppHvaBABGkJ/view?usp=sharing Please request access to the PDF, as I can't share it publicly. (Really sorry for this, but I hope you understand.)
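Since a segfault cannot be caught with try & catch, one possible workaround until the root cause is fixed is to run the extraction in a separate interpreter, so that only the child process dies. This is a rough sketch under assumptions, not a fix in python-poppler itself: `run_isolated` and `CHILD_CODE` are hypothetical names, the child script assumes the usual python-poppler calls (`load_from_file`, `create_page`, `text_list`, `TextBox.text`), and the signal detection relies on POSIX return codes.

```python
import signal
import subprocess
import sys

# Hypothetical child script; the python-poppler calls are assumed from
# the usage shown in this thread and are not exercised by run_isolated itself.
CHILD_CODE = """
import sys
from poppler import load_from_file
doc = load_from_file(sys.argv[1])
page = doc.create_page(0)
for box in page.text_list(page.TextListOption.text_list_include_font):
    print(box.text)
"""


def run_isolated(code, *args, timeout=60):
    """Run `code` in a fresh interpreter.

    Returns (ok, stdout, signal_name). signal_name is set (e.g. "SIGSEGV")
    when the child was killed by a signal, otherwise it is None.
    Raises subprocess.TimeoutExpired if the child runs longer than `timeout`.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code, *args],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode < 0:  # on POSIX, a negative code means "killed by signal"
        return False, proc.stdout, signal.Signals(-proc.returncode).name
    return proc.returncode == 0, proc.stdout, None


if __name__ == "__main__":
    ok, text, sig = run_isolated(CHILD_CODE, "sample_pdf.pdf")
    print("ok" if ok else f"child failed (signal: {sig})")
```

If the child reports `SIGSEGV`, the worker could retry with a second child script that calls `text_list()` without the font option and returns just the text. Whether that path avoids the crash for every affected file is not confirmed here.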