gettalong / hexapdf

Versatile PDF creation and manipulation for Ruby
https://hexapdf.gettalong.org

HexaPDF Fails to Detect Pages in a PDF Document #303

Closed Jorge-Signwell closed 3 months ago

Jorge-Signwell commented 3 months ago

Issue: HexaPDF Fails to Detect Pages in a PDF Document

Description:

I am encountering an issue with the HexaPDF gem where it fails to detect the pages in a PDF document. The document has 17 pages, but when I attempt to open it and count the pages using HexaPDF, it returns a count of 0.

Steps to Reproduce:

  1. Create a PDF document with multiple pages (the document I used has 17 pages).

  2. Use the following Ruby script to open the PDF and count its pages:

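    require 'hexapdf'
    
    path = './output.pdf'
    
    # Verify that the file exists before attempting to open it
    abort("File not found: #{path}") unless File.exist?(path)
    
    # Open the document and print the number of pages HexaPDF detects
    document = HexaPDF::Document.open(path)
    puts document.pages.count

  3. Run the script.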

    ❯ ruby main.rb
    0

Expected Behavior:

The script should output the correct number of pages in the PDF (17 in this case).

Actual Behavior:

The script outputs 0, indicating that no pages are detected in the PDF.

Additional Information:

❯ hexapdf info --check output.pdf 

WARNING: Parse error at position 0: PDF file trailer with end-of-file marker not found - trying cross-reference table reconstruction
WARNING: Validation error for trailer: ID field should always be set (correctable)
WARNING: Validation error for sub-object of object type Catalog (2,0): A PDF document needs a page tree (correctable)
WARNING: Validation error for object type Pages (407,0): A PDF document needs at least one page (correctable)
ERROR: Stream of object (73,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (74,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (75,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (76,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (77,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (78,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (79,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (80,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (66,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (69,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (71,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (72,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (81,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (82,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (89,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (90,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (91,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (92,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (93,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (94,0) invalid: Problem while decoding Flate encoded stream: unknown compression method
ERROR: Stream of object (97,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (101,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (107,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (110,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (116,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (122,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (125,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (126,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (132,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (141,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (145,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (153,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (163,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (170,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (173,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (176,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (179,0) invalid: Problem while decoding Flate encoded stream: unknown compression method
ERROR: Stream of object (182,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (185,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (188,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (191,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (194,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (195,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (196,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (208,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (209,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (232,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (233,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (245,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (246,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (256,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (270,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (272,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (273,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (274,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (275,0) invalid: Problem while decoding Flate encoded stream: unknown compression method
ERROR: Stream of object (276,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (277,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (278,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (280,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (282,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (284,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (285,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (287,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (289,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (292,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (293,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (295,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (297,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (300,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (302,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (304,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (306,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (308,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (309,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (311,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (314,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (315,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (317,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (320,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (337,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (360,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (367,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (368,0) invalid: Problem while decoding Flate encoded stream: unknown compression method
ERROR: Stream of object (369,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (370,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (371,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (372,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (373,0) invalid: Problem while decoding Flate encoded stream: unknown compression method
ERROR: Stream of object (374,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (375,0) invalid: Problem while decoding Flate encoded stream: unknown compression method
ERROR: Stream of object (376,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (377,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (378,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (379,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (380,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (381,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (382,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (383,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (384,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (385,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (386,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (387,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (388,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (389,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (390,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (391,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (392,0) invalid: Problem while decoding Flate encoded stream: unknown compression method
ERROR: Stream of object (393,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (394,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (395,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (396,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (397,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (399,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (400,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (401,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (402,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (403,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (404,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (405,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
File name:          output.pdf
File size:          1846563 bytes
Pages:              1
Version:            1.5
Reconstructed:      yes (use --check for details)
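
For context on the two Flate messages in the check output above: both wordings come from zlib's validation of the two-byte header at the start of a zlib stream. Read as a 16-bit big-endian number, the header must be a multiple of 31, and the low four bits of the first byte must be 8 (deflate). The sketch below is independent of HexaPDF and uses made-up byte strings; the helper name `flate_error` is hypothetical. It only reproduces the messages and says nothing about why the streams in this particular file fail.

```ruby
require 'zlib'

# Hypothetical helper: returns zlib's error message for the given bytes,
# or nil if the bytes inflate cleanly.
def flate_error(bytes)
  Zlib::Inflate.inflate(bytes)
  nil
rescue Zlib::Error => e
  e.message
end

# A properly deflated string inflates without error.
valid = Zlib::Deflate.deflate('hello')
p flate_error(valid)              # nil

# 0x61 0x62 ("ab") is not a multiple of 31 when read as a 16-bit number.
puts flate_error('ab')            # incorrect header check

# 0x00 0x00 passes the multiple-of-31 check, but the method nibble is 0, not 8.
puts flate_error("\x00\x00".b)    # unknown compression method
```

Seeing these errors on dozens of objects at once suggests that the bytes handed to the decoder are not the real stream data, which is consistent with a reader that had to guess at object boundaries.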

Please let me know if you need any additional information to investigate this issue. Thank you for your assistance.

gettalong commented 3 months ago

Okay, that file seems to be corrupt: HexaPDF has to fall back to cross-reference reconstruction, which means it parses the file from top to bottom and tries to find all PDF objects. The current reconstruction algorithm works for many slightly corrupt or invalid files, but certainly not for all of them.
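
To illustrate the general idea of a top-to-bottom reconstruction, here is a deliberately simplified Ruby sketch. It is not HexaPDF's actual code: the method name `scan_object_offsets` and the sample data are made up for this example, and a real implementation also has to deal with trailers, object streams, and damaged object bodies.

```ruby
# Simplified illustration only; not HexaPDF's reconstruction algorithm.
# Scans raw PDF bytes for "N G obj" markers and records where each one starts.
def scan_object_offsets(data)
  offsets = {}
  # Work on a binary copy so that arbitrary bytes cannot raise encoding errors.
  data.b.scan(/(?<![0-9])(\d+)\s+(\d+)\s+obj\b/) do
    match = Regexp.last_match
    # A later definition of the same object replaces an earlier one,
    # similar to how appended revisions supersede older ones.
    offsets[[match[1].to_i, match[2].to_i]] = match.begin(0)
  end
  offsets
end

# Made-up two-object fragment with no cross-reference table at all.
sample = "%PDF-1.5\n" \
         "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n" \
         "2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"

scan_object_offsets(sample).each do |(oid, gen), pos|
  puts "object (#{oid},#{gen}) found at byte #{pos}"
end
```

A scan like this can recover object positions when the cross-reference data is missing, but it depends on the object markers themselves being intact, which is one reason no such heuristic can handle every damaged file.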

To find out why HexaPDF can't reconstruct the page tree, I would need to inspect the file. If possible, attach it to the issue; otherwise, please send it to info@gettalong.at. Without the file I won't be able to help you.

Jorge-Signwell commented 3 months ago

Hi @gettalong,

Following up on my previous comment, I've sent the corrupt PDF file to info@gettalong.at for your reference.

Hopefully, this will help diagnose the issue with reconstructing the page tree.

Thanks again for your help!

gettalong commented 3 months ago

@Jorge-Signwell Thanks for the file! I found the problem and will implement a fix.

gettalong commented 3 months ago

@Jorge-Signwell I have fixed parsing of such invalid files and they work fine now. Release 0.43.0 with the fix will be available within the hour.
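
For anyone who wants to confirm that an installed version is new enough, a small sketch using RubyGems' version comparison follows. The constant `FIXED_IN` and the helper `includes_xref_fix?` are hypothetical names for this example; only the version number 0.43.0 comes from this thread.

```ruby
require 'rubygems'

# Release named in this thread as containing the parsing fix.
FIXED_IN = Gem::Version.new('0.43.0')

# Hypothetical helper: true if the given version string is at least FIXED_IN.
def includes_xref_fix?(version_string)
  Gem::Version.new(version_string) >= FIXED_IN
end

puts includes_xref_fix?('0.42.0')   # false
puts includes_xref_fix?('0.43.0')   # true
```

With the gem loaded, the helper could be called as `includes_xref_fix?(HexaPDF::VERSION)`; in a Bundler setup, a constraint such as `gem 'hexapdf', '>= 0.43.0'` would express the same requirement.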