christian-vigh-phpclasses / PdfToText

Extracts text from PDF files

error by the /Count parameter #8

Open phisu opened 8 years ago

phisu commented 8 years ago

hello christian.

i get an error concerning page count. i did:

$pdf = new PdfToText ($filename) ; 
echo $pdf->Text;

and i got the following error:


Object #202 : Page count given by the /Count parameter (32) differs from the actual number of objects referenced by the /Kids parameter (6).
PdfToText.php
545
512

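the following files produce similar errors:

http://www.cleanclothes.at/media/common/uploads/download/cck-label-check/CCK-LabelCheck_screen.pdf
https://www.uni-muenster.de/imperia/md/content/physikalische_chemie/praktikum/h_p_saetze.pdf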

and the same error on the following file. but a repeating error too: http://www.umweltberatung.at/downloads/mehrweggetraenke-bezugsquellen-abfall.pdf


Undefined offset: 1
/PdfToText.php
2115

philipp

christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

I intentionally left an exception here because the PDF file format is so tricky regarding page descriptions that I was sure I would one day encounter a case like yours.

For your curiosity, there is a triple indirection in the way text objects are assigned to pages.

And you can even find PDF files without any page description at all! This is the case, for example, of the official Adobe PDF Specification document…

I suspect that your PDF samples contain a small inconsistency: they say that the page contents for one page are described by 32 objects, while only 6 are referenced. This may be due to a bug in the application that generated them, but even if that is the case, PDF readers need to be highly tolerant, so I will change my class accordingly.

Regarding the second problem (the repeating error), I suspect that I need to add a check somewhere.

Ok, I’ll put that in my bug tracking system.

Christian.



christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

I corrected the repeating "Undefined offset: 1" problem. It was due to improper parsing of the floating-point numbers used to specify coordinates: a value such as "0.12" was recognized, while ".12" was discarded.
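As an illustration of this kind of fix, here is a minimal sketch of a number pattern that accepts both forms. It is not the actual code from PdfToText.php; the function name `parsePdfNumbers` and the sample operand string are hypothetical. The idea is that a PDF real number may have digits on only one side of the decimal point, so the pattern needs a second alternative for the leading-dot case.

```php
<?php
// Hypothetical helper, NOT part of PdfToText: extracts the numeric operands
// from a fragment of a PDF content stream.
function parsePdfNumbers(string $fragment): array
{
    // Optional sign, then either "digits[.][digits]" or ".digits".
    // The second alternative is what lets a value like ".12" through.
    // The lookarounds keep the match from starting or ending inside
    // a longer token such as an operator name.
    $pattern = '/(?<![\w.])[+-]?(?:\d+\.?\d*|\.\d+)(?![\w.])/';

    preg_match_all($pattern, $fragment, $matches);

    // Convert each matched token to a float.
    return array_map('floatval', $matches[0]);
}

// Sample operands followed by an operator name (assumed input).
$numbers = parsePdfNumbers('.12 0.5 -3. 72 Td');
var_dump($numbers);
```

A pattern that requires at least one digit before the decimal point would skip ".12" here and return one operand too few, which is one plausible way to end up reading a missing array offset later on.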

Regarding the warning ("Page count given by the /Count parameter..."), your samples made me discover that page maps can be nested: the top-level page map lists only objects describing further page maps and gives their total count (yet another PDF surprise!).
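This nesting is consistent with how the PDF format defines page trees: an intermediate /Pages node lists its children in /Kids, and its /Count is the number of leaf pages beneath it, not the number of /Kids entries. The following is a rough sketch of walking such a tree, with objects modelled as plain PHP arrays keyed by object number. The data layout and the function name `collectPages` are assumptions made for illustration and do not reflect the PdfTexterPageMap internals.

```php
<?php
// Hypothetical in-memory model: object number => parsed dictionary.
// 'Kids' holds object numbers; a real parser would resolve "n 0 R" references.
$objects = [
    1 => ['Type' => 'Pages', 'Count' => 5, 'Kids' => [2, 3]],
    2 => ['Type' => 'Pages', 'Count' => 2, 'Kids' => [4, 5]],
    3 => ['Type' => 'Pages', 'Count' => 3, 'Kids' => [6, 7, 8]],
    4 => ['Type' => 'Page'],
    5 => ['Type' => 'Page'],
    6 => ['Type' => 'Page'],
    7 => ['Type' => 'Page'],
    8 => ['Type' => 'Page'],
];

// Returns the object numbers of all leaf pages below $id, in document order.
function collectPages(array $objects, int $id, array &$seen = []): array
{
    // Guard against dangling references and reference cycles.
    if (isset($seen[$id]) || !isset($objects[$id])) {
        return [];
    }
    $seen[$id] = true;

    $node = $objects[$id];
    if ($node['Type'] === 'Page') {
        return [$id];
    }

    // Intermediate node: recurse into each kid and concatenate the results.
    $pages = [];
    foreach ($node['Kids'] ?? [] as $kid) {
        $pages = array_merge($pages, collectPages($objects, $kid, $seen));
    }
    return $pages;
}

$pages = collectPages($objects, 1);

// The root has only 2 kids but a /Count of 5, so the consistency check
// compares /Count with the number of leaves, not with count(Kids).
$consistent = count($pages) === $objects[1]['Count'];
```

With this reading, a root node whose /Count is 32 and whose /Kids has 6 entries is not necessarily malformed: it may simply have six intermediate nodes holding 32 pages between them.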

I disabled this warning in non-debug mode. I am not yet able to evaluate whether the individual page contents extracted from your samples will be correct; however, I know that I have to modify the PdfTexterPageMap class in my source to handle this new crazy situation. I have added this to my list of open issues.

Regarding the text positioning issues you reported to me in another mail (extra spaces and extraneous line breaks), don't worry, I'm handling them in a separate thread.

Christian.

