Open phisu opened 8 years ago
Hello Philipp,
I intentionally left an exception here because the PDF file format is so tricky regarding page description that I was sure I would one day encounter a case like yours.
For your curiosity, there is a triple indirection in the way text objects are assigned to pages:
- Object #x contains a keyword that specifies a certain number of objects y1, ..., yn1.
- Each object y1, ..., yn1 references objects z1, ..., zn2. These are the contents for one page.
- In turn, each object z1, ..., zn2 lists the objects that contain the text drawing instructions to draw a part of the page.
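The three levels described above can be sketched as follows. This is only an illustration under stated assumptions: the `$objects` array, its `refs` key, and the `mapPagesToTextObjects()` function are hypothetical names invented for this sketch and are not part of the PdfToText class.

```php
<?php
// Hypothetical in-memory model: object number => list of referenced objects.
// The 'refs' key is an illustrative name, not PdfToText's internal layout.
$objects = [
    1  => ['refs' => [10, 11]],   // object #x: lists the page objects y1..yn1
    10 => ['refs' => [20, 21]],   // page y1: lists its content objects z1..zn2
    11 => ['refs' => [22]],       // page y2
    20 => ['refs' => [30]],       // z objects: list the objects holding drawing instructions
    21 => ['refs' => [31, 32]],
    22 => ['refs' => [33]],
];

/**
 * Follows the three levels of indirection and returns, for each page object,
 * the numbers of the objects that hold its text drawing instructions.
 */
function mapPagesToTextObjects(array $objects, int $root): array
{
    $pages = [];
    foreach ($objects[$root]['refs'] ?? [] as $pageId) {               // level 1: x -> y
        $pages[$pageId] = [];
        foreach ($objects[$pageId]['refs'] ?? [] as $contentId) {      // level 2: y -> z
            foreach ($objects[$contentId]['refs'] ?? [] as $textId) {  // level 3: z -> text
                $pages[$pageId][] = $textId;
            }
        }
    }
    return $pages;
}

$map = mapPagesToTextObjects($objects, 1);
```

Missing objects are treated as empty lists rather than errors, which matches the tolerance a PDF reader generally needs.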
You can even find PDF files without any page description at all! This is the case, for example, of the official Adobe PDF Specification document.
I suspect that your PDF samples contain a small inconsistency: they state that the page contents for one page are described by 32 objects, while only 6 are referenced. This may be due to a bug in the application that generated them, but even if that is the case, PDF readers need to be highly tolerant, so I will change my class accordingly.
Regarding issue #2 (the repeating error), I suspect that I need to add a check somewhere.
OK, I'll put that in my bug tracking system.
Christian.
From: phisu [mailto:notifications@github.com] Sent: Wednesday, July 27, 2016 11:00 To: christian-vigh-phpclasses/PdfToText Subject: [christian-vigh-phpclasses/PdfToText] error by the /Count parameter (#8)
hello christian.
i get an error concerning page count. i did:
$pdf = new PdfToText ($filename) ;
echo $pdf->Text;
and i got the following error:
Object #202 : Page count given by the /Count parameter (32) differs from the actual number of objects referenced by the /Kids parameter (6). PdfToText.php 545 512
the following files produce similar errors:
http://www.cleanclothes.at/media/common/uploads/download/cck-label-check/CCK-LabelCheck_screen.pdf
https://www.uni-muenster.de/imperia/md/content/physikalische_chemie/praktikum/h_p_saetze.pdf
and the same error on the following file, but a repeating error too: http://www.umweltberatung.at/downloads/mehrweggetraenke-bezugsquellen-abfall.pdf
Undefined offset: 1 /PdfToText.php 2115
philipp
Hello Philipp,
I corrected the repeating problem of « undefined offset 1 ». This was due to an improper parsing of floating point numbers used for specifying coordinates. A value such as « 0.12 » was recognized, while « .12 » was discarded.
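As an illustration of the kind of parsing involved, here is a minimal sketch of a pattern that accepts both « 0.12 » and « .12 ». It is not the actual fix applied in PdfToText; `NUMBER_PATTERN` and `extractNumbers()` are hypothetical names chosen for this example.

```php
<?php
// Matches PDF-style numbers: an optional sign, then either digits with an
// optional fractional part ("12", "0.12", "4.") or a bare fractional part (".12").
const NUMBER_PATTERN = '/[+-]?(?:\d+\.?\d*|\.\d+)/';

/**
 * Returns every number found in an operand string as a float.
 * Hypothetical helper for illustration; not part of PdfToText.
 */
function extractNumbers(string $operands): array
{
    preg_match_all(NUMBER_PATTERN, $operands, $matches);
    return array_map('floatval', $matches[0]);
}

// Coordinates written with and without a leading zero are both recognized.
$coords = extractNumbers('0.12 .12 -.5 4. 100 Td');
```

The second alternative of the pattern (`\.\d+`) is what keeps a value without a leading zero from being dropped.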
Regarding the warning (« Page count given by the /Count parameter ... »), your samples made me discover that page maps can be nested: the top-level page map lists only objects describing further page maps and gives their total page count (yet another PDF surprise!).
I disabled this warning in non-debug mode. I am not yet able to evaluate whether the individual page contents extracted from your samples will be correct; however, I know that I have to modify the PdfTexterPageMap class in my source to handle this new crazy situation. This is an issue I added to my list of open issues.
Regarding the text positioning issues you reported to me in another mail (extra spaces and extraneous line breaks), don't worry, I'm handling them in a separate thread.
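To make the nesting concrete, here is a minimal sketch of a recursive walk over such a page tree. In a PDF page tree, the /Count of an intermediate node gives the number of leaf pages below it, not the number of entries in its /Kids array, so the two values legitimately differ when page maps are nested. The `$objects` array, the object numbers, and `collectPages()` are hypothetical and only serve the illustration; they do not reflect how PdfToText stores its data.

```php
<?php
// Hypothetical nested page tree: the root lists two intermediate page maps,
// which in turn list the actual pages. Object numbers are illustrative.
$objects = [
    202 => ['Type' => 'Pages', 'Count' => 5, 'Kids' => [203, 204]],
    203 => ['Type' => 'Pages', 'Count' => 3, 'Kids' => [210, 211, 212]],
    204 => ['Type' => 'Pages', 'Count' => 2, 'Kids' => [213, 214]],
    210 => ['Type' => 'Page'],
    211 => ['Type' => 'Page'],
    212 => ['Type' => 'Page'],
    213 => ['Type' => 'Page'],
    214 => ['Type' => 'Page'],
];

/**
 * Returns the object numbers of all leaf pages below $id, in document order.
 * Cycles and dangling references are skipped instead of raising an error.
 */
function collectPages(array $objects, int $id, array &$seen = []): array
{
    if (isset($seen[$id]) || !isset($objects[$id])) {
        return [];
    }
    $seen[$id] = true;
    $node = $objects[$id];
    if (($node['Type'] ?? '') === 'Page') {
        return [$id];                       // leaf: an actual page
    }
    $pages = [];
    foreach ($node['Kids'] ?? [] as $kid) { // intermediate node: descend
        $pages = array_merge($pages, collectPages($objects, $kid, $seen));
    }
    return $pages;
}

$pages = collectPages($objects, 202);
// Here count($objects[202]['Kids']) is 2 while /Count is 5;
// /Count agrees with the number of leaf pages, count($pages).
```

Comparing /Count against the number of collected leaf pages, rather than against the size of /Kids, avoids a false warning on nested trees.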
Christian.