Open phisu opened 8 years ago
Hi,
Thanks for submitting this issue, which shows yet another way to encode images in a PDF file.
Don't worry, your code is perfectly correct!
To tell the truth, for the moment I'll ask you to be a little bit patient!
In fact, when I implemented image extraction, I decided to throw an exception when encountering unhandled ways of encoding images. My idea was simply to detect the various encodings that are not clearly described in the PDF specification. Although I'm currently handling only JPEG images, I will have a look at your sample PDF file, because it presents yet another way of encoding image data, and maybe it will help to understand how
I will get back to you as soon as I have figured out what is happening.
From: phisu [mailto:notifications@github.com] Sent: Saturday, 23 July 2016 09:42 To: christian-vigh-phpclasses/PdfToText Subject: [christian-vigh-phpclasses/PdfToText] gzuncompress(): data error (#6)
i have some pdf files which throw the following error when i try to extract the text:
$pdf = new PdfToText ($filename) ; echo $pdf->Text;
output: gzuncompress(): data error PdfToText.phpclass 1487
what can i do to prevent this error? you can find the pdf file here: http://www2.ivm-rheinmain.de/wp-content/uploads/2012/02/Leitfaden_Maerz07.pdf
thank you for your quick answer.
what do you think about introducing a switch that lets us skip the image extraction? is it possible to skip the image extraction? or maybe your class could just return a comment saying that some images were not extracted, instead of throwing an exception.
these thoughts are maybe too simple. and honestly, i have no understanding of the structure of pdf.
Yes, this is in fact what I did in the latest version (1.2.19; check the header comments rather than the git tags, because I made a mistake in the versioning info and I think they are out of sync). With this version, image data is only extracted from the PDF file if the PDFOPT_GET_IMAGE_DATA flag is specified in the $options parameter of the constructor (or in the Options property, before calling the Load method), together with PDFOPT_DECODE_IMAGE_DATA if you also want to convert the images to JPEG resources at the same time.
Images are no longer extracted by default, so it should run better in some cases (you can download the latest version).
However, even with this default behavior, my class seems to have a problem with the sample PDF file you sent me, so I have to investigate the origin of this problem a little.
I agree that throwing an exception when encountering bad image data is definitely a really bad solution, but this is only a temporary measure: as there are multiple ways to encode images in PDF files (most of them unknown to me), I bet that relying on users reporting the exceptions they receive would be a good way for me to get an overview of most of the cases that can occur.
As an example, another user got the same exception as you because his PDF file contained images in an Adobe proprietary format. He reported the problem to me and told me, like you, that he was not interested in extracting images, and this is why I changed the default behavior of my class. But at least throwing an exception in such a case helped me identify a new image format that I did not handle. Of course, I still do not handle it, but it has been identified in my code, and my class does nothing when it encounters such a case.
But wait: I can throw the exception only when debug mode is enabled, and silently ignore unrecognized image formats otherwise!
OK, so as a temporary conclusion to our current exchange:
On your side, download the latest version of my class and try again. Let me know the outcome of your testing.
On my side, I will do the following:
o Change my class so that no exception is thrown upon unrecognized image formats unless the PdfToText::$DEBUG global variable is set to true
o Investigate the problem with the first sample you sent me (Leitfaden_Maerz07.pdf), because I suspect it is not really related to what I explained above. More on this later.
In any case, I will get back to you when a new version is available (I hope it will be ready by tomorrow evening).
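The debug-only exception described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual PdfToText code: the class DemoExtractor, the exception ImageFormatException, the method handleImage() and the list of supported formats are all made up; only the idea of a static $DEBUG switch that defaults to false comes from this thread.

```php
<?php
// Hypothetical exception type, used only in this sketch.
class ImageFormatException extends Exception {}

// Hypothetical stand-in for the real class.
class DemoExtractor
{
    // Mirrors the idea of PdfToText::$DEBUG: off by default.
    public static $DEBUG = false;

    // Assumed list of formats this demo knows how to decode.
    private static $supported = ['jpeg'];

    // Returns true if the image format is handled, false if it was skipped.
    public function handleImage(string $format): bool
    {
        if (in_array($format, self::$supported, true)) {
            return true;
        }
        if (self::$DEBUG) {
            // In debug mode, surface the unknown format so it can be reported.
            throw new ImageFormatException("Unhandled image format: $format");
        }
        // Outside debug mode, silently ignore the image.
        return false;
    }
}

$extractor = new DemoExtractor();
var_dump($extractor->handleImage('jpeg'));
var_dump($extractor->handleImage('adobe-proprietary'));

DemoExtractor::$DEBUG = true;
try {
    $extractor->handleImage('adobe-proprietary');
} catch (ImageFormatException $e) {
    echo $e->getMessage(), "\n";
}
```

With this shape, ordinary users get text extraction without interruptions, while anyone investigating a file can flip the switch and see which formats were skipped.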
Christian.
From: phisu [mailto:notifications@github.com] Sent: Saturday, 23 July 2016 20:57 To: christian-vigh-phpclasses/PdfToText Cc: christian-vigh-phpclasses; Comment Subject: Re: [christian-vigh-phpclasses/PdfToText] gzuncompress(): data error (#6)
thank you very much for your very quick answer!
i downloaded the latest version of your class ( [Version : 1.2.19] [Date : 2016/07/19] ) and ran a test with the same pdf.
$filename = 'Leitfaden_Maerz07.pdf';
$pdf = new PdfToText ($filename) ;
echo $pdf->Text;
in the browser i got no output, but in the apache log i found the following error:
PHP Fatal error: Uncaught exception 'PdfToTextException' with message 'Pdf decoding error (object #425) : Invalid gzip data.' in PdfToText.php:1490\nStack trace:\n#0 PdfToText.php(1078): PdfToText->DecodeData(425, '\x08\xC0\xC5\xDFe\x1C~\xBC\x84\x1A\x7F\xB5+\xA1...', 3)\n#1 PdfToText.php(935): PdfToText->Load('Leitfaden_Maerz07.pdf')\n#2 test.php(26): PdfToText->__construct('...')\n#3 {main}\n thrown in PdfToText.php on line 1490
i am giving you another pdf file which produces the same error. maybe this helps to find out what is going wrong. http://wiki.iao.fraunhofer.de/images/studien/green-office.pdf
philipp.
Hi Philipp,
Thank you for this additional work and for sending me a second sample (it gives me a better chance of identifying the issue).
With this latest version of my class you got a different error message, but that is simply because I slightly changed it (as well as the way such errors are handled).
This is an interesting case; it is clearly not linked to image extraction, since that is disabled by default in my latest version. I suspect that some part of the PDF file has been mistakenly recognized as containing something like character maps or drawing instructions in compressed format, when it does not contain gzipped data at all.
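One quick way to test that suspicion is to look at the two-byte header of the stream before handing it to gzuncompress(). The sketch below assumes the header rules of the zlib format (RFC 1950); the function name looksLikeZlib() is made up for illustration and is not part of PdfToText. The sample bytes are the first bytes of object #425 as shown in the stack trace quoted in this thread.

```php
<?php
// Returns true when the first two bytes form a plausible zlib header (RFC 1950).
// This is only a cheap plausibility check, not a full validation of the stream.
function looksLikeZlib(string $data): bool
{
    if (strlen($data) < 2) {
        return false;
    }
    $cmf = ord($data[0]);   // compression method and flags
    $flg = ord($data[1]);   // additional flags, including check bits

    // The compression method must be 8 (deflate), and the two bytes read
    // as a 16-bit big-endian number must be a multiple of 31.
    return ($cmf & 0x0F) === 8 && (($cmf << 8) | $flg) % 31 === 0;
}

// First bytes of object #425, copied from the stack trace in the Apache log.
$fromLog = "\x08\xC0\xC5\xDF";

// A stream that really is zlib-compressed, for comparison.
$genuine = gzcompress('BT /F1 12 Tf (Hello) Tj ET');

var_dump(looksLikeZlib($fromLog));
var_dump(looksLikeZlib($genuine));
```

For the bytes from the log, the 16-bit value 0x08C0 is 2240, which is not a multiple of 31, so the data fails the header check even though its low nibble happens to be 8. That is consistent with the stream not being plain compressed data.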
Thank you for the testing, since it gave me additional information.
I'll get back to you when this issue is solved.
Christian.
From: phisu [mailto:notifications@github.com] Sent: Sunday, 24 July 2016 08:16 To: christian-vigh-phpclasses/PdfToText Cc: christian-vigh-phpclasses; Comment Subject: Re: [christian-vigh-phpclasses/PdfToText] gzuncompress(): data error (#6)
OK, I've made a little progress on this issue. The files were generated with Adobe Acrobat Distiller, and every piece that should normally be encoded in gzip format (which can be uncompressed by the standard gzuncompress() PHP function) seems to be encoded in a different format, which seems Adobe-specific.
I changed my class so that it does not throw an exception when such an encoding is encountered and the PdfToText::$DEBUG global variable is set to false (which is the default value).
However, my class is unable to extract anything from your samples: even the text-drawing instructions are compressed in this specific format, so the Text property is empty.
I had already come across this situation in one or two samples, but it did not involve text-drawing instructions.
So what I have to do now is find some reliable documentation about what looks to me like a strange compression format, then implement it. More on this later!
Christian.
Hi again Philipp,
OK, I found out what is happening. Here is what I tried:
- Save the file using Acrobat Reader: nothing changed, no text extraction happens
- Print the file using PdfCreator: it simply failed, displaying an error message saying that there was a conversion error!
- It also failed with PrimoPdf
- However, I was more successful with PdfPro 10: simply print your file, run the PdfToText class on the result, and you will see your text.
Of course, this is not an acceptable solution: it just helped me understand what is happening. The PdfPro 10 software simply removed the encryption before generating the output file.
In fact, I already knew that PDF files can be password-protected (and handling password-protected PDF files is on my to-do list). However, all the data in your samples has been encrypted, yet no password is required to read them with Acrobat. This is why there is an "invalid gzip data" error: the gzipped data needs to be decrypted before it can be uncompressed. This is yet another new case I have to handle.
OK, I will need a few days to solve this issue, but it is a really interesting case that will help me go a step further towards handling password-protected files (note that I do not intend to provide a password-cracking solution!).
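The "decrypt first, then uncompress" order can be illustrated with a small self-contained sketch. Everything here is an assumption made for demonstration: the thread does not say which cipher the sample files use, so RC4 (one of the ciphers allowed by the PDF standard security handler) is used as a stand-in, the key is a made-up string rather than one derived from a PDF encryption dictionary, and rc4() is a helper written for this example, not a PdfToText method.

```php
<?php
// Minimal RC4 implementation (symmetric: the same call encrypts and decrypts).
function rc4(string $key, string $data): string
{
    // Key-scheduling: permute the state array according to the key.
    $s = range(0, 255);
    $j = 0;
    $keyLen = strlen($key);
    for ($i = 0; $i < 256; $i++) {
        $j = ($j + $s[$i] + ord($key[$i % $keyLen])) & 0xFF;
        [$s[$i], $s[$j]] = [$s[$j], $s[$i]];
    }

    // Keystream generation, XORed with the input byte by byte.
    $i = $j = 0;
    $out = '';
    $len = strlen($data);
    for ($n = 0; $n < $len; $n++) {
        $i = ($i + 1) & 0xFF;
        $j = ($j + $s[$i]) & 0xFF;
        [$s[$i], $s[$j]] = [$s[$j], $s[$i]];
        $out .= chr(ord($data[$n]) ^ $s[($s[$i] + $s[$j]) & 0xFF]);
    }
    return $out;
}

$key   = 'hypothetical-object-key';          // placeholder, not a real PDF key
$plain = 'BT /F1 12 Tf (Hello) Tj ET';       // sample text-drawing instructions

// What an encrypted file would store: compressed first, then encrypted.
$stored = rc4($key, gzcompress($plain));

// Uncompressing the stored bytes directly is expected to fail,
// because they no longer form a zlib stream.
$direct = @gzuncompress($stored);

// Decrypting first restores the compressed stream, which then inflates normally.
$decoded = gzuncompress(rc4($key, $stored));

var_dump($direct);
var_dump($decoded === $plain);
```

In a real PDF the key would have to be derived from the document's encryption settings, which is the part that still needs to be implemented in the class.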
Christian.
hi christian,
thank you for your efforts. i hope you will find a solution, because i have tried several classes to extract text from pdf files, and it seems to me that yours is the best. let me know if i can help you.
philipp
In reply to the message from christian-vigh-phpclasses of 2016-07-25, 00:39.