christian-vigh-phpclasses / PdfToText

Extracts text from PDF files

Some Japanese characters not shown/extracted properly #14

Open destinedjagold opened 7 years ago

destinedjagold commented 7 years ago

Hello and good day.

After testing with a couple of PDF files, I have discovered that not all Japanese characters are extracted properly, and quite a few of them are affected. I have attached the test PDF I am using so that you can test with it.

Thank you for your time.

Attachment: test_pdf_2.pdf https://github.com/christian-vigh-phpclasses/PdfToText/files/776273/test_pdf_2.pdf

christian-vigh-phpclasses commented 7 years ago

Hello,

Far Eastern languages are a big issue for me! The sample PDF file you sent contains many different font maps internally, as well as syntactic constructs I was not aware of.

Fortunately, you provided a short example, which will make it easier for me to debug my class.

I have put this issue on my to-do list; it will take a little time.

I will get back to you when the issue is fixed.

With kind regards,

Christian.


From: destinedjagold [mailto:notifications@github.com] Sent: Wednesday, 15 February 2017 08:52 To: christian-vigh-phpclasses/PdfToText Cc: Subscribed Subject: [christian-vigh-phpclasses/PdfToText] Some Japanese characters not shown/extracted properly (#14)

christian-vigh-phpclasses commented 7 years ago

Hello,

Thanks again for sending me this sample! This issue has been fixed in version 1.4.1.

Please feel free to contact me if you have any other issues.

With kind regards,

Christian.


wanghaisheng commented 7 years ago

@christian-vigh-phpclasses Sir, could you help me test the attached PDF? I currently use pdfminer to convert PDF to HTML, but it fails on this file: https://github.com/clear-datacenter/plan/files/524831/1.pdf.zip (after downloading, remove the .zip suffix to get the PDF file).

christian-vigh-phpclasses commented 7 years ago

Dear Sir,

I have had a look at your PDF file. Unfortunately, my PdfToText class won't be of much help to you.

This file uses CID (character ID) fonts. CID fonts were implemented by Adobe a long time before the Unicode standard emerged. They describe how to draw a character, but they do not give any information about which character it is. Even Adobe's tools do not know what the underlying character is!

To see this for yourself, open the PDF file with Acrobat Reader, select some text, and paste it into a text editor such as Notepad++: you will get the same result as my class produces, because there is no mapping between a CID and its Unicode equivalent.
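To make the missing correspondence concrete, here is a minimal Python sketch. It is not PdfToText code (that class is written in PHP) and it is not how any particular tool is implemented. It assumes that, when a PDF does carry a code-to-Unicode table, the table takes the form of a ToUnicode CMap with `beginbfchar` sections, and that character codes are fixed two-byte values. The CMap fragment, the function names `parse_bfchar` and `decode_codes`, and the sample codes are all invented for illustration and are not taken from either PDF in this thread.

```python
import re

# Hypothetical ToUnicode CMap fragment, made up for illustration.
SAMPLE_CMAP = """
2 beginbfchar
<0001> <3042>
<0002> <3044>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Collect code -> Unicode string pairs from beginbfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section)
        for src, dst in pairs:
            # The destination is UTF-16BE text written as hex digits.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_codes(hex_string, mapping, code_bytes=2):
    """Decode fixed-width character codes; unmapped codes become U+FFFD."""
    raw = bytes.fromhex(hex_string)
    out = []
    for i in range(0, len(raw), code_bytes):
        code = int.from_bytes(raw[i:i + code_bytes], "big")
        out.append(mapping.get(code, "\ufffd"))
    return "".join(out)


# With a mapping table, the codes resolve to real characters.
with_map = decode_codes("00010002", parse_bfchar(SAMPLE_CMAP))
# Without one, the same codes cannot be turned into text at all.
without_map = decode_codes("00010002", {})

print(with_map)
print(without_map)
```

The second call shows the situation described above: the codes are still perfectly good for drawing glyphs, but with no table there is nothing to translate them into Unicode, so a text extractor can only emit placeholders or gibberish.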

I have started an experimental implementation of CID font support; for the moment, it handles some Eastern European languages more or less correctly.

Regarding your PDF file, I suppose it is written in Chinese. In any case, it uses an Adobe CID font known as “GB 1-5”; there is a document from Adobe which describes all the characters in this font, giving their corresponding character IDs. But there are two pieces of bad news: first, this document from Adobe does not give the Unicode equivalent of each character ID; second, it defines a little more than 30,000 characters!

So, I am really sorry to say that I currently do not have a solution to your problem; in such cases, perhaps you could use something like OCR?

Please feel free to contact me if you have any questions.

Christian.


From: wanghaisheng [mailto:notifications@github.com] Sent: Monday, 6 March 2017 10:32 To: christian-vigh-phpclasses/PdfToText Cc: christian-vigh-phpclasses; Mention Subject: Re: [christian-vigh-phpclasses/PdfToText] Some Japanese characters not shown/extracted properly (#14)

wanghaisheng commented 7 years ago

@christian-vigh-phpclasses Thank you, sir. We currently use ocrmypdf, a wonderful toolkit, to deal with these situations.

4044ever commented 6 years ago

I am using version 1.6.7, and the PDF from post #1 gives me gibberish output that looks like a mix of Japanese and Hindi. The file opens, well, sort of, in pdfparser.org.