christian-vigh-phpclasses / PdfToText

Extracts text from PDF files

problem with hyphen #10

Open phisu opened 8 years ago

phisu commented 8 years ago

hello Christian,

In almost every PDF we find hyphens. When a hyphen sits at the end of a line, I guess we are usually not interested in it, and the extracted text would probably be of better quality if such hyphens were eliminated. This could be done by an extra cleanup pass over the output of your class, or by your class itself. What do you think about that?

philipp

christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

Well, to tell the truth, the initial version of my class did suppress hyphens; I noticed the problem when running it on the Microsoft RTF Specification, converted to a PDF file.

I eventually removed that feature because, during the following weeks, no new sample showed such hyphens, and I was afraid of side effects.

However, it now seems to make sense to put it back. I think I will add a PDFOPT_UNHYPHENATE option to the constructor, so that the output text is post-processed to remove hyphens.

I will get back to you when the new version is available.

Christian.


From: phisu [mailto:notifications@github.com] Sent: Monday, 8 August 2016 11:37 To: christian-vigh-phpclasses/PdfToText Subject: [christian-vigh-phpclasses/PdfToText] problem with hyphen (#10)

christian-vigh-phpclasses commented 8 years ago

Oops, I completely forgot: do you have a sample to give me? Or can you point me to one of the samples you already sent me?


christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

I’m glad to tell you that the PdfToText V1.2.36 class is now able to “un-hyphenate” words. Simply specify the PDFOPT_NO_HYPHENATED_WORDS flag in the $options parameter of the constructor or of the Load() method.

I’ve noticed one unwanted side effect in your sample “150701-DSE-Katalog-verlinkt.pdf”: the output text

        à-la-carte-

        Speisen

is displayed as:

        à-la-carteSpeisen

Maybe it will get better once I have implemented more robust handling of x/y coordinates, but don’t expect miracles!

However, the rest of the text, which contains many hyphenated words, looks fine.

Christian.
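
The side effect described above can be reproduced with a very small post-processing sketch. This is not PdfToText's implementation (the class is written in PHP and its internals are not shown in this thread); `unhyphenate` is a hypothetical helper, written in Python, that applies the naive rule "a hyphen followed by a line break joins two word halves". It shows why a compound that really ends in a hyphen, such as "à-la-carte-", loses it.

```python
import re

# Hypothetical helper, NOT part of PdfToText: a naive end-of-line
# de-hyphenation rule used only to illustrate the side effect.
# Pattern: word char, hyphen, optional spaces, newline, any whitespace, word char.
_LINE_END_HYPHEN = re.compile(r"(\w)-[ \t]*\n\s*(\w)")


def unhyphenate(text: str) -> str:
    """Join word halves separated by a hyphen at the end of a line.

    The rule cannot distinguish a line-break hyphen from a hyphen that
    belongs to a compound word, so both are removed.
    """
    return _LINE_END_HYPHEN.sub(r"\1\2", text)


print(unhyphenate("hyphen-\nated words"))     # hyphenated words
print(unhyphenate("well-known fact"))         # well-known fact (mid-line hyphen kept)
print(unhyphenate("à-la-carte-\n\nSpeisen"))  # à-la-carteSpeisen
```

Telling the two cases apart presumably needs more context than the text alone, for example a dictionary lookup or layout information.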


phisu commented 8 years ago

Hello Christian, I will take a closer look at the hyphens in my PDF files. But I tried your new version [Version : 1.2.36] [Date : 2016/08/07] with the following file:

https://www.digitales.oesterreich.gv.at/documents/22124/30428/BarrierefreiesInternet_WCAG_Aspekte_SdOeB_20100818.pdf/9dc7ffb9-6420-406d-be6e-a0624e91547b

the output starts with:

61111113111399111111111111391121111911111311137111111146111911113 43 6111111311139911111111111139112111191111131113711111119111911113 7111111111361111111111119113113111381911911131213211111119111111911111111114336111111911111911387745311137444434545454443 61311113111111111111311131111911399191311 111111137111111111111921911114311131113911211119111113111371111111911191111312139 11139111111111131911111311131111111131119111111436111111119113191213111153721381111111311138119111111111111 3■3 1111311111111113111911111381111111 11111139114361111111131113811111111111111347112437119111 911119143911911911119114433■3

With version [Version : 1.2.35] [Date : 2016/08/06], the output for the same file was fine.

philipp

phisu commented 8 years ago

Hello Christian, I think eliminating hyphens is less important than accurate output of whitespace and line breaks.

philipp

christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

It’s too late! I implemented this feature in the early versions of my class, then removed it because I feared side effects.

I added it again: it was nothing and took me an hour to complete. Sometimes I need to work on easy things…

Christian.


From: phisu [mailto:notifications@github.com] Sent: Tuesday, 9 August 2016 10:03 To: christian-vigh-phpclasses/PdfToText Cc: christian-vigh-phpclasses; Comment Subject: Re: [christian-vigh-phpclasses/PdfToText] problem with hyphen (#10)

christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

I solved this problem late last night, before you ran your tests.

It was due to my complete reworking of how I handle Unicode-to-UTF-8 translation. One internal function, which used to accept a character as a parameter, now accepts an integer value. I simply missed two calls in my code that were still supplying a character value as a parameter.

The latest version, 1.2.38, solves that (I tried it on the sample you sent me).

Christian.
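
To illustrate the kind of function involved, here is a minimal Python sketch of a code-point-to-UTF-8 encoder. It is not the class's actual code; the name `codepoint_to_utf8` and the explicit type check are assumptions made for illustration. The function expects an integer code point, and a caller that still passes a character is rejected instead of silently yielding wrong bytes, which is one way such a mismatch could be caught early.

```python
def codepoint_to_utf8(cp: int) -> bytes:
    """Encode one Unicode code point (an integer) as UTF-8 bytes.

    Hypothetical illustration only; not taken from PdfToText.
    """
    if not isinstance(cp, int):
        # A caller passing a character such as "A" instead of 65 fails loudly.
        raise TypeError(f"expected an integer code point, got {type(cp).__name__}")
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {cp!r}")
    if cp < 0x80:                      # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 4 bytes: 11110xxx followed by 3 x 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(codepoint_to_utf8(0xE0))         # b'\xc3\xa0', the UTF-8 bytes of "à"
print(codepoint_to_utf8(ord("A")))     # b'A'
```

In a loosely typed language the type check would not happen automatically, so a stale call site passing a character can slip through unnoticed.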


From: phisu [mailto:notifications@github.com] Sent: Tuesday, 9 August 2016 08:31 To: christian-vigh-phpclasses/PdfToText Cc: christian-vigh-phpclasses; Comment Subject: Re: [christian-vigh-phpclasses/PdfToText] problem with hyphen (#10)

shravspy commented 4 years ago

I want to keep the hyphens in my PDF. Is there an option not to remove them when extracting with layout? As of now, it removes all the hyphens from the table in my PDF.