christian-vigh-phpclasses / PdfToText

Extracts text from PDF files

problems with german umlaut #7

Open phisu opened 8 years ago

phisu commented 8 years ago

i have some pdf-files which produce unexpected text like this:

$pdf = new PdfToText ($filename) ; echo $pdf->Text;

gef366rdert mittels AMS-Eingliederungsbeihilfe; *\ Andere F366rderungen S326B334 (Sozial366konomische Betriebe 334berlasser)S326B (Sozial366konomische Betriebe) Itworks

366 should be the german umlaut ö; 334 should be the uppercase german umlaut Ü.

what is going wrong? the pdf-file you can find here: http://www.dse-wien.at/fileadmin/media/downloads/DSE-Kataloge/150701-DSE-Katalog-verlinkt.pdf

christian-vigh-phpclasses commented 8 years ago

Hi,

Thanks for your feedback and for having taken the time to explain to me what the output should be (I’m not really fluent in German…).

Your code is perfectly correct. I suspect there is either a bug in my method of parsing character specifications in a pdf file, or that your pdf file uses a syntax for specifying characters that I do not handle yet.

I will study your sample and come back to you soon with a solution.


In reply to phisu's message of Saturday 23 July 2016, 09:50.


phisu commented 8 years ago

hi,

in german there are some special characters, which are presented in the output of the test pdf file by numbers. i can list you the numbers. maybe it helps:

344 german umlaut ä
304 uppercase german umlaut Ä
366 german umlaut ö
326 uppercase german umlaut Ö
374 german umlaut ü
334 uppercase german umlaut Ü
337 sharp s ß
226 en dash –

and i found in the same document the number 037, which represents the ligature fi. this is a stylistic ligature (https://en.wikipedia.org/wiki/Typographic_ligature).

in the same document there is another issue - maybe related with this. the first lines of the output of pdfToText is:


Besch344ftigung &  
Beratung in Wien
In Zusammenarbeit mit

 ARBEIT
 &

B
ER
A TUNG
AUF 1  BLICK

the lines cited below should be the word BERATUNG. as you can see, newlines and a whitespace have been inserted, which makes the word unrecognizable. i find this problem at other locations in this and other documents too. i can give you other examples, if you like:


B
ER
A TUNG

philipp

christian-vigh-phpclasses commented 8 years ago

Hi Philipp,

Many thanks for this additional information.

As I said, I suspect that showing numbers instead of german umlauts may come from a bug in the way I’m parsing text drawing instructions and translating them into UTF8.

However, the case of the ligature fi makes me think that there is a second problem, which is linked to the way I’m handling character maps and translating Unicode points to utf8 (I only discovered it with another sample a user sent to me). This is a part of my code I need to completely rewrite, and I hope it will be available within 2 weeks.

Regarding the issue with the word « BERATUNG », I’ll have a look at it but don’t expect miracles. I clearly see one bug, because the word is written on 3 lines ; but I’m not sure I will be able to fix the spacing problem. Characters are written in groups which are separated by a spacing value ; most of the time it is used for kerning (i.e., don’t put an « i » too close to an « n », because the pair could be mistaken for an « m »), but I discovered that sometimes it is used for relative positioning, instead of the dedicated instructions for that. This is why I introduced the MinSpaceWidth property, which says : « below this value (expressed in 1/1000 of a point), the spacing is to be considered as kerning » and : « above this value, I need to insert a space ». The default value is 250 (250 thousandths of a text point). It may be a little bit low, so I will check with a higher value.
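The threshold idea described above can be illustrated with a minimal sketch. It is not the code of the PdfToText class: the function name `joinTextChunks` and the flat array representation of a `TJ` operand are assumptions made for this example. It relies on one fact from the PDF format: in a `TJ` array, numbers are offsets in thousandths of a text-space unit, and a negative number moves the next glyph to the right.

```php
<?php
// Hypothetical helper, not part of PdfToText: joins the pieces of a
// TJ-style array, where strings are text and numbers are spacing offsets
// expressed in thousandths of a text-space unit.
function joinTextChunks(array $items, float $minSpaceWidth = 250): string
{
    $text = '';
    foreach ($items as $item) {
        if (is_string($item)) {
            $text .= $item;
            continue;
        }
        // A negative offset moves the next glyph to the right, so the
        // visible gap is the negated offset.
        $gap = -$item;
        // Above the threshold the gap is treated as a word space,
        // otherwise it is treated as kerning and ignored.
        if ($gap > $minSpaceWidth && $text !== '' && substr($text, -1) !== ' ') {
            $text .= ' ';
        }
    }
    return $text;
}

// Small offsets are kerning; only the -300 gap becomes a space.
echo joinTextChunks(['B', -20, 'ER', -35, 'A', -10, 'TUNG', -300, 'AUF']), "\n";
```

With the default threshold of 250 this prints `BERATUNG AUF`. Raising the threshold makes the extractor less eager to insert spaces, at the risk of gluing real words together.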

Christian.


In reply to phisu's message of Sunday 24 July 2016, 09:23.


christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

The problem with german umlauts should be solved now for the sample you sent to me.

For your information, all text drawing instructions in a pdf file can refer to character maps, which can be considered as translation tables. The instructions contain character codes, which are simply entries in the currently active character map and must be replaced with their substitution character. Character codes in this case are specified in hexadecimal notation, but they can also be specified in escaped octal notation (e.g. « \344 » for the german umlaut ä, as in your example).

So everything went fine for blocks of text using character maps.

However, it is also possible for a pdf file to specify directly the characters to be drawn, without using a character map. In this case, it is another piece of code in my class that handles this. Unfortunately, I was not aware, until you sent me your sample, that I could also encounter escaped octal notations in such cases. I corrected this issue so the umlauts should display correctly.
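To make the escaped octal notation concrete, here is a minimal sketch of how a PDF literal string could be decoded. It is not the code used in the class: `decodePdfLiteral` and `latin1ToUtf8` are hypothetical names, only the common escapes are handled, and the bytes are assumed to be Latin-1/Windows-1252 text in the range where both agree with Unicode.

```php
<?php
// Hypothetical helper, not the PdfToText implementation: resolves the
// backslash escapes of a PDF literal string into raw bytes.
function decodePdfLiteral(string $s): string
{
    $simple = ['n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08",
               'f' => "\x0C", '(' => '(', ')' => ')', '\\' => '\\'];

    return preg_replace_callback(
        '/\\\\([0-7]{1,3}|.)/s',
        function (array $m) use ($simple): string {
            $code = $m[1];
            if (preg_match('/^[0-7]+$/', $code)) {
                // Octal escape: \344 is byte 0xE4; overflow above 0xFF is dropped.
                return chr(octdec($code) & 0xFF);
            }
            // Known single-character escape; otherwise just drop the backslash.
            return $simple[$code] ?? $code;
        },
        $s
    );
}

// Assumption for this sample: bytes >= 0x80 are Latin-1 characters, whose
// byte value equals their Unicode code point.
function latin1ToUtf8(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $ch) {
        $b = ord($ch);
        $out .= $b < 0x80 ? $ch : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }
    return $out;
}

echo latin1ToUtf8(decodePdfLiteral('gef\366rdert, Besch\344ftigung')), "\n";
```

This prints `gefördert, Beschäftigung`. A decoder that only strips the backslash, without reading the three octal digits as one byte, would produce output of the `gef366rdert` kind reported at the top of this thread; that reading is an interpretation, not something confirmed in the thread.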

However, two issues (at least) remain with your sample :

Christian.



phisu commented 8 years ago

hello christian.

thank you for your solution for the problem with the german umlaut. great!

i did tests with other pdf files and i detected another issue.

$pdf = new PdfToText ($filename) ; 
echo $pdf->Text;

the output of the following file is utf-8 encoded A) http://www.dse-wien.at/fileadmin/media/downloads/DSE-Kataloge/150701-DSE-Katalog-verlinkt.pdf

the output of the following file is windows-1252 (i have other examples too, if you need some others) B) https://www.digitales.oesterreich.gv.at/at.gv.bka.liferay-app/documents/22124/30428/BarrierefreiesInternet_WCAG_Aspekte_SdOeB_20100818.pdf/9dc7ffb9-6420-406d-be6e-a0624e91547b

when i look at the properties of file B in the pdf viewer Evince, i see different character sets: identity-H and windows-1252. i suspect that a pdf can contain blocks with different character sets. so it can happen that i have to convert different character sets within one string to a single character set.

i tried the class of https://github.com/neitanod/forceutf8 and it seems to work:

include('Encoding.php'); 
use \ForceUTF8\Encoding;
$encoding = new Encoding();
echo $encoding->toUTF8($text);

i think it could be easier. in your class you treat each block from which you know the encoding. so you could set a parameter, that the output should be i.e. utf-8 and every block has to be converted to the forced character set. what do you think about that?
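The suggestion above could look roughly like the following sketch. It is purely illustrative: the function `blocksToCharset` and the `[encoding, bytes]` block representation are assumptions, not part of PdfToText, and the conversion relies on PHP's `iconv` extension being available.

```php
<?php
// Hypothetical sketch: each extracted block carries the encoding it was
// decoded from, and every block is normalised to one target charset.
function blocksToCharset(array $blocks, string $target = 'UTF-8'): string
{
    $out = '';
    foreach ($blocks as $block) {
        [$encoding, $bytes] = $block;
        // Blocks already in the target charset are passed through untouched.
        $converted = strcasecmp($encoding, $target) === 0
            ? $bytes
            : iconv($encoding, $target, $bytes);
        if ($converted === false) {
            throw new RuntimeException("Cannot convert block from $encoding");
        }
        $out .= $converted;
    }
    return $out;
}

// One block is already UTF-8, the other is Windows-1252.
echo blocksToCharset([
    ['UTF-8', "gef\xC3\xB6rdert "],
    ['Windows-1252', "F\xF6rderungen"],
]), "\n";
```

Converting per block, while the source encoding is still known, avoids having to guess the encoding of a mixed string afterwards.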

phisu commented 8 years ago

Regarding the issue with whitespace and newlines (e.g. in the word « BERATUNG »), here are some other examples.

example 1 http://www.umweltbundesamt.at/fileadmin/site/umweltthemen/chemikalien/Symbole_RuS_DE.pdf

in this file you can find this issue frequently. e.g. look at the output of the following line on page 1

Symbole; Gefahrenhinweise (R-Sätze) und Sicherheitsratschläge (S-Sätze)

output:

S y mb o l e ;

G e f a h r e nh i n w e i se ( R -S ä t z e )

u nd

S i c h e r h e i t sr a t s c h läg e

( S -S ä t z e)

example 2

in the output of page 13 of the following file newlines are missing: http://www.dse-wien.at/fileadmin/media/downloads/DSE-Kataloge/150701-DSE-Katalog-verlinkt.pdf

Arbeitskräfteüberlassung*   14
SÖB Service 4*                    16

Job-TransFair  
Integrationsleasing*            18
SÖB Kümmerei*                  20

output:

Arbeitskräfteüberlassung* 14SÖB Service 4* 16Job-TransFair
Integrationsleasing* 18SÖB Kümmerei* 20

example 3 http://www.cleanclothes.at/media/common/uploads/download/cck-label-check/CCK-LabelCheck_screen.pdf

for example, on page 1:

Produkte aus biologisch angebauter und fair gehandelter Baumwolle. Achten Sie auf ­Gütesiegel wie FAIRTRADE und GOTS.

is output as:

P rodukte aus biologisch angebauter und fair

gehandelter Baumwolle. Achten Sie auf G ütesiegel wie FA I RTRADE und GO TS.

or

Unternehmen mit sozialen Standards. Bevorzugen Sie Unternehmen, die Mitglieder bei Kontroll-Initiativen wie der FWF sind.

gets:

Unternehmen mit sozialen Standards. Bevorzugen Sie Unternehmen, die Mitglieder
bei Kontroll- I nitiativen wie der FWF sind.

phisu commented 8 years ago

a strange example for the character mapping is the following file:

http://www.oekoevent.at/uploads/2010/09/FACTSHEET_2008.pdf

e.g. the first paragraph on page 6 produces a strange output

  1. MÜLLBEHÄLTERN FÜR DIE GETRENNTE ABFALLSAMMLUNG IM BEREICH DER GASTRONOMIE (KÜCHE, BAR, BUFFET) Durch einfache Trennmaßnahmen lässt sich die Restmüllmenge in einem Restaurant oder einer Imbissstube um bis zu 50 % verringern. Auf Wunsch des Veranstalters kann die Entsorgung aller Fraktionen organisiert werden.

output:

0 -�,,"%(�,4%2. F�2 $)% '%42%..4% A"FA,,SA--,5.' )- "%2%)#( $%2
'AS42/./-)% +�#(% "A2 "5FF%4 $urch einFache 4rennMa�nahMen l�sst sich die 2estM�llMenge in eineM 2estaurant oder einer )MBissstuBe uM Bis zu 0  verringern AuF 7unsch des Veranstalters kann die %ntsorgung aller Fraktionen organisiert Werden

hope that helps to find the issue. philipp

christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

Can you confirm to me that sample A) gives correct results ?

You are perfectly right : the same pdf file can contain many different character encodings. The first sample I received in May contained text in English, Arabic, Chinese, Russian and Hebrew. To tell the truth, that user feels a little bit frustrated because I have not solved his issues yet. However, those issues are a little bit hard to understand since the pdf file format can be really tricky.

This is why the sample you sent me is of great help : it clearly identifies one of the problems I have, and it will allow me to address part of that user’s problems by giving me a better understanding of character encoding.

And thanks for the link on github : It provides me with valuable information about encoding problems.

I’m afraid however that before handling your problem, I will have to completely rewrite my class that handles unicode character maps and translations to utf8, because I’m seeing more and more small problems coming from the way it is written now.

Ok, I’m putting this issue in my ultra-performant bug tracking system, an Excel 2003 file…

I will come back to you once this issue is solved.

Christian.


In reply to phisu's message of Wednesday 27 July 2016, 10:32.


christian-vigh-phpclasses commented 8 years ago

OK, your samples present a new way of positioning characters that I do not handle correctly.

As usual, I’ll come back to you when the issue is solved !

Christian.


In reply to phisu's message of Wednesday 27 July 2016, 12:28.


christian-vigh-phpclasses commented 8 years ago

Ooops ! I found such strange behavior in the early days of the PdfToText class development, on a sample I picked somewhere on the web.

In fact some sentences were completely missing ; if I printed the file using a tool like PdfCreator, I got a result similar to yours regarding the missing sentences.

I gave up at that time because the missing sentences used a font that was defined I don’t know where, but the sample you sent me last week (the one which caused the gzuncompress error), in which I told you some parts were encrypted (although the file was not password-protected), may help in solving this issue.

As usual, I put that on my very elaborate bug tracking system…

Christian.


In reply to phisu's message of Wednesday 27 July 2016, 12:36.


phisu commented 8 years ago

hello christian,

yes, i can confirm that the output of sample A) is utf-8 encoded.

philipp

In reply to christian-vigh-phpclasses's message of 2016-07-27, 13:11.


christian-vigh-phpclasses commented 8 years ago

OK, I solved this problem in the latest version (tagged 1.2.25, but rely only on the version number in the source file header comment).

For your information, in the postscript-like language that is used in PDF files to draw things, there is an instruction to select the applicable font. It can for example take the following form :

        /R22 10 Tf

« Tf » is the instruction that means « select a new font », and « 10 » is the text size (but don’t ask me how it is computed, I don’t know yet). « /R22 » means : the characteristics of the font I want to use are described in PDF object #22. Such characteristics may include, for example, the object number of the Unicode character map associated with this font. So far, so good…

There is a second notation that is authorized :

        /F1 10 Tf

This is an indirection. Somewhere in another object of your PDF file, you will find something like this :

        <</Font<</F1 22 0 R>> … >>

Which says : « every time I’m referring to font #1 using the « /F1 » notation, you’ll have to look at object 22 to retrieve its characteristics ».

And that’s all. No more than that in the Adobe PDF specifications.

However one day, someone sent me a sample where I found font references of the following form :

        /f1-0 1 Tf

And even :

        /f-1-0 1 Tf

So I adapted my class to handle this new « syntax ».

Then came your sample, FACTSHEET_2008.pdf, which includes references such as :

        /C0_0 1 Tf

I suspect that when you do not use the /Rx notation, which directly references an object in the pdf file, you can use almost any notation you like to specify an indirect reference (such as /Fx).

However, since I am not sure, I handled this as yet-another new special case. If one day I receive new samples using more different notations, then I’ll have to rework this to provide a more generic method.
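A more generic method along those lines could be sketched as follows. This is not the code of the class: `parseFontResources` and `fontSelections` are hypothetical names, and the regular expressions only cover simple, uncompressed input like the fragments quoted above. The sketch rests on the PDF specification's definition of `Tf`, whose first operand is a name looked up in the `/Font` resource dictionary. Under that reading any name is acceptable (`/F1`, `/f-1-0`, `/C0_0`), and `/R22` would also be a dictionary key rather than a direct object reference.

```php
<?php
// Hypothetical sketch, not PdfToText code: builds a map from font resource
// names to object numbers out of entries such as "/F1 22 0 R".
function parseFontResources(string $dict): array
{
    preg_match_all('#/([^\s/<>\[\]()]+)\s+(\d+)\s+\d+\s+R#', $dict, $m, PREG_SET_ORDER);
    $fonts = [];
    foreach ($m as $entry) {
        $fonts[$entry[1]] = (int) $entry[2];
    }
    return $fonts;
}

// Finds every "/<name> <size> Tf" instruction and resolves the name through
// the map, treating the name as an opaque key whatever it looks like.
function fontSelections(string $content, array $fonts): array
{
    preg_match_all('#/([^\s/<>\[\]()]+)\s+([\d.]+)\s+Tf\b#', $content, $m, PREG_SET_ORDER);
    $result = [];
    foreach ($m as $sel) {
        $result[] = [
            'name'   => $sel[1],
            'size'   => (float) $sel[2],
            'object' => $fonts[$sel[1]] ?? null, // null: name not declared
        ];
    }
    return $result;
}

$fonts = parseFontResources('<</Font<</F1 22 0 R /C0_0 31 0 R /f-1-0 7 0 R>> >>');
print_r(fontSelections("/C0_0 1 Tf\n/f-1-0 1 Tf\n/F1 10 Tf", $fonts));
```

Because the name is never parsed for structure, a new naming scheme in a future sample would not need another special case.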

As another topic, if you have a look at the output of my class using the new version, you will notice that a few characters are not correctly translated, like if there was a problem in utf8 translation. This is the case indeed and I need to completely rewrite this part of my code, which will also address issues submitted by many users, including the very first one who submitted a document using various foreign languages.

It will take me a few days but after that, I will be able to have a closer look at some of your other issues, such as Windows-1252 handling and extraneous line breaks.

Christian.



christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

I would like to thank you for your work ; it helped me a lot to find out what was happening !

In fact, most fonts use character maps, which are to be considered as character substitution tables.

But some fonts use Adobe “standard” character maps, which simply use the Windows Ansi or Mac OS Roman character sets :

https://msdn.microsoft.com/en-us/goglobal/cc305145.aspx

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT

I was not paying the slightest attention to mapping the original character from these code pages to the appropriate Unicode character. In fact, there is a one-to-one correspondence for most of the character codes.

However, if you have a closer look at the character maps at the links above, you will notice that this is not always true : for Windows Ansi, the range 0x80..0x9F maps to different Unicode character codes ; and for Mac OS Roman, this is the case for the whole range from 0x80 to 0xFF.

This explains why some characters, like the euro sign (or the TM sign in a sample another user sent to me), were not translated properly.
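The Windows Ansi case can be illustrated with a short sketch. It is not the code shipped in release 1.2.27: the names `CP1252_HIGH`, `codePointToUtf8` and `cp1252ToUtf8` are made up for this example, and Mac OS Roman, which would need a table for the whole 0x80..0xFF range, is left out.

```php
<?php
// Sketch only: Windows-1252 bytes in 0x80..0x9F and the Unicode code points
// they stand for. The five unassigned bytes are deliberately absent.
const CP1252_HIGH = [
    0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
    0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
    0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
    0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
    0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
    0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
    0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
];

// Encodes a code point below U+10000 as UTF-8 (enough for this table).
function codePointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

function cp1252ToUtf8(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $ch) {
        $b = ord($ch);
        // Bytes missing from the table keep their own value as code point.
        $out .= codePointToUtf8(CP1252_HIGH[$b] ?? $b);
    }
    return $out;
}

echo cp1252ToUtf8("Preis: 5 \x80, Marke\x99, F\xF6rderung"), "\n";
```

The euro sign (0x80) and the TM sign (0x99) go through the table, while `ö` (0xF6) keeps its code point, which is the one-to-one correspondence mentioned above.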

I implemented a proper way of handling that situation so, if you like, a new release 1.2.27 is available here :

http://www.phpclasses.org/package/9732-PHP-Extract-text-contents-from-PDF-files.html

Please feel free to contact me if you have further issues (the other issues you submitted to me are still under work…).

With kind regards,

Christian.


From: phisu [mailto:notifications@github.com] Sent: Wednesday, 27 July 2016 10:32 To: christian-vigh-phpclasses/PdfToText Cc: christian-vigh-phpclasses; Comment Subject: Re: [christian-vigh-phpclasses/PdfToText] problems with german umlaut (#7)

hello christian.

thank you for your solution to the problem with the german umlaut. great!

i ran tests with other pdf files and found another issue.

$pdf = new PdfToText ($filename) ; echo $pdf->Text;

the output for the following file is utf-8 encoded: A) http://www.dse-wien.at/fileadmin/media/downloads/DSE-Kataloge/150701-DSE-Katalog-verlinkt.pdf

the output for the following file is windows-1252 (i have other examples too, if you need them): B) https://www.digitales.oesterreich.gv.at/at.gv.bka.liferay-app/documents/22124/30428/BarrierefreiesInternet_WCAG_Aspekte_SdOeB_20100818.pdf/9dc7ffb9-6420-406d-be6e-a0624e91547b

when i look at the properties of file B in the pdf viewer Evince, i see different character sets: Identity-H and windows-1252. i suspect that a pdf can contain blocks with different character sets, so it can happen that i have to convert the different character sets within one string into a single character set.

i tried the class from https://github.com/neitanod/forceutf8 and it seems to work:

include('Encoding.php'); use \ForceUTF8\Encoding; $encoding = new Encoding(); echo $encoding->toUTF8($text);

i think it could be easier. in your class you process each block and know its encoding, so you could add a parameter saying that the output should be, e.g., utf-8, and every block would then be converted to that character set. what do you think about that?
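
The suggestion of converting every block to one forced output character set could look roughly like the sketch below. It assumes that each block arrives with a known encoding label; the `blocksToCharset()` function and the block structure are hypothetical and not part of PdfToText. The conversion itself uses PHP's `iconv()`.

```php
<?php
// Hypothetical sketch: each block carries the encoding its text is stored in.
function blocksToCharset(array $blocks, string $target = 'UTF-8'): string
{
    $out = '';
    foreach ($blocks as $block) {
        $text = $block['text'];
        if (strcasecmp($block['encoding'], $target) !== 0) {
            // iconv() returns false on failure; keep the raw bytes in that case.
            $converted = iconv($block['encoding'], $target, $text);
            $text = ($converted === false) ? $text : $converted;
        }
        $out .= $text;
    }
    return $out;
}

$blocks = [
    ['encoding' => 'UTF-8',        'text' => "gef\xC3\xB6rdert"], // already UTF-8
    ['encoding' => 'WINDOWS-1252', 'text' => " f\xFCr Umwelt"],   // single-byte ü
];
echo blocksToCharset($blocks), PHP_EOL;
```

Converting at the point where the encoding is known avoids having to guess it afterwards from a mixed string.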


phisu commented 8 years ago

hello Christian, the tests on my pdf samples look good. but i found a related problem: in

http://www.dse-wien.at/fileadmin/media/downloads/DSE-Kataloge/150701-DSE-Katalog-verlinkt.pdf

on page 3 you can find the string:

fix und fertig\ 62

the output of this string is:

x und fertig**

it seems to me that the ligature for fi is not translated correctly.

another example where the ligature is not shown correctly is in the same pdf on page 6:

Sozialversicherungspflicht

output:

Sozialversicherungspicht

on page 8

beruflich

output:

beruichen

it seems to me that the ligature ft is translated correctly, or i did not find an example.

in the same pdf i tried to find other examples of ligatures, but it is difficult to search for them. however, i found another problem. take a look at page 6. there you can see that the heading "DSE-Wien Ihr Partner in der aktiven Arbeitsmarktpolitik" is not extracted, and none of the paragraphs on this page except the first one is extracted. the output of this page is:

Sozialintegrative Unternehmen Die vom DSE-Wien vertretenen Organisationen, die Sie in diesem Katalog nden, unterstützen langzeitarbeitslose bzw. arbeitsmarktferne Personen auf vielfältige Weise. Diese gemeinnützigen Organisationen gliedern sich in Beratungsstellen, Sozial-ökonomische Betriebe, Gemeinnützige Beschäftigungsprojekte, gemeinnützige Arbeitskräfteüberlassungen sowie Unternehmen, die mittels Eingliederungsbeihilfe individuell geförderte MitarbeiterInnen beschäftigen. „

another example of this issue can be found on page 63: only the first two lines of that page are extracted as text.

later i will check other documents and report back to you.

philipp

christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

Thank you very much for this precious information; it would have taken me hours to find this issue!

In fact, I am trying to implement a better way of handling Unicode characters, which should be finished by the time the class reaches version 1.3.

Meanwhile, I am making intermediate experiments, as was the case for versions 1.2.27 and 1.2.28. Apparently, my first attempt at translating Unicode characters of more than 2 bytes (which should be the case for the letter pairs with ligatures) has failed.

Now that I have a precise example to work on, I will be able to review my translation method and take a closer look at what happens.

Christian.



christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

I answered a little too quickly: I did not see the last issue (some paragraphs not being extracted on page 6).

I have seen the same issue with two other samples. For the latest of them, I could not extract anything! It seems that, although you can freely copy or print the pdf contents, there is some encryption mechanism, so that the character map definitions, instead of being stored in gzip format, are encrypted (and maybe gzipped?). I’ll start with that sample, because both its character maps and its text drawing instructions seem to be encrypted. If I’m able to find out how this particular file is encoded, then I’m confident that it will solve all the issues encountered with the other samples, including yours.

I have put that on my to-do list, and I’ll have a look at it once I have finished handling Unicode translations correctly.

Christian.



christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

The problem with the ligature characters is absolutely not linked to bad Unicode translation (or at least, not yet).

In fact, while investigating the internals of your pdf file, I noticed that whenever those special characters were drawn, they used a font whose object number was not defined anywhere.

This is how I discovered a type of object that was new to me, called “object stream”: an object containing gzipped data which, once unzipped, reveals the additional objects it contains. This is where I found the missing objects describing the fonts that use the ligature characters.

This may explain several oddities I found in samples sent by other users, and I’m quite sure that it explains the missing paragraphs on page 6 of yours: they must be contained in “object streams”, which are currently not decoded, hence their absence.
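
As a rough illustration of what decoding an object stream involves: in the PDF format, the unzipped data starts with pairs of integers (object number, then offset relative to the first object), followed by the objects themselves, and the stream dictionary gives the number of objects (/N) and the position of the first one (/First). The sketch below assumes the data has already been inflated (for Flate-compressed data, PHP's `gzuncompress()` could be used); `splitObjectStream()` and the sample objects are hypothetical and not taken from PdfToText or from the sample file.

```php
<?php
// Hypothetical helper: split already-inflated object stream data into its
// member objects. $n is the /N entry and $first the /First entry of the
// stream dictionary. Returns an array keyed by object number.
function splitObjectStream(string $data, int $n, int $first): array
{
    // The header holds $n pairs: "objectNumber offset", separated by whitespace.
    $numbers = preg_split('/\s+/', trim(substr($data, 0, $first)));
    $objects = [];
    for ($i = 0; $i < $n; $i++) {
        $objNum = (int) $numbers[2 * $i];
        $start  = $first + (int) $numbers[2 * $i + 1];
        // An object ends where the next one starts, or at the end of the data.
        $end = ($i + 1 < $n)
            ? $first + (int) $numbers[2 * ($i + 1) + 1]
            : strlen($data);
        $objects[$objNum] = trim(substr($data, $start, $end - $start));
    }
    return $objects;
}

// Made-up example with two objects, numbered 11 and 12.
$a = '<< /Type /Font /Subtype /Type1 >>';
$b = '<< /Type /FontDescriptor >>';
$header = '11 0 12 ' . (strlen($a) + 1) . ' ';
$data = $header . $a . "\n" . $b;

$objects = splitObjectStream($data, 2, strlen($header));
print_r($objects);
```

Objects recovered this way would then be handled like any other object in the file, which is why fonts and paragraphs stored only in object streams go missing when the streams are skipped.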

It will take me a few days to implement, but I’m sure it’s worth the work…

Once I have implemented it, I will be able to decode those characters, which are specified as 6-byte Unicode characters.
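
For illustration, a character-map destination made of several UTF-16BE code units can be turned into UTF-8 as sketched below. The hex strings are made-up examples, not values read from the sample file, and `hexToUtf8()` is a hypothetical helper built on PHP's `hex2bin()` and `iconv()`.

```php
<?php
// Hypothetical helper: decode a hex string of UTF-16BE code units into UTF-8.
function hexToUtf8(string $hex): string
{
    if (strlen($hex) % 4 !== 0 || !ctype_xdigit($hex)) {
        throw new InvalidArgumentException("Not a UTF-16BE hex string: $hex");
    }
    $bytes = hex2bin($hex);                      // "00660069" becomes 4 raw bytes
    $utf8  = iconv('UTF-16BE', 'UTF-8', $bytes); // false on conversion failure
    if ($utf8 === false) {
        throw new RuntimeException("Cannot convert $hex to UTF-8");
    }
    return $utf8;
}

echo hexToUtf8('00660069'), PHP_EOL;     // two code units (4 bytes): "f" and "i"
echo hexToUtf8('006600660069'), PHP_EOL; // three code units (6 bytes)
echo hexToUtf8('FB01'), PHP_EOL;         // one code unit: the U+FB01 ligature
```

A destination of more than one code unit therefore expands to several output characters rather than to a single code point.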

Christian.



christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

I published a new version, which now handles “object streams”, i.e. a pdf object that in turn contains several pdf objects. This is why some paragraphs of page 6 were missing: they were “hidden” in the object streams I did not process.

If you do a trial run on “150701-DSE-Katalog-verlinkt.pdf”, you will notice that the special characters using ligatures are “missing”; in fact, they are still present. If you set the PdfToText::$Utf8Placeholder static property to something like “[Unknown character : 0x%X]”, the output text will contain things like “[Unknown character : 0x660066]”, which is the value reported for the “fi” ligature (or “fl”, I don’t remember). But you won’t see them by default, because the default value of the $Utf8Placeholder property is the empty string.
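
To see what such a placeholder format produces, here is a small standalone sketch. It only shows how `sprintf()` expands `%X`; that the class applies the placeholder in exactly this way is an assumption, and `renderPlaceholder()` is a hypothetical helper, not a PdfToText method.

```php
<?php
// Hypothetical helper: render an unknown character code with a placeholder format.
function renderPlaceholder(string $format, int $code): string
{
    // An empty format yields an empty string, so nothing is visible in the output.
    return $format === '' ? '' : sprintf($format, $code);
}

echo renderPlaceholder('[Unknown character : 0x%X]', 0x660066), PHP_EOL;
echo renderPlaceholder('', 0x660066), PHP_EOL; // prints an empty line
```

With the class itself, the property would be set before extracting the text, for example `PdfToText::$Utf8Placeholder = '[Unknown character : 0x%X]';` followed by the usual `new PdfToText ($filename)`.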

Also, the output of this version on your sample now has 4883 lines (instead of around 4400 in the previous version).

However, I still have two big issues:

Unveiling the contents of object streams revealed not only whole paragraphs of missing text, font descriptions and character maps, but also page content description constructs (they specify which objects are related to which page).

Until last week, I was not aware that such constructs could be nested (a page content description object can refer to other objects containing inner page content descriptions, instead of referring directly to the page contents themselves).

This is why the summary table on page 3 of the pdf file now appears somewhere in the output text, but not at the position where it was intended to be.

I will fix this issue in version 1.2.31.
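
The nesting described above can be pictured with a small recursive walk. This is only a sketch on a made-up in-memory model: the `kids` and `contents` keys, the object numbers and `collectPages()` are assumptions for illustration, not the class's data structures.

```php
<?php
// Hypothetical model: object number => node. A node either lists 'kids'
// (further page description objects) or 'contents' (content object numbers).
function collectPages(array $objects, int $id, array &$pages = []): array
{
    $node = $objects[$id];
    if (isset($node['kids'])) {
        // Inner description: descend, keeping the order of the kids.
        foreach ($node['kids'] as $kid) {
            collectPages($objects, $kid, $pages);
        }
    } else {
        // Leaf: one page, with the content objects that belong to it.
        $pages[] = $node['contents'];
    }
    return $pages;
}

$objects = [
    1 => ['kids' => [2, 5]],
    2 => ['kids' => [3, 4]],      // nested description
    3 => ['contents' => [10]],
    4 => ['contents' => [11, 12]],
    5 => ['contents' => [13]],
];

print_r(collectPages($objects, 1));
```

Walking the descriptions depth-first, in the order they are listed, gives the pages in document order regardless of how deeply they are nested.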

Christian.



phisu commented 8 years ago

hello Christian,

thanks a lot for your work! here is another example of the encoding issue. maybe it confirms your understanding of the unicode translation in pdf files. to me it is a strange example, because you can find correct output of german umlauts and wrong output of german umlauts and other letters in one line:

http://www.oekoevent.at/uploads/2010/09/factsheet_giveaways_2007.pdf on page 2 you can find the following paragraph:

Impressum Herausgeberin: Stadt Wien, Geschäftsgruppe Umwelt, 1082 Wien, Rathaus Autorinnen: Mag Henriette Gupfinger, Mag Andrea Ebner Österreichische Gesellschaft für Umwelt und Technik - ÖGUT 1020 Wien, Hollandstrasse 10/46

output:

Impressum Herausgeberin Stadt Wien, 'eschäftsgruppe 5mwelt, 102 Wien, Rathaus Autorinnen Mag a Henriette 'upfinger, Mag a Andrea %bner ¾sterreichische 'esellschaft fàr 5mwelt und Technik - ¾'5T 1020 Wien, Hollandstrasse 10

i am using the following version of your class: [Version : 1.2.35] [Date : 2016/08/06]

philipp

christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

Your sample helped me identify a bug: in some cases, text drawing instructions were incorrectly parsed and included extra data, which the upper layers interpreted as NUL characters. This has been fixed.

You’ll still notice that some characters are improperly displayed, most of them being control characters (but not only). For example, the text at the start of the output should be:

        für VeranstaltungenEine Initiative von Umweltstadträtin Ulli Sima

but is displayed as:

        fàr 6eranstaltungenEine Initiative von Umweltstadträtin Ulli Sima

This is due to a problem I was able to identify only a few days ago: it’s related to what I’ll call “font aliases”. A font alias is a means of saying: “whenever I use the font reference /C0_1, you have to look at PDF object #x to get its properties”. What I was unaware of is that the same font alias can be redefined at the page (and not document) level. So, for example, your sample file says for page 1: “/C0_1 refers to object x”, and for page 2: “/C0_1 refers to object y”.

As I’m handling this case at the document level, the association “alias /C0_1 – object y” overrides the association “alias /C0_1 – object x”, hence the bad character mapping I cited in the above example.

I have to completely rethink the way I’m extracting page information (i.e., which contents are associated with which page, but this is another issue for me) and take into account the fact that each page can have its own context, such as its own font alias definitions.
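
The difference between document-level and page-level handling of font aliases can be shown with a short sketch. The object numbers 10 and 20 stand in for “object x” and “object y”; they, the data layout and `resolveFont()` are made up for this example.

```php
<?php
// Hypothetical input: font alias definitions as (page, alias, object number).
$definitions = [
    [1, '/C0_1', 10],
    [2, '/C0_1', 20],
];

$documentLevel = []; // one table for the whole document
$pageLevel     = []; // one table per page

foreach ($definitions as [$page, $alias, $objectId]) {
    // Document level: a later definition silently replaces an earlier one.
    $documentLevel[$alias] = $objectId;
    // Page level: each page keeps its own association.
    $pageLevel[$page][$alias] = $objectId;
}

// Look up the font object for an alias in the context of a given page.
function resolveFont(array $pageLevel, int $page, string $alias): ?int
{
    return $pageLevel[$page][$alias] ?? null;
}

echo $documentLevel['/C0_1'], PHP_EOL;             // same object for every page
echo resolveFont($pageLevel, 1, '/C0_1'), PHP_EOL; // object defined for page 1
echo resolveFont($pageLevel, 2, '/C0_1'), PHP_EOL; // object defined for page 2
```

With the single document-level table, text on page 1 would be decoded with the character map of the page 2 font, which matches the kind of wrong substitutions shown above.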

This will take me a little time but is already on my todo list.

Christian.



phisu commented 8 years ago

hello Christian,

i tried to take a closer look at the text extracted from http://www.dse-wien.at/fileadmin/media/downloads/DSE-Kataloge/150701-DSE-Katalog-verlinkt.pdf with version 1.2.35 of your class. but it is hard to check whether the extracted text is complete, because the output is not in the same order as the pages of the pdf.

an example:

i find the text of the last paragraph of page 6

Positive Effekte der Beschäftigung in einem Sozialintegrativen Unternehmen zeigen sich auch im weiteren Erwerbsleben: Personen, die in Form von Transitarbeitsplätzen gefördert wurden, sind in den Jahren danach seltener und weniger lang von Arbeitslosigkeit betroffen. Auch das Einkommen der Geförderten fällt deutlich höher aus: Ehemalige Transitarbeitskräfte erzielen in den folgenden Jahren durchschnittlich ein um ein Drittel höheres Einkommen als vergleichbare ungeförderte Per- sonen. So leisten Sozialintegrative Unternehmen einen wichtigen Beitrag zur Armutsvermeidung und zur sozialen Inklusion.

after that the subheading of page 5

Sozialintegrative Unternehmen

then the text of page 16, then a subheading without a preceding line break (i cannot tell which page it belongs to, because the same subheading appears on several pages around page 16)

Tätigkeitsfeld

then the text of page 21 but not the heading:

Die KÜMMEREI ist ein Projekt der Job-TransFair GmbH, einem Tochterunternehmen des BFI Wien, kümmert sich darum Schulen auszumalen, in Kindergärten verfliesen, Lehrlinge zu verköstigen, ... und vor allem darum, Ihnen die Möglichkeit zum Arbeiten und zum Lernen zu geben.

etc.

after the paragraphs on the left of page 21 come the headings of this page, and after that the paragraph on the right side of this page.

after that, the text of page 26 follows without a space.

philipp

christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

Yes, this is what I said in one of my mails: I recently discovered that objects can be grouped into “object streams”, which I was not processing so far.

This explains why I sometimes found references to objects in some PDF files without being able to find their definition: they were defined in object streams.

Now they are correctly processed. Of course, processing them unveils a lot of new information. For example, this explains the missing text paragraphs. But it also unveiled additional information, such as font information and page content descriptions.

This is the case with your sample: page information was not processed before version 1.2.30. Unfortunately, in your sample, this page information is presented in a format I was not aware of, so I have to rework this part of my code a little (well, to tell the truth, it will be a little bit hard…).

Before version 1.2.30, no page information was found in your sample (because it was contained in an object stream, which was not processed). In such situations, my class processes all the objects in the order in which they appear. This is why the text output was in the right order (but keep in mind that this is not the case for all samples…).
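
The fallback described here, using file order when no page information is available, can be sketched as follows. The function name `orderText()` and both array layouts are assumptions for illustration only, not the class's actual code.

```php
<?php
// Hypothetical sketch. $textByObject maps object numbers to extracted text,
// in the order the objects appear in the file. $pageMap maps page numbers to
// the object numbers drawn on that page (empty when no page info was found).
function orderText(array $textByObject, array $pageMap): array
{
    if ($pageMap === []) {
        // No page information: keep the order in which the objects appear.
        return array_values($textByObject);
    }
    ksort($pageMap); // visit pages in ascending page number
    $ordered = [];
    foreach ($pageMap as $objectIds) {
        foreach ($objectIds as $id) {
            if (isset($textByObject[$id])) {
                $ordered[] = $textByObject[$id];
            }
        }
    }
    return $ordered;
}

$text = [7 => 'text drawn on page two', 3 => 'text drawn on page one'];

print_r(orderText($text, []));                   // file order
print_r(orderText($text, [2 => [7], 1 => [3]])); // page order
```

The output order is only as good as the page map, so an incomplete or misread map can produce a worse order than the plain file-order fallback.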

Until I implement a new, correct way of processing page information, do you think it would be helpful if I added a flag for the $options parameter of the constructor that says “don’t process page information”, so that the class behaves as before regarding contents processing?

Christian.



phisu commented 8 years ago

hello christian, no, better handling of the page information is preferable. hopefully it will be available soon. philipp

Am 09.08.2016 11:30 schrieb "christian-vigh-phpclasses" < notifications@github.com>:

Hello Philipp,

Yes this is what I said in one of my mails : I recently discovered that objects could be grouped into “object streams”, that I was not processing so far.

This explained why sometimes I found references to objects in some PDF files without being able to find their definition : they were defined in object streams.

Now they are correctly processed. Of course, processing them unveils a lot of new information. For example, this explains missing text paragraphs. But it also unveiled additional information, such as font information and page contents descriptions.

This is the case with your sample : page information was not processed before version 1.2.30. Unfortunately, in your sample, this page information is presented in a format I was not aware of, so I have to rework a little bit this part of my code (well, to tell the truth, I will be a little bit hard…).

Before version 1.2.30, no page information was found in your sample (because it was contained in an object stream, which was unprocessed). In such situations, my class processes all the objects in the order they arrive. This is why the text output was in the right order (but keep in mind that this is not the case for all samples…).

Until I implement a new correct way of processing page information, do you think it could be helpful for you if I added a flag for the $options parameter of the constructor that says : “don’t process page information”, so that the class will behave as before regarding contents processing ?

Christian.


De : phisu [mailto:notifications@github.com] Envoyé : mardi 9 août 2016 09:55

À : christian-vigh-phpclasses/PdfToText Cc : christian-vigh-phpclasses; Comment Objet : Re: [christian-vigh-phpclasses/PdfToText] problems with german umlaut (#7)

hello Christian,

i tried to take a closer look on the extracted text of http://www.dse-wien.at/fileadmin/media/downloads/DSE-Kataloge/150701-DSE-Kat alog-verlinkt.pdf with the version 1.2.35 of your class. but it is hard to examine if the extracted text is complete because the outputted text is not in the same order like the pages of the pdf.

an example:

if find the text of the last paragraph of page 6

Positive Effekte der Beschäftigung in einem Sozialintegrativen Unternehmen zeigen sich auch im weiteren Erwerbsleben: Personen, die in Form von Transitarbeitsplätzen gefördert wurden, sind in den Jahren danach seltener und weniger lang von Arbeitslosigkeit betroffen. Auch das Einkommen der Geförderten fällt deutlich höher aus: Ehemalige Transitarbeitskräfte erzielen in den folgenden Jahren durchschnittlich ein um ein Drittel höheres Einkommen als vergleichbare ungeförderte Per- sonen. So leisten Sozialintegrative Unternehmen einen wichtigen Beitrag zur Armutsvermeidung und zur sozialen Inklusion.

after that the subheading of page 5

Sozialintegrative Unternehmen

then the text of page 16 then a subheading without a linebreak before (of which page i cannot examine, because there same subheadings on different pages around page 16)

Tätigkeitsfeld

then the text of page 21 but not the heading:

Die KÜMMEREI ist ein Projekt der Job-TransFair GmbH, einem Tochterunternehmen des BFI Wien, kümmert sich darum Schulen auszumalen, in Kindergärten verfliesen, Lehrlinge zu verköstigen, ... und vor allem darum, Ihnen die Möglichkeit zum Arbeiten und zum Lernen zu geben.

etc.

after the paragraphs on the left of the page 21 follows the headings of this page and after that the paragraph on the right side of this page.

after that it follows without a space the text of page 26.

philipp


christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

I have published version 1.2.39, in which I completely rewrote the method used to scan for page contents. Everything should now appear in order (or so I hope!).

Christian.

PS: There are still problems with text positioning (line breaks, extra spaces or no spaces at all) and sometimes improper character translation (the latter is due to the same font aliases being used for different objects in various parts of the document). These are to be fixed later.
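The message does not say how the rewritten scan works. As a hedged illustration only: in a PDF, the order in which objects occur in the file is unrelated to reading order, which is defined by the `/Kids` arrays of the page tree. The sketch below is written in Python rather than the class's PHP, and the `objects` dictionary with its object numbers is invented to stand in for parsed PDF objects. It walks such a tree depth-first to list content-stream object numbers in page order and compares that with plain file order.

```python
# Hypothetical, simplified model of parsed PDF objects keyed by object number.
# The numbers and layout are invented for illustration.
objects = {
    1: {"Type": "Pages", "Kids": [4, 2]},
    2: {"Type": "Pages", "Kids": [5, 3]},
    3: {"Type": "Page", "Contents": [30]},
    4: {"Type": "Page", "Contents": [12, 11]},
    5: {"Type": "Page", "Contents": [7]},
}


def contents_in_page_order(objects, root):
    """Return content-stream object numbers in reading order.

    Walks the page tree depth-first, following /Kids in array order,
    instead of relying on the order in which objects occur in the file.
    """
    node = objects[root]
    if node["Type"] == "Page":
        return list(node["Contents"])
    ordered = []
    for kid in node["Kids"]:
        ordered.extend(contents_in_page_order(objects, kid))
    return ordered


# Reading order, as defined by the page tree.
page_order = contents_in_page_order(objects, 1)

# Naive order: pages taken in ascending object number.
file_order = [
    n
    for key in sorted(objects)
    if objects[key]["Type"] == "Page"
    for n in objects[key]["Contents"]
]

print(page_order)
print(file_order)
```

In this toy document the two orders differ, which is the kind of shuffled output described in the report quoted below.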



phisu commented 8 years ago

hello Christian, i tested your new version 1.2.43: the output of https://www.wien.gv.at/umweltschutz/oekokauf/pdf/reinigung.pdf is empty.

the output of http://www.dse-wien.at/fileadmin/media/downloads/DSE-Kataloge/150701-DSE-Katalog-verlinkt.pdf consists almost entirely of chinese characters. in a previous version it was correct.

same for http://www.cleanclothes.at/media/common/uploads/download/cck-label-check/CCK-LabelCheck_screen.pdf

and same here: http://images.umweltberatung.at/htm/fisch-infobl-ernaehrung.pdf and here http://www.nachhaltigebeschaffung.at/sites/default/files/nb_eofg_leitfaden05032015_webversion.pdf

there are some problems in http://www.oekoevent.at/uploads/2010/09/factsheet_giveaways_2007.pdf, for example on page 1. output:

¾kologische Give-AWays fàr 6eranstaltungen

for

Ökologische Give-Aways für Veranstaltungen

philipp

christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

First of all, thanks for your quick feedback (which made me scratch my head for a few hours today…)! I will try to answer your different issues point by point:

1) Files “150701-DSE-Katalog.pdf” through “factsheet_giveaways_2007.pdf”:

a. The appearance of Chinese characters was a regression. I once tried to handle a sample that had been sent to me, which contained several paragraphs written in Middle Eastern and Far Eastern languages and had been printed using PrimoPdf. It used a weird way of specifying double-byte character codes: instead of providing hexadecimal representations, it gave plain-text characters that I first had to convert to their numeric counterparts. That worked, but I didn’t notice that it affected pdf samples like yours, which use real plain-text characters. I have temporarily disabled this feature until I find out how to distinguish plain text expressing character values to be converted to hex from plain text expressing real text…

b. I thought I had already explained this one, but maybe that was to another user! The bad mapping of characters (“¾kologische” instead of “Ökologische”) is due to the fact that your samples redefine the same font alias several times, using different character maps. Apparently, the aliases are redefined at page or object level, which my class does not handle yet (PdfToText currently handles font aliases at document level). For example, in file “150701-DSE-Katalog-verlinkt.pdf”, there is an alias for a font, named “/C0_1”, which has a correct correspondence for the “Ö” character; but there is a second alias with the same name, later in the file, which substitutes the same entry with “¾”. Being the last one, it wins over the first one… In my super bug tracking system (an Excel file), this is known as issue #69, and I’m still thinking about how to solve it.

2) File “reinigung.pdf”: this is a new issue (#91 in my bug tracking system). Debug mode shows me that a lot of objects in the pdf file have not been recognized; I guess that most of them are text objects, which explains why no text has been extracted.

I published a quick fix (version 1.2.44) which temporarily disables the feature described in issue 1.a), so that you will have no Chinese characters any more.
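One plausible reading of the regression in 1.a), offered as an assumption rather than a description of the actual PdfToText code: if a plain-text string is treated as a sequence of two-byte character codes, every pair of lowercase Latin letters lands in the CJK Unified Ideographs block (U+4E00 to U+9FFF), which would give the Chinese-looking output reported above. The helper names are hypothetical, the big-endian pairing is assumed, and Python is used instead of the class's PHP.

```python
def as_single_bytes(raw: bytes) -> str:
    """Interpret each byte as one character (Latin-1 used for simplicity)."""
    return raw.decode("latin-1")


def as_double_bytes(raw: bytes) -> str:
    """Interpret each pair of bytes as one big-endian 16-bit character code."""
    if len(raw) % 2:
        raw += b"\x00"  # pad odd-length input so it splits into pairs
    return "".join(
        chr(int.from_bytes(raw[i:i + 2], "big")) for i in range(0, len(raw), 2)
    )


def is_cjk(ch: str) -> bool:
    """True if the character lies in the CJK Unified Ideographs block."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


raw = b"gefunden"  # plain text as it might sit in a literal string

# Intended reading: eight Latin characters.
print(as_single_bytes(raw))

# Misreading: four double-byte codes, shown as code points to keep output ASCII.
misread = as_double_bytes(raw)
print([f"U+{ord(c):04X}" for c in misread])
print(all(is_cjk(c) for c in misread))
```

Telling the two cases apart is the hard part mentioned in 1.a): the bytes are identical, so the decision has to come from information outside the string itself, such as the font's declared encoding.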

For issue 1.b), you’ll have to wait a little bit, because of the potential complexity of the task…
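To illustrate issue 1.b), here is a minimal sketch of why a document-wide alias table picks the wrong character map and how per-page scoping avoids it. The data is invented: the alias `/C0_1` comes from the explanation above, but the character code `0x85` and the two-page layout are assumptions, and this is not PdfToText's implementation (it is Python, not PHP).

```python
# Each page declares its own fonts; the same alias is reused with different maps.
# The code 0x85 is a made-up character code used only for this example.
pages = [
    {"fonts": {"/C0_1": {0x85: "Ö"}}, "text": [("/C0_1", [0x85])]},
    {"fonts": {"/C0_1": {0x85: "¾"}}, "text": [("/C0_1", [0x85])]},
]


def decode(pages, per_page):
    """Decode every page, resolving font aliases per page or document-wide."""
    global_fonts = {}
    for page in pages:
        # Document-level table: a later definition overwrites an earlier one.
        global_fonts.update(page["fonts"])

    out = []
    for page in pages:
        fonts = page["fonts"] if per_page else global_fonts
        out.append("".join(
            fonts[alias].get(code, "?")
            for alias, codes in page["text"]
            for code in codes
        ))
    return out


document_level = decode(pages, per_page=False)  # last alias wins on every page
page_level = decode(pages, per_page=True)       # each page uses its own alias
```

With the document-level table, page 1 is decoded with page 2's map, so the “Ö” comes out as “¾”; with per-page resolution each page keeps its own mapping.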

Regarding issue 2), I will try to investigate this evening to see what’s happening. I think it reveals yet another way to encode PDF data.

With kind regards,

Christian.



christian-vigh-phpclasses commented 8 years ago

Hello Philipp,

I have some relatively good news about file reinigung.pdf; it was created by Microsoft Word 2010 and I didn’t expect such a mess!

Some objects are duplicated throughout the file; fortunately, the duplicated objects contain the same information, so this does not disturb my class. But it’s really curious that Microsoft generates PDF files of such poor quality.

However, the issue with file reinigung.pdf goes back a long way: I once received a sample where some text objects contained only header and footer information. I wrongly assumed that text objects contained either header/footer data or page contents, and discarded the ones containing header/footer information.

I discarded them because I didn’t know (and still do not know) how to handle header/footer information in a way that lets the developer manipulate it correctly through my API.

Of course, this approach did not work with file reinigung.pdf, because all of its text objects contain header/footer information AND page contents. They were all discarded for this reason, which explains why my class did not output anything.

I made a fix with version 1.2.45 which solves the issue.
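The fix itself is not shown in this thread. As a sketch of the difference between the two behaviours, assume that headers and footers are wrapped in `/Artifact <<...>> BDC ... EMC` marked-content sections, as tagged PDFs may do; whether reinigung.pdf or PdfToText relies on that marker is an assumption here. The function names are hypothetical and the code is Python, not the class's PHP. The old heuristic drops a whole text object as soon as it sees such a section, while a gentler approach removes only the section and keeps the rest.

```python
import re

# Matches one "/Artifact <<...>> BDC ... EMC" section (non-nested, simplified).
ARTIFACT = re.compile(r"/Artifact\s*<<.*?>>\s*BDC.*?EMC", re.DOTALL)


def extract_old(streams):
    """Old heuristic: skip any text object that contains header/footer data."""
    return [s for s in streams if not ARTIFACT.search(s)]


def extract_new(streams):
    """Keep every text object, removing only the header/footer sections."""
    return [ARTIFACT.sub("", s).strip() for s in streams]


# One invented content stream mixing a page header with body text.
streams = [
    "/Artifact <</Type /Pagination /Subtype /Header>> BDC "
    "BT (Page header) Tj ET EMC\n"
    "BT (Body text of the page) Tj ET",
]

print(extract_old(streams))
print(extract_new(streams))
```

When every text object mixes both kinds of content, the first function returns nothing at all, which matches the empty output originally reported for this file.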

You will notice, however, that there are still some spurious characters: this comes from bad character mapping due to the fact that the file redefines the same font aliases with different character maps (remember issue #69 in my previous reply?). Still working on this…

The reinigung.pdf sample was a good experience for me, because it has led me a step further towards handling header and footer information, which I’ll implement in the future.

With kind regards,

Christian.



christian-vigh-phpclasses commented 7 years ago

Hello Philipp,

My best wishes for 2017 !

I’m coming back to you because today I solved an issue that may concern the “German umlauts problem” you were facing.

I also extended the class to process additional image formats. Taking the file 150701-DSE-Katalog-verlinkt.pdf as an example, it is now able to extract more than 110 images, compared with the dozen or so that previous versions were able to extract. I still have problems with a few of them (especially those marked as grayscale images, which do not render correctly), but it’s a great step forward!

And finally, I added the ability to auto-extract images without keeping them in memory. Have a look at the README.md file for more information.

With kind regards,

Christian.



christian-vigh-phpclasses commented 7 years ago

Hello Philipp,

I finally solved the last problems (or so I hope) concerning this file (factsheet_giveaways_2007.pdf).

The last remaining problems could be found on the first page of the document:

Ökologische Give-Aways

für Veranstaltungen

where the “Ö” and the “ü” were replaced by some other characters.

The latest version 1.3.16 solves this issue.

Please feel free to contact me if you have any questions or issues.

With kind regards,

Christian.


From: phisu [mailto:notifications@github.com] Sent: Monday, 8 August 2016 11:16 To: christian-vigh-phpclasses/PdfToText Cc: christian-vigh-phpclasses; Comment Subject: Re: [christian-vigh-phpclasses/PdfToText] problems with german umlaut (#7)

hello Christian,

thanks a lot for your work! here is another example of the encoding issue. maybe it confirms your understanding of the unicode translation in pdf files. to me it's a strange example, because you can find correct output of german umlauts and wrong output of german umlauts and other letters in one line:

http://www.oekoevent.at/uploads/2010/09/factsheet_giveaways_2007.pdf on page 2 you can find the following paragraph:

Impressum Herausgeberin: Stadt Wien, Geschäftsgruppe Umwelt, 1082 Wien, Rathaus Autorinnen: Mag Henriette Gupfinger, Mag Andrea Ebner Österreichische Gesellschaft für Umwelt und Technik - ÖGUT 1020 Wien, Hollandstrasse 10/46

output:

Impressum Herausgeberin� Stadt Wien, 'eschäftsgruppe 5mwelt, 10�2 Wien, Rathaus Autorinnen� Mag a Henriette 'upfinger, Mag a Andrea %bner ¾sterreichische 'esellschaft fàr 5mwelt und Technik - ¾'5T 1020 Wien, Hollandstrasse 10���

i am using the following version of your class: [Version : 1.2.35] [Date : 2016/08/06]

philipp
