alexivkin / CUPS-PDF-to-PDF

CUPS-PDF with a patch to print text correctly
GNU General Public License v2.0

[Solved] Unsearchable Text wine #2

Open marlemion opened 7 years ago

marlemion commented 7 years ago

Hi, I have a problem reported already to wine and ghostscript, but maybe you have an idea, too:

https://bugs.winehq.org/show_bug.cgi?id=42739
https://sourceforge.net/p/ghostscript/discussion/5452/thread/44e4abf6/

It is a showstopper for creating text-searchable PDFs from within wine from random docx/msg documents.

Thanks for any hint!

alexivkin commented 7 years ago

I've reproduced that issue in wine, but could only surmise that office converts data to PS before sending it to CUPS, and the text is lost in the pre-conversion. Two things you could try:

  1. Print/save as pdf from a different editor, like Google Docs, LibreOffice, Office viewer etc.
  2. Insert debug statements into cups and recompile to understand the underlying issue better. Hopefully it can be fixed inside of cups.

marlemion commented 7 years ago

Hi, thanks for your reply. I did the following:

Installed wordviewer under wine and printed the test.docx with the cups-pdf driver: same result, Calibri is not searchable.

Printed with Word under wine and printed 'to a file' (aka created a ps file) using the cups-pdf driver. This file can be converted using ghostscript, which ends up with the same problem. I then fed this file to Acrobat. Acrobat created a pdf file which has exactly the same problems as the pdf file from ghostscript. So it seems to me that ghostscript is not the problem here (I spent quite some time fiddling around with fonts and ghostscript).

So, as you said, the problem seems to be that the text information gets lost somewhere in between Word->Wine->WinePS->CUPS->cups-pdf.ppd. I think I will try to install a different PS printer and print with that one. If the problem persists, it must be the underlying part (wine etc.); if not, it must be CUPS/cups-pdf.

I will post my results here.

PS: So I installed the Adobe generic PS driver under wine and printed 'to a file' using this driver. The output differed only slightly (see below). Still, the PS file is created by the WinePS driver and the issue persists. So it seems to me that the Wine guys have to dig into this issue.

--- test.prn    2017-04-04 09:49:48.477219020 +0200
+++ test_adobe.prn      2017-04-04 11:21:52.835828839 +0200
@@ -1,10 +1,9 @@
 %!PS-Adobe-3.0
-%cupsJobTicket: media=A4
-%cupsJobTicket: sides=one-sided
+%cupsJobTicket: media=Letter
 %cupsJobTicket: AP_D_InputSlot=
 %%Creator: Wine PostScript Driver
 %%Title: Microsoft Word - Test
-%%BoundingBox: 18 18 576 823
+%%BoundingBox: 18 7 593 784
 %%Pages: (atend)
 %%Orientation: Portrait
 %%EndComments
@@ -25,13 +24,9 @@
 %%EndProlog
 %%BeginSetup
 mark {
-%%BeginFeature: *PageSize A4
-<</PageSize[595 842]/ImagingBBox null>>setpagedevice
-%%EndFeature
-} stopped cleartomark
-mark {
-%%BeginFeature: *Duplex None
-<</Duplex false>>setpagedevice
+%%BeginFeature: *PageSize Letter
+
+    <</DeferredMediaSelection true /PageSize [612 792] /ImagingBBox null>> setpagedevice
 %%EndFeature
 } stopped cleartomark
 %%EndSetup
@@ -39,17 +34,17 @@
 %%BeginPageSetup
 /pgsave save def
 72 300 div 72 300 div scale
-75 3433 translate
+75 3268 translate
 1 -1 scale
 0 rotate
 %%EndPageSetup
-0.00 0.00 0.00 setrgbcolor
+0.00 setgray
 25 dict begin
  /FontName /Calibri def
  /Encoding 256 array 0 1 255{1 index exch /.notdef put} for def
  /PaintType 0 def
  /FontMatrix [1 2048 div 0 0 1 2048 div 0 0] def
- /FontBBox [-1030 -629 2540 1974] def
+ /FontBBox [-975 -397 2486 1950] def
  /FontType 1 def
  /Private 7 dict begin
   /RD {string currentfile exch readhexstring pop} def
@@ -67,16 +62,16 @@
  end
 currentdict end dup /FontName get exch definefont pop
 /Calibri findfont
-[46 0 0 -46 0 0]
+[43 0 0 -43 0 0]
 makefont setfont
 gsave
-0 0 moveto
-2480 0 rlineto
-0 3508 rlineto
--2480 0 rlineto
+108 0 moveto
+2331 0 rlineto
+0 3298 rlineto
+-2331 0 rlineto
 closepath
 clip
-220 264 moveto
+315 288 moveto
 %%glyph 0064
 /Calibri findfont dup
 /Private get begin
@@ -98,7 +93,7 @@
 ND
 end end
 /g0064 glyphshow
-242 264 moveto
+336 288 moveto
 %%glyph 011e
 /Calibri findfont dup
 /Private get begin
@@ -126,7 +121,7 @@
 ND
 end end
 /g011e glyphshow
-265 264 moveto
+356 288 moveto
 %%glyph 0190
 /Calibri findfont dup
 /Private get begin
@@ -163,7 +158,7 @@
 ND
 end end
 /g0190 glyphshow
-283 264 moveto
+373 288 moveto
 %%glyph 019a
 /Calibri findfont dup
 /Private get begin
@@ -194,15 +189,15 @@
 end end
 /g019a glyphshow
 grestore
-0.00 0.00 0.00 setrgbcolor
+0.00 setgray
 gsave
-0 0 moveto
-2480 0 rlineto
-0 3508 rlineto
--2480 0 rlineto
+108 0 moveto
+2331 0 rlineto
+0 3298 rlineto
+-2331 0 rlineto
 closepath
 clip
-298 264 moveto
+388 288 moveto
 %%glyph 0003
 /Calibri findfont dup
 /Private get begin
@@ -213,7 +208,7 @@
 end end
 /g0003 glyphshow
 grestore
-0.00 0.00 0.00 setrgbcolor
+0.00 setgray
 25 dict begin
  /FontName /ArialMT def
  /Encoding 256 array 0 1 255{1 index exch /.notdef put} for def
@@ -237,16 +232,16 @@
  end
 currentdict end dup /FontName get exch definefont pop
 /ArialMT findfont
-[46 0 0 -46 0 0]
+[43 0 0 -43 0 0]
 makefont setfont
 gsave
-0 0 moveto
-2480 0 rlineto
-0 3508 rlineto
--2480 0 rlineto
+108 0 moveto
+2331 0 rlineto
+0 3298 rlineto
+-2331 0 rlineto
 closepath
 clip
-220 368 moveto
+315 385 moveto
 %%glyph 0057
 /ArialMT findfont dup
 /Private get begin
@@ -263,7 +258,7 @@
 ND
 end end
 /t glyphshow
-233 368 moveto
+327 385 moveto
 %%glyph 0048
 /ArialMT findfont dup
 /Private get begin
@@ -283,7 +278,7 @@
 ND
 end end
 /e glyphshow
-259 368 moveto
+352 385 moveto
 %%glyph 0056
 /ArialMT findfont dup
 /Private get begin
@@ -307,18 +302,18 @@
 ND
 end end
 /s glyphshow
-282 368 moveto
+373 385 moveto
 /t glyphshow
 grestore
-0.00 0.00 0.00 setrgbcolor
+0.00 setgray
 gsave
-0 0 moveto
-2480 0 rlineto
-0 3508 rlineto
--2480 0 rlineto
+108 0 moveto
+2331 0 rlineto
+0 3298 rlineto
+-2331 0 rlineto
 closepath
 clip
-294 368 moveto
+384 385 moveto
 %%glyph 0003
 /ArialMT findfont dup
 /Private get begin

marlemion commented 7 years ago

So I investigated this issue further and came across a solution.

Apparently, the culprit lies within a missing feature rather than a bug.

When reading the output of wine while printing, I came across the line:

"postscript format 3.0 glyph names are currently unsupported"

So I dug into the source code and found the relevant distinction in download.c of dlls/wineps.drv/. There, the code distinguishes between the formats type 1, type 2 and type 3. Apparently, the code converts the truetype font to an intermediate postscript font. For that, it needs to know the names of the glyphs, and this is exactly where they are extracted.

However:

ttf2afm /usr/share/TTF/calibri.ttf > /dev/null

Warning: ttf2afm (file /usr/share/TTF/calibri.ttf): no names available in `post' table, print glyph names as indices

Bingo! So I understood that the culprit is the missing glyph names in the `post' table of the font itself. As long as there is no support in wine for these fonts, one has to convert the fonts so that they include the names. I found a solution here:

https://github.com/fontforge/designwithfontforge.com/issues/16

And adapted it for my needs:

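#! /usr/bin/env python
# Give every Unicode-mapped glyph a name from the Adobe Glyph List.
import fontforge
import sys

fontfile = sys.argv[1]

try:
    font = fontforge.open (fontfile)
except EnvironmentError:
    # The file could not be opened as a font.
    sys.exit (1)

# Iterating over a font yields its glyph names.
for glyph in font:
    if font[glyph].unicode != -1:
        font[glyph].glyphname = fontforge.nameFromUnicode (font[glyph].unicode, "Adobe Glyph List")

# Note: save() writes FontForge's own SFD format to this path.
font.save (fontfile)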

For every font having no names table, I executed this script (you need fontforge for it):

#!/bin/bash
for i in *; do if [[ $(ttf2afm "$i" 2>&1) == *"no names available"* ]]; then echo "$i" && above_script.py "$i" && echo; fi; done
exit 0
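For readers who prefer Python over the shell loop, the detection step can be sketched roughly as follows. This is only an illustration, not the script used above: it assumes ttf2afm is on the PATH, that the warning text matches the one quoted earlier, and that only files ending in .ttf are of interest. The names lacks_names and fonts_without_names are made up for this sketch.

```python
import subprocess
import sys
from pathlib import Path

# Text of the ttf2afm warning quoted above (assumed to be stable).
MARKER = "no names available"


def lacks_names(ttf2afm_output):
    """Return True if ttf2afm reported that the font has no glyph names."""
    return MARKER in ttf2afm_output


def fonts_without_names(folder):
    """Yield absolute paths of .ttf files in folder that ttf2afm flags."""
    for path in sorted(Path(folder).resolve().glob("*.ttf")):
        result = subprocess.run(
            ["ttf2afm", str(path)],
            capture_output=True, text=True, errors="replace",
        )
        # The shell loop merges stderr into stdout (2>&1); do the same here.
        if lacks_names(result.stdout + result.stderr):
            yield path


if __name__ == "__main__" and len(sys.argv) == 2:
    for font_path in fonts_without_names(sys.argv[1]):
        print(font_path)
```

Run it with the font folder as the only argument; it prints the fonts that would need the renaming script and changes nothing itself.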

That's it.

alexivkin commented 7 years ago

Wow! Thanks for the solution. I'll keep this issue open, so others can find it if needed.

sobuj53 commented 4 years ago

Hi, thanks for the script. But how do we use it? I'm a bit new to Linux, so I'd really appreciate it if you could describe the steps needed to use the script when printing a PDF file. Thank you very much.

Ph-St commented 2 years ago

I ran into this problem today, too. @marlemion is right in their diagnosis (many thanks!) but the solution didn't quite work for me, so I thought I'd add my experience for anyone else encountering this issue.

First, a smaller issue: ttf2afm is very sensitive about the way you supply the argument. If you are in the font folder and run "ttf2afm palatino-linotype-roman.ttf", it will reply "fatal: truetype fonts file `test' not found". You have to provide the full path to the font file.
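A check that avoids ttf2afm altogether could read the font's `post' table version directly. The sketch below assumes that the ttf2afm warning and wine's "format 3.0" message both correspond to a `post' table of version 3.0 (0x00030000), which stores no glyph names. It handles single-font sfnt files only, not .ttc collections, and the function names are made up for illustration.

```python
import struct
import sys

POST_FORMAT_3 = 0x00030000  # 'post' table version 3.0: no glyph names stored


def post_table_version(data):
    """Return the raw 32-bit version of the 'post' table, or None if absent."""
    # sfnt header: 4-byte version, then numTables as a big-endian uint16.
    num_tables = struct.unpack_from(">H", data, 4)[0]
    for i in range(num_tables):
        # Table records start at byte 12 and are 16 bytes each.
        tag, _checksum, offset, _length = struct.unpack_from(
            ">4sIII", data, 12 + 16 * i
        )
        if tag == b"post":
            return struct.unpack_from(">I", data, offset)[0]
    return None


def lacks_glyph_names(data):
    """True if the font's 'post' table is version 3.0."""
    return post_table_version(data) == POST_FORMAT_3


if __name__ == "__main__" and len(sys.argv) > 1:
    for name in sys.argv[1:]:
        with open(name, "rb") as handle:
            if lacks_glyph_names(handle.read()):
                print(name)
```

Because it opens the files itself, relative paths such as `python3 check_post.py *.ttf` work here (check_post.py is just a placeholder file name).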

Secondly, running the script the way @marlemion provided it on a ttf file changed the file type. While before "file palatino-linotype-roman.ttf" returned "TrueType Font data...", after running the script it returned "Spline Font Database". The result was that the font was broken in wine. This is perhaps due to changes to the default behavior of fontforge.

I'm not sure how to modify the python script to make it output a ttf (probably instead of font.save one would need font.generate with appropriate flags), but if you just have a few fonts to modify, it's possible to do it manually with fontforge. Just open fontforge, select the ttf file and set "Force glyph names to: Adobe Glyph List". Then go to File->Generate Fonts. Set the output type to "Truetype" and again set "Force glyph names to: Adobe Glyph List". Generate the font and the necessary tables are added.
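The font.generate idea mentioned above might look roughly like the following. This is an untested sketch, not a confirmed fix: it assumes that FontForge's generate() picks the output format from the file extension and that its default flags keep the glyph names in the generated file. It writes to a separate output path so the original font stays untouched. rename_glyphs and main are names made up for this sketch.

```python
#!/usr/bin/env python3
import sys


def rename_glyphs(font, name_for):
    """Name every Unicode-mapped glyph; return how many were renamed."""
    renamed = 0
    # Collect the glyph names first so renaming does not disturb iteration.
    for name in list(font):
        glyph = font[name]
        if glyph.unicode != -1:
            glyph.glyphname = name_for(glyph.unicode)
            renamed += 1
    return renamed


def main(src, dst):
    # Imported here so rename_glyphs can be tried without FontForge installed.
    import fontforge

    font = fontforge.open(src)
    rename_glyphs(
        font, lambda uni: fontforge.nameFromUnicode(uni, "Adobe Glyph List")
    )
    # generate() exports a font file (here .ttf), unlike save(), which
    # writes an SFD project file.
    font.generate(dst)
    font.close()


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

It would be called with an input and an output path, for example `python3 fixnames.py palatino-linotype-roman.ttf fixed.ttf` (fixnames.py is a placeholder name). Checking the result with "file fixed.ttf" before installing it into wine seems prudent.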