Text cannot be copied/extracted from neatpdf-generated files with OpenType fonts

vuori commented 3 years ago

PDF files generated by neatpdf look fine in a PDF viewer, but the text in them can be neither copied to the clipboard nor extracted with pdftotext when (some?) OpenType fonts are used. For example, the following input file:

.fp - M Minion-Pro-Regular
.fp - G Garamond-Premier-Pro
.ft R
R: “Vimperator Vjol” fi
.sp
.ft M
Minion Pro: “Vimperator Vjol” fi
.sp
.ft G
Garamond Premier Pro: “Vimperator Vjol” fi

converted to PDF which is then run through pdftotext results in the following output:

R: “Vimperator Vjol” fi
.JOJPO 1SP7JNQFSBUPS 7KPMw ĕ
(BSBNPOE 1SFNJFS 1SP7JNQFSBUPS 7KPMu ઔ

So text which uses a PostScript font (the default R) comes through fine, but text that uses the two Adobe OpenType fonts is garbled. I haven't tested with a traditional TrueType font yet.

It is possible to work around this by using neatpost and ps2pdf, but do you think this could be fixed in neatpdf?

Use case: I'm updating my CV and it looks great, but unfortunately some companies process applications with automated systems and a non-machine-readable PDF may be a problem.

aligrudi commented 3 years ago

vuori wrote:

> PDF files generated by metapdf look fine in a PDF viewer, but the

Do you mean neatpdf (Neatroff's PDF post-processor)?

> text in them cannot be copied to the clipboard nor extracted with pdftotext when (some?) OpenType fonts are used. For example the following input file:
>
> .fp - M Minion-Pro-Regular
> .fp - G Garamond-Premier-Pro
> .ft R
> R: “Vimperator Vjol” fi
> .sp
> .ft M
> Minion Pro: “Vimperator Vjol” fi
> .sp
> .ft G
> Garamond Premier Pro: “Vimperator Vjol” fi
>
> converted to PDF which is then run through pdftotext results in the following output:
>
> R: “Vimperator Vjol” fi
> .JOJPO 1SP7JNQFSBUPS 7KPMw ĕ
> (BSBNPOE 1SFNJFS 1SP7JNQFSBUPS 7KPMu ઔ
>
> So text which uses a Postscript font (the default R) comes through fine, text that uses the two Adobe OpenType fonts is garbled. I didn't test with a traditional TrueType font yet.

Does the output of both neatpost and neatpdf have this problem?

> It is possible to work around this by using metapost and ps2pdf, but do you think this could be fixed in metapdf?

Adobe's PDF Reference has a section on extracting text from PDF (§5.9). I have to examine how much work that requires.

Ali

vuori commented 3 years ago

Yes, sorry, it was getting pretty late and I kept writing "meta" when I meant "neat". The problem only occurs with neatpdf. PostScript output from neatpost has no problems.

The page's content stream in the PDF output by neatpdf starts like this:

/Times-Roman.0 10 Tf
1 0 0 1 72.00 780.00 Tm
[<321a> -250 (h6) 20 (IM) 10 (PERA) 10 (T) 10 (OR) -250 (6) 30 (JOLv) -250 (l)] TJ
/MinionPro-Regular 10 Tf
1 0 0 1 72.00 756.00 Tm
[<002e> 10 <004a004f004a0050> 10 <004f> -230 <0031> 10 <0053> 10 <0050001b> -230 <0069> 10 <0037> 50 <004a004e> 20 <0051> -10 <0046005300420055> 10 <0050> 10 <0053> -230 <0037> 50 <004b> -10 <0050> 10 <004d> 30 <0077> -230 <0115>] TJ
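
For what it's worth, the codes in the second TJ array look like the expected characters shifted down by a fixed amount: <002e> is "." where "M" (0x4d) was typed, <004a> is "J" where "i" (0x69) was typed, and so on. Below is a minimal Python sketch that checks this. The offset of 31 is only what this one sample suggests for its ASCII letters and is not a general property of OpenType fonts; the -200 threshold for treating a kerning adjustment as a word space is likewise an assumption made for illustration.

```python
import re

# Operand of the second TJ operator, copied from the content stream above.
TJ_ARRAY = (
    "[<002e> 10 <004a004f004a0050> 10 <004f> -230 <0031> 10 <0053> 10 "
    "<0050001b> -230 <0069> 10 <0037> 50 <004a004e> 20 <0051> -10 "
    "<0046005300420055> 10 <0050> 10 <0053> -230 <0037> 50 <004b> -10 "
    "<0050> 10 <004d> 30 <0077> -230 <0115>] TJ"
)

OFFSET = 31      # assumption: inferred from this sample only
WORD_GAP = -200  # assumption: adjustments at or below this act as a word space


def decode(tj, offset):
    """Shift every 2-byte code by `offset`; show non-ASCII results as [code]."""
    out = []
    for hexstr, num in re.findall(r"<([0-9a-fA-F]+)>|(-?\d+)", tj):
        if hexstr:
            # Identity-H codes are two bytes each, i.e. four hex digits.
            for i in range(0, len(hexstr), 4):
                code = int(hexstr[i:i + 4], 16)
                ch = chr(code + offset)
                if ch.isascii() and ch.isprintable():
                    out.append(ch)
                else:
                    out.append(f"[{code}]")
        elif int(num) <= WORD_GAP:
            # A large negative adjustment widens the gap between glyphs.
            out.append(" ")
    return "".join(out)


print(decode(TJ_ARRAY, OFFSET))
# Minion Pro: [105]Vimperator Vjol[119] [277]
```

The three codes left in brackets (105, 119 and 277) line up with the opening quote, the closing quote and the fi ligature in the input, which do not follow the ASCII offset. That is consistent with the codes being glyph indices that a viewer misreads as character codes, though the sample alone does not prove it.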

Since the Identity-H mapping is used for the OpenType fonts, I guess the operands of the second TJ operator are CIDs? Maybe the problem is the lack of a "ToUnicode CMap" (PDF spec 9.10.3), as described here: https://tex.stackexchange.com/questions/526157/what-is-identity-h-encoding-should-it-be-avoided-and-if-so-how ?
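
To make the ToUnicode idea concrete, here is a rough Python sketch of what such a CMap could contain for a few of the codes above. It is not how neatpdf works or would have to work: `to_unicode_cmap` is a hypothetical helper, and the code-to-text pairs are inferred from the sample input line rather than read from the font. The layout follows the usual ToUnicode CMap form from the PDF specification, with targets written as UTF-16BE and `bfchar` sections capped at 100 entries each.

```python
def to_unicode_cmap(code_to_text):
    """Build ToUnicode CMap text for 2-byte codes (hypothetical helper)."""
    entries = [
        f"<{code:04X}> <{text.encode('utf-16-be').hex().upper()}>"
        for code, text in sorted(code_to_text.items())
    ]
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    # Each bfchar section may hold at most 100 entries.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        lines.extend(chunk)
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


# Assumed pairs, inferred from "Minion Pro: “Vimperator Vjol” fi" above.
sample = {
    0x002E: "M",
    0x004A: "i",
    0x0069: "\u201c",  # opening double quote
    0x0077: "\u201d",  # closing double quote
    0x0115: "fi",      # ligature glyph mapped back to two characters
}
print(to_unicode_cmap(sample))
```

The resulting text would go into a stream object referenced by a /ToUnicode entry in the Type 0 font dictionary. Where the real code-to-Unicode table would come from (the font's own cmap table or the Neatroff font description) is left open here.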

aligrudi / neatroff_make

Text cannot be copied/extracted from neatpdf-generated files with OpenType fonts #7