Open vuori opened 3 years ago
vuori notifications@github.com wrote:

> PDF files generated by metapdf look fine in a PDF viewer, but the

Do you mean neatpdf (Neatroff's PDF post-processor)?

> text in them cannot be copied to the clipboard nor extracted with
> pdftotext when (some?) OpenType fonts are used. For example the
> following input file:
>
> .fp - M Minion-Pro-Regular
> .fp - G Garamond-Premier-Pro
> .ft R
> R: “Vimperator Vjol” fi
> .sp
> .ft M
> Minion Pro: “Vimperator Vjol” fi
> .sp
> .ft G
> Garamond Premier Pro: “Vimperator Vjol” fi
>
> converted to PDF which is then run through pdftotext results in the
> following output:
>
> R: “Vimperator Vjol” fi
> .JOJPO 1SP7JNQFSBUPS 7KPMw ĕ
> (BSBNPOE 1SFNJFS 1SP7JNQFSBUPS 7KPMu ઔ
>
> So text which uses a Postscript font (the default R) comes through
> fine, text that uses the two Adobe OpenType fonts is garbled. I
> didn't test with a traditional TrueType font yet.

Does the output of both neatpost and neatpdf have this problem?

> It is possible to work around this by using metapost and ps2pdf, but
> do you think this could be fixed in metapdf?

Adobe's PDF Reference has a section on extracting text from PDF (§5.9). I have to examine how much work that requires.

Ali
Yes, sorry, it was getting pretty late and I kept writing "meta" when I meant "neat". The problem only occurs with neatpdf. Postscript output from neatpost has no problems.
The page object in the PDF output by neatpdf starts like this:
/Times-Roman.0 10 Tf
1 0 0 1 72.00 780.00 Tm
[<321a> -250 (h6) 20 (IM) 10 (PERA) 10 (T) 10 (OR) -250 (6) 30 (JOLv) -250 (l)] TJ
/MinionPro-Regular 10 Tf
1 0 0 1 72.00 756.00 Tm
[<002e> 10 <004a004f004a0050> 10 <004f> -230 <0031> 10 <0053> 10 <0050001b> -230 <0069> 10 <0037> 50 <004a004e> 20 <0051> -10 <0046005300420055> 10 <0050> 10 <0053> -230 <0037> 50 <004b> -10 <0050> 10 <004d> 30 <0077> -230 <0115>] TJ
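To make the dump easier to read, here is a small Python sketch (not part of neatpdf) that pulls the two-byte codes out of the second TJ array. It prints them twice: once read directly as Unicode code points, which closely resembles the garbled pdftotext output, and once shifted by 0x1F. The shift is only something observed in this one sample for the basic Latin letters; it is assumed to reflect this font's glyph order and is not a general rule. The helper names are made up for illustration.

```python
import re

# Second TJ array from the page object above (the MinionPro-Regular line).
TJ_ARRAY = (
    "[<002e> 10 <004a004f004a0050> 10 <004f> -230 <0031> 10 <0053> 10 "
    "<0050001b> -230 <0069> 10 <0037> 50 <004a004e> 20 <0051> -10 "
    "<0046005300420055> 10 <0050> 10 <0053> -230 <0037> 50 <004b> -10 "
    "<0050> 10 <004d> 30 <0077> -230 <0115>]"
)

# Assumption: offset observed in this sample only, not a property of the format.
OBSERVED_OFFSET = 0x1F


def two_byte_codes(tj_array):
    """Return the 2-byte codes found in the hex strings of a TJ array."""
    codes = []
    for hex_string in re.findall(r"<([0-9A-Fa-f]+)>", tj_array):
        for i in range(0, len(hex_string), 4):
            codes.append(int(hex_string[i:i + 4], 16))
    return codes


def guess_ascii(code):
    """Shift a code by the observed offset; '?' if the result is not printable ASCII."""
    shifted = code + OBSERVED_OFFSET
    return chr(shifted) if 0x20 <= shifted < 0x7F else "?"


codes = two_byte_codes(TJ_ARRAY)

# Reading each code as a Unicode code point, with no mapping applied.
as_code_points = "".join(chr(c) for c in codes)

# Applying the offset observed in this sample.
guessed = "".join(guess_ascii(c) for c in codes)

print(repr(as_code_points))
print(guessed)
```

Word spaces do not show up in either string because the TJ array produces them with the -230 / -250 adjustments rather than with a space glyph.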
Since Identity-H mapping is being used for the OpenType fonts, I guess the arguments to the second TJ command are CIDs? Maybe the problem is the lack of a "ToUnicode CMap" (PDF spec 9.10.3), as described here: https://tex.stackexchange.com/questions/526157/what-is-identity-h-encoding-should-it-be-avoided-and-if-so-how
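For reference, a ToUnicode CMap is a short text stream that maps each code used in the content stream to a UTF-16BE string. The Python sketch below generates one for a hand-written mapping. The function name and the sample mapping are hypothetical: the first three entries use the offset guessed from the dump above, and mapping <0115> to "fi" is an assumption based on where that glyph appears. How neatpdf would collect the real glyph-to-Unicode data from the font is not covered here.

```python
def to_unicode_cmap(glyph_to_text):
    """Build the text of a ToUnicode CMap for 2-byte (Identity-H) codes.

    glyph_to_text maps a 2-byte code to the Unicode string it represents.
    """
    entries = sorted(glyph_to_text.items())
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    # A single bfchar section may hold at most 100 entries, so emit in chunks.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            utf16 = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:04X}> <{utf16}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical sample mapping, inferred from the dump above (an assumption).
SAMPLE = {0x002E: "M", 0x004A: "i", 0x004F: "n", 0x0115: "fi"}

cmap_text = to_unicode_cmap(SAMPLE)
print(cmap_text)
```

The resulting text would be stored as a stream object and referenced from the Type 0 font dictionary through its /ToUnicode entry, so that viewers and pdftotext can translate the codes back to text.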
PDF files generated by neatpdf look fine in a PDF viewer, but the text in them cannot be copied to the clipboard or extracted with pdftotext when (some?) OpenType fonts are used. For example, the following input file:

.fp - M Minion-Pro-Regular
.fp - G Garamond-Premier-Pro
.ft R
R: “Vimperator Vjol” fi
.sp
.ft M
Minion Pro: “Vimperator Vjol” fi
.sp
.ft G
Garamond Premier Pro: “Vimperator Vjol” fi

converted to PDF and then run through pdftotext, results in the following output:

R: “Vimperator Vjol” fi
.JOJPO 1SP7JNQFSBUPS 7KPMw ĕ
(BSBNPOE 1SFNJFS 1SP7JNQFSBUPS 7KPMu ઔ

So text that uses a PostScript font (the default R) comes through fine, while text that uses the two Adobe OpenType fonts is garbled. I haven't tested with a traditional TrueType font yet.

It is possible to work around this by using neatpost and ps2pdf, but do you think this could be fixed in neatpdf?

Use case: I'm updating my CV and it looks great, but unfortunately some companies process applications with automated systems, and a non-machine-readable PDF may be a problem.