Copying and pasting identifiers in PDF output

texdraft commented 3 years ago

As is decently well known, plain TeX's “underscore” character (\_) is not a glyph at all but a rule, defined as

{\leavevmode \kern.06em \vbox{\hrule width.3em}}

It looks nice enough, but when copying and pasting text that has an underscore, it turns into a space. The usual way to fix this is to mess with text encoding, and I figured dealing with that would not be worth the trouble. However, I recently found out about a PDF feature called ActualText that allows specifying alternative text for an element of the page. See for example this question on TeX.SE.
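A minimal sketch of the idea for the plain TeX underscore, assuming pdfTeX in PDF mode (`\pdfliteral` is a pdfTeX primitive, so other engines would need a different mechanism). The rule is wrapped in a marked-content span whose ActualText is a literal underscore. How viewers treat ActualText on a span that contains only a rule and no glyphs may vary, so take this as a starting point rather than a tested fix:

```tex
% Assumes pdfTeX with PDF output; the rule itself is plain TeX's definition of \_.
\def\_{\leavevmode
  \pdfliteral direct{/Span <</ActualText(_)>> BDC}% open the span
  \kern.06em \vbox{\hrule width.3em}%               the visible "underscore"
  \pdfliteral direct{EMC}}%                         close the span
```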

Not only could this be used for making underscores copy as underscores, but it could also be used for identifiers formatted as TeX, since they can be typeset completely differently from how they appear in the C source code. Whether it would be worth doing for characters like ∧ and ¬ (to make them copy as && and !) is up for debate.

Is this capability something that belongs in CWEB, perhaps as an option (so that extra TeX would not be output if PDF is not the target)? Would it be desirable? If so, then I will implement it.

ascherer commented 3 years ago

It would also help with searching in the PDF files. I fear that the macro (?) code would bloat the \ifacro parts of cwebmac.tex some more. Maybe a separate macro file like pdfwebtocfront.tex might be an idea?

texdraft commented 3 years ago

(Searching is a much better use case than copying and pasting; I can't believe it slipped my mind.)

I think there are numerous solutions. First of all, \\ and \| can be changed only at the TeX level, without any modifications to CWEAVE. (Changing \| is necessary for the case of \|\_.) However, it might be easier to have the macros deal with unescaped underscores, so CWEAVE might be changed to output them verbatim; \\ and \| would take care of the escaping. No special treatment would be necessary for sanitizing names, since none of the delimiting characters in PDF syntax can appear in a C identifier, so no parsing is required (as far as I know).
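To illustrate the TeX-only route, here is a rough sketch that wraps \\ without touching CWEAVE. It assumes pdfTeX in PDF mode and that it is loaded after cwebmac.tex; `\origitalic` and `\actualtext` are names made up for this example. While the ActualText string is written, `\_` is locally redefined to expand to a bare underscore, so the escaped form that CWEAVE currently emits still produces clean text. Any other escapes that can occur in identifiers would need the same treatment, and \| would be wrapped analogously.

```tex
% Sketch only: \origitalic and \actualtext are hypothetical names.
\let\origitalic=\\  % keep cwebmac's meaning of \\

% Open a span whose ActualText is #1. Inside the group, \_ expands
% to a plain underscore character instead of the rule-drawing macro.
\def\actualtext#1{\begingroup
  \def\_{\string_}%
  \pdfliteral direct{/Span <</ActualText(#1)>> BDC}%
  \endgroup}

\def\\#1{\leavevmode
  \actualtext{#1}%          open the span with the sanitized name
  \origitalic{#1}%          typeset the identifier as before
  \pdfliteral direct{EMC}}% close the span
```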

For custom identifiers, CWEAVE could wrap them in a macro call to something like \CI (for “custom identifier”) that would look like this:

\CI{\skipxTeX}{skip_TeX} % or maybe skip\_TeX

In the output you would see \skipxTeX typeset, and skip_TeX would be the ActualText text. An alternative would be to require users who want this feature to add something to their custom identifier macro definitions that will insert the ActualText into the PDF.
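A possible definition of such a \CI macro, again only as a sketch that assumes pdfTeX in PDF mode and the unescaped form of the second argument (`skip_TeX`, not `skip\_TeX`). The macro `\skipxTeX` is taken to be defined by the user, as it would be for any custom identifier:

```tex
% #1: the macro that typesets the identifier
% #2: the identifier as it appears in the C source (unescaped)
\def\CI#1#2{\leavevmode
  \pdfliteral direct{/Span <</ActualText(#2)>> BDC}%
  #1%
  \pdfliteral direct{EMC}}

% Usage, as CWEAVE might emit it:
% \CI{\skipxTeX}{skip_TeX}
```

With the escaped form `skip\_TeX`, `\_` would have to be neutralized before the string is written, since `\pdfliteral` expands its argument.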

One thing that had me worried was the “granularity” of ActualText, but it turns out that you can apply it to pretty much any span of text, so it could capture an entire identifier.

Should CWEAVE's behavior be changed, a new control code could be added that allows specifying an identifier's ActualText text, although that would probably be overkill. (I can't imagine it being very useful.)

ascherer / cweb

Copying and pasting identifiers in PDF output #21