sanitize catcode of & at loading time?

blefloch / latex-unravel

Watching TeX digest tokens

sanitize catcode of & at loading time? #19

Closed jfbu closed 8 years ago

jfbu commented 8 years ago

Hi, I accidentally noticed that an unusual catcode for & at loading time

\documentclass{article}
\catcode`& 12
\usepackage{unravel}
\catcode`& 4
\begin{document}
\unravel{\romannumeral-`0 1}
\end{document}

compromised the functioning of the package. I hesitate to raise this as an issue, as I suspect it is an implicit assumption in LaTeX2e that packages are loaded under a standard catcode regime (I tried a few other characters, but finally settled on & as the culprit in my initial, somewhat convoluted, situation).

Best, Jean-François

blefloch commented 8 years ago

That's a bug indeed, as unravel should strive to work even when subject to some abuse.

It is due to expl3: the following breaks.

\catcode`&=12
\RequirePackage{expl3}
\catcode`&=4
\ExplSyntaxOn
\bool_if:nTF{ \c_true_bool && \c_false_bool } { } { }

Now, I'm not sure I would count it as a bug in expl3 because that package is not meant to sustain arbitrary abuse.

So I don't know where the best place to fix this is. Do you think unravel should simply sanitize all catcodes when it is loaded? Also, why do you need this? :)

jfbu commented 8 years ago

I haven't looked at unravel code, but if it uses tokens such as & as delimiters perhaps it should sanitize them. xint sanitizes all character tokens it needs. I regret LaTeX2e does not have a macro \standardcatcodes that package authors could put at the top of their .sty files, to be extra sure (with a suitable \restorecatcodes to be used at the end). I had to reinvent the wheel for xint.
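The save/sanitize/restore pattern described above can be sketched as follows. This is only a minimal illustration, not the actual code of xint or of HO's packages: the file name mypkg.sty and the macro \mypkgRestoreCatcodes are hypothetical, and the sketch assumes that backslash, braces, letters, digits, % and space already have their standard catcodes. Characters are addressed by number so that the code depends neither on the backquote nor on the characters being sanitized.

```latex
% mypkg.sty (hypothetical package name)
% Freeze the catcodes in force at load time: \edef expands each
% \the\catcode<n> now, so the saved values are plain digits.
\edef\mypkgRestoreCatcodes{%
  \catcode38 \the\catcode38 % character 38 is &
  \catcode94 \the\catcode94 % character 94 is ^
  \catcode126 \the\catcode126 % character 126 is ~
  \relax
}
% Impose the values the package code relies on.
\catcode38 4 % & : alignment tab
\catcode94 7 % ^ : superscript
\catcode126 13 % ~ : active
%
% ... package code goes here ...
%
% Put back whatever the user had before loading.
\mypkgRestoreCatcodes
\endinput
```

A real package would extend the list to every character it uses, which is why a shared \standardcatcodes facility would be convenient.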

I didn't need a catcode 12 &; it arose from a combination of factors which ended up bringing to light a bug at the start of the code of xintkernel.sty: it used an \aftergroup\endinput to abort its loading when loaded for the second time. In LaTeX this would not normally arise because the second \usepackage{xintkernel} would have been intercepted by the format, but the issue nevertheless occurred due to another chain of causes and effects going back to the fact that my dev version of bnumexpr did not identify itself correctly to LaTeX.

The \endgroup came from the expansion of a macro. All material following the \endgroup got executed despite the \endinput! I should have been aware of that... and this included the infamous setting up of catcodes. You can observe the bug with \input xintkernel.sty \input xintkernel.sty in Plain. I have fixed it for the next release.
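A generic illustration of this behaviour, independent of xintkernel's real code: \endinput only tells TeX to stop reading the file once the current line is finished, so anything still pending from a macro expansion, and the rest of the line, is executed anyway. The file name aftergroup-demo.tex and the macro \finish are hypothetical.

```latex
% aftergroup-demo.tex (hypothetical file), meant to be read with
%   \input aftergroup-demo
% from Plain TeX or LaTeX.
\def\finish{\endgroup \message{[still executed]}}
\begingroup
\aftergroup\endinput
% \finish expands to \endgroup followed by more tokens. Closing the
% group inserts \endinput right after \endgroup, but the remaining
% tokens of \finish and the rest of this line are still processed.
\finish \message{[also executed: same line]}
% This line is never read, because \endinput took effect at the end
% of the previous line.
\message{[never reached]}
```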

jfbu commented 8 years ago

When I said xint sanitizes catcodes, I naturally did not mean those of letters, digits, \, and %. Besides, I didn't reinvent the wheel: I initially more or less followed what I saw in HO's packages to handle being loaded in an uncontrolled catcode situation. xintkernel.sty does the job and sets up macros that the other components use (after an initial, cautious start à la HO). One could have imagined that LaTeX2e would have provided such a facility, but as xint runs under Plain too, that would not have spared the coding. Sorry for moving away from unravel: will LaTeX3 provide package authors with such sanitizing macros? Or will it be assumed that users do not change catcodes anyhow?

blefloch commented 8 years ago

The silly thing is that & is only used in a boolean expression run once an \unravel is done, to check if everything went well.

For loading packages, I just looked through what I thought might be good solutions (plainpkg and related packages), but they don't seem to be nearly as paranoid as they should be.

I propose that I write a package package (any better name welcome) used as follows:

\input package.sty\relax
% now catcodes are safe, and we give some information
% as a braced argument for a macro that looks beyond
% the end of package.sty
{
  package = xint ,
  version = 1.2.3b ,
  date = 2015/09/28 ,
  letters = { \@ \_ } ,
}
% code of xint
% ...
\input package.sty\relax
\endinput

where package.sty sets all catcodes as expected (well, the first 128, say) and announces the package through \ProvidesFile or \ProvidesPackage or \ProvidesClass. I can make it work as long as a few primitives have their original meaning, backslash has catcode 0, space has catcode 10 (otherwise \input package.sty\relax would fail anyway), most lowercase letters have their original catcode. My plan is to make package.sty only define internal commands, hence the need for loading it again at the end rather than provide a \EndOfPackage command to restore catcodes. The two uses are distinguished by the fact that the first is followed by a braced argument while the second is followed by \endinput or by the end of the file or by anything else, really (and I could let a braced argument {end} also denote the end).

One thing that worries me is making it work when loaded under \ExplSyntaxOn and also in a hypothetical LaTeX3 format where none of the primitives will be accessible under their original TeX name.

jfbu commented 8 years ago

Thus package will also handle avoiding double loading? Could it also recognize an etex=required key-value pair? And I presume things like math = {$,~}. For xint anyhow, the work was done with the idea of minimizing formalism: it can be loaded under the requirement that letters (lowercase and uppercase), digits, the % and the \ have their standard catcodes. I uploaded an xintkernel.sty fixing the \aftergroup\endinput bug to the bnumexpr/index.html page on my personal site.

As for the name, xpack comes to mind, but the package should really work under Plain alone, with nothing more needed.

Regarding your last sentence, this makes me shiver in anguish. I just can't visualize developing under such a dictatorial framework, which makes the original primitives inaccessible. If only that were for a TeX entirely rewritten in Lisp, OK, but if you know that underneath there is a suffering enslaved TeX, that's too much to ask for.

blefloch commented 8 years ago

Yes, I think it makes sense to have a package that takes care of:

- double loading
- making catcodes sane (and endlinechar etc), giving the ability to choose various schemes (document-level, package, expl3) and tweak things (if one wants particular chars to be private letters for instance)
- engine requirements (does this need eTeX? pdfTeX? XeTeX?)
- format requirements (can this code load under all engines? only LaTeX2e?)
- package requirements (before loading, or at some point before \begin{document} in LaTeX2e)
- more refined requirements concerning the existence of specific primitives (requiring that they are primitives, or simply that they are defined)
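Done by hand, the first and third items of this list might look like the sketch below. It is not taken from any existing package: the marker \mypkgLoaded and the file name are hypothetical, and the test assumes the engine exposes the e-TeX primitive \eTeXversion under that name. Each test is kept on a single line so that \endinput and its closing \fi are read together.

```latex
% Top of mypkg.sty (hypothetical package name).
% Engine requirement: \csname...\endcsname yields \relax when
% \eTeXversion is undefined, in which case we complain and stop.
\expandafter\ifx\csname eTeXversion\endcsname\relax \errmessage{mypkg requires e-TeX}\expandafter\endinput \fi
% Double-loading guard: if the marker already exists, stop reading
% this file at the end of the current line.
\expandafter\ifx\csname mypkgLoaded\endcsname\relax \else \expandafter\endinput \fi
% First load: define the marker so a second \input is skipped.
\expandafter\def\csname mypkgLoaded\endcsname{}
%
% ... package code goes here ...
\endinput
```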

Not sure I understand your comment about math = {$,~} (especially the ~). The package should work under iniTeX even.

Let me clarify what I meant by making the primitives inaccessible: they are available as \tex_input:D and the like. They cannot be used directly in a document without using \ExplSyntaxOn to switch to code catcodes. The reason is that primitives have names that would be perfectly good document-level names for other things (e.g. \box could draw a box, or whatever), and that users most often do not need primitives directly (even package writers typically don't need them that much either once a proper programming framework is available).
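As a small illustration under today's LaTeX2e, where both the document-level and the code-level names exist, one can check that \tex_relax:D is simply another name for the primitive \relax. This only shows the naming scheme; it does not reproduce a format in which the original names are absent.

```latex
\documentclass{article}
\usepackage{expl3} % harmless if expl3 is already in the format
\ExplSyntaxOn
% Under \ExplSyntaxOn, ":" and "_" are letters, so \tex_relax:D is a
% single control sequence; "~" stands for a space in code.
\cs_if_eq:NNTF \tex_relax:D \relax
  { \iow_term:n { same~meaning } }
  { \iow_term:n { different~meaning } }
\ExplSyntaxOff
\begin{document}
\end{document}
```

Since LaTeX2e keeps \relax as the primitive, the first branch is the one expected to run here.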

jfbu commented 8 years ago

Agreed with all your points about the needed functionalities for a package package.

math = {$, ~} was indeed very weird: it came to mind because for some reason related to \xintNewExpr, xint sets the catcode of ~ to 3 for the duration of its loading. My comment meant only that besides letter = { _,:, ?, !, ...} there could be the need for other = {^, &, ...}, etc... but that was implicit in your earlier comment and indeed one doesn't see much use for mathshift = {&} for example.

About names of primitives: the problem originates in bad practices of package authors in the distant (?) past, like, let's invent some imaginary examples, providing the user with non-prefixed macros such as \round, \question, \nombre, \convert, \ifequal, \ifthen etc...

The probability of a user overwriting one of the TeX primitives is low, and the risk is easy to document (don't use one of those framed in all reference books), and \newcommand already provides protection. On the other hand, the probability of packages coming into conflict is very high if they do things like the above. And they do. Now this can be cured by better ethics among package authors.

The logic behind making the primitives available only under new names forces the macro programmer to become a LaTeX3 programmer or vanish into nowhere land. Perhaps I misunderstand, but it seems the user under the LaTeX3 regime will not be authorized to load code written with the original TeX primitives?

blefloch commented 8 years ago

Ah, I understand your point about catcodes. I think I'd do something like catcodes = { standard , math = { ~ }, letter = { @ _ : } } with the catcode option "standard" applying first, then the following keys applying on top of that.
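Spelled out as raw assignments, such a key list might amount to the following. This is only a guess at the semantics: it assumes "standard" means LaTeX2e document-level defaults, that "math" means catcode 3 (math shift) and that "letter" means catcode 11. Only the characters named in the example are shown.

```latex
% Step 1, "standard": document-level defaults for the characters
% concerned (an actual implementation would reset many more).
\catcode`\~=13 % active
\catcode`\@=12 % other
\catcode`\_=8  % subscript
\catcode`\:=12 % other
% Step 2, "math = { ~ }": applied on top of the standard scheme.
\catcode`\~=3  % math shift
% Step 3, "letter = { @ _ : }": private letters for package code.
\catcode`\@=11
\catcode`\_=11
\catcode`\:=11
```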

No, the problem with primitives (or more generally standard commands) is that they clash with stuff that users want to redefine. Say, \box for \square (the wave equation is typically said "box phi equals zero" not "square phi equals zero"), or \span for the linear span, or \leaders for a variable containing a list of people, or two-letter commands which people may want to redefine to pretty much anything by accident: \cr (for a calligraphic "r"), \dp, \fi ($f_i$), \ht, \or, \wd. So I support the team decision of making primitives only accessible under their code-level name.

I don't think it is so easy to tell users not to define any of a list of 530 primitives (in pdfTeX), and more if you include things like \empty which every current TeX programmer expects to have available. I have some hope of semi-automatically converting current TeX packages to the new syntax. Of course that will leave code littered with TeX primitives under their new names, which is not terribly good style in expl3. At least such code would still be usable in LaTeX3.

blefloch commented 8 years ago

This issue is actually fixed by one of my recent commits. I'm closing the issue but will still reply to any comment here.