Accurate source reference locators

kohlhase commented 14 years ago

[Originally Ticket 1425]

I have been accused of only issuing tickets that are impossible to fix; here is one that should be simpler.

latexml offers the very nice feature of #locator that allows references back into the source. Currently, it offers backlinks of the form "at foo.tex; line 23 col 78". It would be much nicer, if we used XPointer syntax here: file.tex#textpoint(lineno,colno) or even better file.tex#textrange(startlineno,startcolno,endlineno,endcolno), if the end is also known. This would be exceedingly useful.

BTW, if latexml reads from standard input, then it generates at Anonmymous String; line 23 col 78, which contains a typo.

clange commented 14 years ago

Replying to #101 @kohlhase:

used XPointer syntax here: file.tex#textpoint(lineno,colno) or even better file.tex#textrange(startlineno,startcolno,endlineno,endcolno), if the end is also known.

Great, I like this. BTW, I think within one XPointer scheme (such as textpoint or textrange here), one has a lot of syntactic freedom; compare the existing standard XPointer schemes. So I wouldn't recommend textrange(startlineno,startcolno,endlineno,endcolno), as it is too easy to mix the arguments up, but rather something more structured, e.g. textrange(from=startlineno;startcolno,to=endlineno;endcolno).
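
To make the two proposed syntaxes concrete, here is a minimal Python sketch (not LaTeXML code) that formats and parses them. The scheme names textpoint and textrange and the from=/to= layout are taken from this ticket; the function names are hypothetical.

```python
import re

# Patterns for the two locator forms proposed in this ticket.
_RANGE = re.compile(r"^(.*)#textrange\(from=(\d+);(\d+),to=(\d+);(\d+)\)$")
_POINT = re.compile(r"^(.*)#textpoint\((\d+),(\d+)\)$")


def format_locator(path, start, end=None):
    """Build a locator string from (line, col) pairs; end is optional."""
    if end is None:
        return f"{path}#textpoint({start[0]},{start[1]})"
    return f"{path}#textrange(from={start[0]};{start[1]},to={end[0]};{end[1]})"


def parse_locator(locator):
    """Return (path, start, end); end is None for a textpoint."""
    m = _RANGE.match(locator)
    if m:
        path, l1, c1, l2, c2 = m.groups()
        return (path, (int(l1), int(c1)), (int(l2), int(c2)))
    m = _POINT.match(locator)
    if m:
        path, l1, c1 = m.groups()
        return (path, (int(l1), int(c1)), None)
    raise ValueError(f"not a recognised locator: {locator!r}")
```

Keeping from= and to= as named parts means a reader (or a parser) cannot silently swap a line number for a column number, which is the concern raised above.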

kohlhase commented 14 years ago

correcting cc to include Deyan, who is most likely to fix this (@DEYAN, could you please?)

There is a prerequisite explanation in #94, where Bruce gave the relevant hints on how to get the end marker. I would really like this change for sTeX and the arXMLiv branch.

kohlhase commented 14 years ago

There has been renewed interest in this from the sTeX side, where we are starting to look more into workflows and reporting errors, ... There have been discussions at the sTeX TRAC, see #147 and #173 there.

This is a relatively low-hanging fruit, we should probably just develop this in the arXMLiv branch and submit to Bruce.

dginev commented 14 years ago

So far I have fixed the typo and switched the message to XPointer style (at Mouth.pm, sub getLocator)

These changes are committed to the arXMLiv branch and are indeed trivial to implement.

I believe the "textrange" as opposed to "textpoint" locators should be implemented in the bindings as they are applicable only for environments and we can estimate them via the #locator and #trailer properties (as hinted in #94).

I also tried to investigate how one can find text ranges for regular command sequences, but I am yet to understand the Mouth acrobatics properly.

dginev commented 14 years ago

textrange has been implemented and is now the default locator message at Mouth.pm at the branch
I have handled environments in the sTeX bindings, via the numberIt and locateIt subroutines, more on that at the sTeX tickets.

@Bruce: There is the outstanding deficiency with getting the beginning of a command sequence, since the Mouth module has no bookkeeping on those. The problem is that each previous line could be of arbitrary length and it becomes quite impossible to track the start of the token unless we have saved it at the moment of reading (that's probably a good upgrade).

WARNING: The new XPointers that are produced at the branch are not 100% accurate as they are very prone to whitespace offsets. LaTeXML is cleaning a lot of white spaces while parsing and is not rigorous in keeping track of that while counting with some of the character variables, which leads to the occasional offset. And as I warned above, any command sequence written on multiple lines is guaranteed to be erroneously located at its start right now. The "to=" part of the range is pretty reliable.

kohlhase commented 14 years ago

this is very good news (for those people who are using the branch). Is there anything that you can say about the inaccuracies? E.g. do we know that the from point always under-estimates?

kohlhase commented 14 years ago

I think we should close this issue when we are satisfied with the testing and spin off the contents of WARNING as a separate issue for the (newly made) version "arXMLiv-branch".

dginev commented 14 years ago

The way I have implemented this currently, the "from" part of the pointer will:

Overestimate the start (range starts after the actual \command) with some multi-line commands, since we currently only look one line up and have no way of knowing the length of previous lines.
Underestimate the start (range starts before the actual \command) when there are extra whitespace/endofline deposited before the start of the command sequence.

The "to" part is quite accurate in principle, although Bruce had warned that it is also not completely reliable, though I am not sure whether the reasons are the whitespace offsets.

All of these can in principle be sharpened by carefully tuning the updates of the location variables at the different subroutines in Mouth.pm, maybe by adding a little infrastructure to trace line numbers.

brucemiller commented 14 years ago

I'm wondering if and what might be worth integrating into the core. I probably don't mind switching to a pointer based locator;

I'm less interested in explicitly adding locator attributes to the generated elements... but conceivably there could be a switch that would add them automatically to all created elements?

dginev commented 14 years ago

Replying to comment 10 @brucemiller:

I'm wondering if and what might be worth integrating into the core. I probably don't mind switching to a pointer based locator;

Yes, there seem to be a lot of nice applications that can build on such a scheme.

I'm less interested in explicitly adding locator attributes to the generated elements... but conceivably there could be a switch that would add them automatically to all created elements?

While introducing all kinds of "fancy" features on the branch, I started developing an allergy for new command-line --switches, simply because they become way too many to feasibly support (or even remember). However, it is true that one would prefer to be able to turn these things on and off. But if you give people choice, they start wanting more choice...

brucemiller commented 13 years ago

I looked at Mouth.pm in the arXMLiv branch, but I can't figure out what the from part of textrange is even supposed to represent.

dginev commented 13 years ago

Forgive my hacky code :-) Ideally, for any \foo{something or the other} command, the from part would match the position of the \, i.e. the beginning of the command sequence, while the to would try to match the end, just as you would intuitively think about it.

The code on the branch is currently trying to do some very weird bootstrapping and approximations, so that it is relatively accurate, as the locator mechanism was somehow flawed at places. Please let me know if you want me to elaborate in detail on the specific code snippets.

brucemiller commented 13 years ago

that's what I suspected, but.... well it doesn't do that! :> I'm looking at the diff between your Mouth.pm and mine. Basically, the block from lines 100--115. So $lstart is either the same line or the previous; OK. But $cstart is either $c-$nc or $nc-$c ? How would that point to the beginning of the token? What you're after is basically $c-1, unless it's a control sequence (or other special cases), then $c-<length of previous token>.

The problem is that getLocator is simply telling you where the parser is; it says nothing about the previous token, let alone where the previous token started. So, really, you need to modify all entry points like readToken to record when they start reading a token.

dginev commented 13 years ago

then $c-<length of previous token>.

That is what the "$c - $nc" or "$nc - $c" are naively approximating in my code :> I know it is nowhere near correct, but it was the best early patch I could provide, so that there was something somewhat functional.

The problem is that getLocator is simply telling you where the parser _is_; it says nothing about the previous token, let alone where the previous token started. So, really, you need to modify all entry points like readToken to record when they _start_ reading a token.

Yes, I understand that. Which does not mean that I understand what the best way of implementing the changes is :> Would you be interested in trying to add this as a feature?

brucemiller commented 13 years ago

Maybe I'm just completely overlooking how this works, but this $c-$nc looks like random numbers to me. Can you give an example case where you get something meaningful?

And in any case, I don't see how this can give you the information that I suspect you're really looking for; I suspect this range is more ill-defined than you think it is. What kinds of objects do you want the "range" for, and how will you find the beginning pos of the 1st token and last pos of the last token that contributed to that object?

dginev commented 13 years ago

So, $c is where the parser IS, $nc is how many characters it read. Hence,

If $c>$nc then the reading began on this line, at position $c-$nc.
If $c<$nc, oops that's tricky. Reading started before this line, but when and where?
- In my code, I pretend it always starts on the previous line, at some bizarrely random point $nc-$c, which is completely wrong, but there for the lack of better ideas.

What kinds of objects do you want the "range" for,
 and how will you find the beginning pos of the 1st token and
 last pos of the last token that contributed to that object?

In principle, the more fine-grained, the better. And I have no idea how to find the beginning tokens, unless I have a detailed record of what has been parsed and what lengths the lines had. Hence my hacky code. I am not an expert in Mouth.pm, but if we add a bit more memory to the parser, we might be able to rework this to get some precise digits.

The type of objects we are interested in are naturally any and all command sequences.

brucemiller commented 13 years ago

$nc is the number of characters in the current line.

if $c > $nc then the last token that was read ended the line
if $c < $nc, then there are still more characters to read in the current line.

If you want to know the span for each token, then you'll need (at least) to modify Mouth->readToken to record the $$self{colno} (and maybe $$self{lineno}) in effect before ->getNextChar. Then you'll need to hope that those values are still meaningful by the time you ask for the locator. Or rather, the locator probably just tells you where the most recently read token started.
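
As a rough illustration of "record the position before reading", here is a toy Python tokenizer. It is a sketch under simplifying assumptions, not LaTeXML's Mouth: the class and method names are hypothetical, catcodes are ignored, and spaces after control words are returned as ordinary tokens rather than skipped.

```python
class ToyMouth:
    """Toy reader that remembers where the most recently read token started."""

    def __init__(self, text):
        self.lines = text.split("\n")
        self.lineno = 0          # 0-based index into self.lines
        self.colno = 0           # 0-based column of the next unread character
        self.token_start = None  # 1-based (line, col) of the last token read

    def read_token(self):
        # Move to the next line that still has unread characters.
        while self.lineno < len(self.lines) and self.colno >= len(self.lines[self.lineno]):
            self.lineno += 1
            self.colno = 0
        if self.lineno >= len(self.lines):
            return None
        line = self.lines[self.lineno]
        # Record the start BEFORE consuming any character of the token.
        self.token_start = (self.lineno + 1, self.colno + 1)
        ch = line[self.colno]
        self.colno += 1
        if ch == "\\":
            begin = self.colno
            while self.colno < len(line) and line[self.colno].isalpha():
                self.colno += 1
            if self.colno == begin and self.colno < len(line):
                self.colno += 1  # control symbol such as \%
            return "\\" + line[begin:self.colno]
        return ch


def tokens_with_starts(text):
    """Return [(token, (line, col))] for every token in text."""
    mouth, out = ToyMouth(text), []
    while (tok := mouth.read_token()) is not None:
        out.append((tok, mouth.token_start))
    return out
```

Note that token_start only ever describes the most recently read token, which is exactly the limitation described above: by the time something asks for the locator, the reader may already have moved on.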

dginev commented 13 years ago

Well, if we can find where it started, know all the line lengths and know the token itself, we can calculate its end from its length. That was my original idea at least.

Sorry for misunderstanding what $nc represents.

brucemiller commented 13 years ago

Um, we're talking about TeX, here! :> The locator (for Mouth, at least) is essentially the point from which the next reading will take place.

You might think you can just take the locator before & after for the range, but watch out for places that TeX skips chars, eg. spaces at beginning of lines.

And you might think you can take the length of the token, but watch for the above, and places where chars are encoded with ^^
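
A small Python sketch of the ^^ pitfall: one character of the token can occupy three or four columns of source. This is a simplified reading of TeX's rule (two lowercase hex digits, or a single character shifted by 64) that assumes ^ has its usual superscript catcode; decode_line is a hypothetical helper, not part of LaTeXML.

```python
HEX = "0123456789abcdef"


def decode_line(line):
    """Return [(char, start_col, width)] with 1-based source columns."""
    out = []
    i = 0
    while i < len(line):
        if line.startswith("^^", i) and i + 2 < len(line):
            a = line[i + 2]
            b = line[i + 3] if i + 3 < len(line) else ""
            if a in HEX and b and b in HEX:
                # ^^xy with two lowercase hex digits: 4 source columns, 1 char
                out.append((chr(int(a + b, 16)), i + 1, 4))
                i += 4
                continue
            code = ord(a)
            if code < 128:
                # ^^c: character code shifted by 64: 3 source columns, 1 char
                out.append((chr(code + 64 if code < 64 else code - 64), i + 1, 3))
                i += 3
                continue
        out.append((line[i], i + 1, 1))
        i += 1
    return out
```

For the input a^^41b the decoded text is aAb (3 characters), but the source line is 6 columns wide, so "start column plus token length" lands in the wrong place.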

dginev commented 13 years ago

Hm, well why not do the bookkeeping after the chars are skipped? The final value of $nc had these things reflected if I remember correctly.

And I think an offset that falls into skipped material is something we can work with, so even getting there would be great.

brucemiller commented 13 years ago

You'll just have to read the code, the char skipping is often buried down a level or two which makes it hard to account for. I don't understand your comment about $nc, since it was always just the length of the line.

Anyway, I can't help but think this modification of locator will end up misguided. Currently, the locator records a point at or near the place of interest, so when it records the source position related to an xml construct, for example, since it is only a "point", it is clear that it is unclear.

With your proposed modification (as I imagine it), at most, the locator will evolve to "the span of the most recently read token". It will almost never be the actual span related to any construct of interest (which typically will involve multiple tokens), so it will just seem more confusing.

brucemiller commented 13 years ago

I know that this feature request sounds reasonable on the surface, and I can sympathise with the potential application, but it really seems ill-defined to me.

If the request is to change the syntax of the locator to XPointer's textpoint(line,col), that is easy enough and can be accommodated.

However, to turn it into a range needs to answer what range. At the Mouth level, the most you can do is the most recently read Token (which is easy enough to do in the application, anyway, so it isn't really worth carrying all that extra annotation around). Any higher level range seems poorly defined. The TeX to XML correspondence fails to be 1-1 often enough. And from an implementation point of view, you never know what thing you want, so that you can find where it started until it's way too late.

So... can we close this?

kohlhase commented 13 years ago

I think that Deyan made some implementation of this in the arXMLiv branch, which is accurate enough for me in sTeX to be extremely useful. I never encountered it being off; but Deyan says the sTeX code is regular enough. I know that Deyan wanted to improve things, and I would be very happy to have good back-references for the arXiv, since we could base author-oriented services on this.

From my side we can close this ticket.

brucemiller commented 13 years ago

Sorry that I'm a bit confused by the response; what you're wanting, how much, and why.

Ultimately, the locator is used in all the messages to give an indicator where a problem occurred, what triggered it, where things were defined, so they come in slightly different forms and are all "english" oriented. I think it would be relatively painful to debug latexml (as if it isn't painful enough already) trying to sort out xpointer expressions as well. And as has been pointed out elsewhere, straying too far from the latex-way of doing things just causes more trouble! :>

If you're explicitly adding a locator to generated xml, then that could be done with only a bit of extra cruft. Put something like "..." and then define

  sub XPointer {
    my($locator)=@_;
    # Skip the whitespace after "at" so that $1 holds only the source name
    if($locator =~ /^at\s*(.*?); line (\d+) col (\d+)$/){
      return "$1#textpoint($2,$3)"; }
    ...}

On the other hand, I've never understood the range notion; what object it should show the start and end of? (and how latexml should know) The example I saw that cropped up in Ticket #168 wasn't even "well-formed in a TeX sense" (ie the TeX wasn't balanced).

dginev commented 10 years ago

Well isn't this one confused ticket! So why not reopen it? :>

Motivation

Rereading Bruce's comments 4 years later makes them a LOT more understandable. Turns out experience helps grok the subject matter. So humour me in revisiting this discussion, hoping I now know enough to be worth talking to :> I definitely agree staying close to TeX is in the spirit of LaTeXML, so let's revisit the idea trying to do just that.

Let me start by reiterating that a good handle on locators is a really great capability on which applications, in particular web editors, or any LaTeXML-driven editor, could build. I want to give a running example of the feature we are all having in mind, implemented by WriteLaTeX.com for their good-old-TeX conversion workflow (the red rectangle holds a popup error message from the LaTeX log):

[Image: write latex localization]

This example wasn't around when we had the discussion 4 years ago, so maybe we can all revisit our expectations and requirements to get such a feature for LaTeXML.

They are parsing the TeX log and using solely the line number for reporting the error. Clearly we can implement that exact error reporting interface without any modifications to the stock LaTeXML. That's great!

But we at KWARC always want to push the boundaries of features, so let's discuss doing so for localization, one last time. The enhancement we have always had in mind was to capture the exact surface substring (i.e. the TeX written in the document by the author, regardless of what it expanded into) which is "of interest" for some feature. In the case of error reporting, "of interest" means that it triggered an error.

Actually, while error reporting could probably be done quite effectively with the "single point" location given at the moment, the real challenge comes in when wanting to have "partial editing". That use case / feature description is:

Right click on a subsubsection (or any other environment / structural block) in the HTML
Easily obtain a localization range pointing back into the TeX source
Open for editing only the range pointed to by the locator range.
Perform your edit and efficiently save only that part of the document.

Now, whether or not this is a realistic feature (Wikipedia has it of course), it's clear having range locators is the foundation for it, but is only part of the magic that will be necessary. But this is the true motivation for using ranges for localization.
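
The "save only that part" step could look roughly like the following Python sketch. It assumes 1-based (line, col) coordinates with an inclusive end and "\n" line endings; both are assumptions for illustration, since the ticket does not fix these conventions, and the helper names are hypothetical.

```python
def to_offset(text, line, col):
    """Convert a 1-based (line, col) pair into a 0-based character offset."""
    lines = text.split("\n")
    # Each earlier line contributes its length plus one newline character.
    return sum(len(l) + 1 for l in lines[: line - 1]) + (col - 1)


def replace_range(text, start, end, new):
    """Replace the source between start and end (inclusive) with new text."""
    a = to_offset(text, *start)
    b = to_offset(text, *end) + 1  # +1 because the end column is inclusive
    return text[:a] + new + text[b:]


source = "a\n\\section{Old}\nb"
edited = replace_range(source, (2, 10), (2, 12), "New")
```

Everything outside the range is copied through untouched, which is what makes the partial edit cheap, but it also means an inaccurate range corrupts the neighbouring source.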

Technical details

OK. Ideally, that meant 2 Cartesian coordinates - (start line, start column) and (end line, end column). And as you've mentioned above, realistically they can be recorded for any token read in by the Mouth. But (you continue to explain) interesting ranges in the XML most often correspond to multiple tokens. For example a theorem would need to point back to a range starting just before \begin{theorem} and ending just after \end{theorem}.

Clearly such a range can not be recorded by simply monitoring Mouth's readToken.

I think that point was something I never had thought about 4 years ago when the ticket was opened, and it is spot on and of very high relevance. So closing the ticket was justified at that point...

But I am wondering whether I now know just enough to be dangerous :> Say we want to commit to constructing such a range, for any XML element that we create. Then wouldn't we have a similar situation to doing the Box size arithmetic, but in this case it is Box locator arithmetic? The question is are there any pitfalls and when? Let me elaborate.

Naive Solution

The naive solution is doing the obvious:

atomic Whatsits will have the range as monitored by readToken in Mouth
- (which would need to be enhanced to monitor the start as well)
Composite Boxes will need to compose together the range locators of their children
- (and optionally their own, if any).
- The start of the range is the smallest of all start coordinates of the children Boxes
- The end of the range is the largest of all end coordinates of the children Boxes.
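
The composition rule above can be sketched in a few lines of Python. This is an illustration only: ranges are assumed to be ((start_line, start_col), (end_line, end_col)) tuples from a single source file, and None stands for a child with no source position (an assumption, e.g. material that came from a macro expansion).

```python
def merge_ranges(ranges):
    """Combine child ranges: smallest start, largest end; None if no child is located."""
    present = [r for r in ranges if r is not None]
    if not present:
        return None
    # (line, col) tuples compare lexicographically, so min/max order them
    # first by line and then by column.
    start = min(r[0] for r in present)
    end = max(r[1] for r in present)
    return (start, end)


children = [((3, 1), (3, 15)), None, ((4, 1), (4, 20)), ((5, 1), (5, 13))]
theorem_range = merge_ranges(children)
```

The sketch silently assumes all children come from the same file; children located in different files are not comparable this way and would need separate handling.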

Could we resume the discussion from such a standpoint? I am still not proficient enough to immediately see the pitfalls, as I don't have the Box processing fully understood (but I think I am barely understanding it now! :>)

I am suggesting that the "location arithmetic" can be done on the Box end, since:

We can do the arithmetic for "tokens written by the author" in Mouth's readToken
As these tokens are digested into Boxes, we have the opportunity to track the location information for each Box.
Boxes almost directly correspond to XML elements when they get absorbed. (I am not feeling at all certain about this claim, however. Especially since there are a bunch of magic floating tricks even on the XML level).

I need to stop and ask for help and more education on this front, as I feel I am stretching the limits of my technical LaTeXML understanding.

Thanks for reading this treatise! :>

kohlhase commented 10 years ago

let me add my support to this re-enterprise, I think that good locator ranges are very very important for LaTeXML-based editing and change management applications.

I fear that there is nothing I can do to help on this, but I will appreciate any progress!

dginev commented 9 years ago

Not crucial for the next few releases, pushing back.

physikerwelt commented 9 years ago

I had discussed something similar with @HowardCohl. We could imagine extending the parallel markup to the TeX string. For example, the input $E=mc^2$ could be represented as

<math xmlns="http://www.w3.org/1998/Math/MathML" id="p1.1.m1.1" class="ltx_Math" alttext="E=mc^{2}" display="inline">
  <semantics id="p1.1.m1.1a">
    <mrow id="p1.1.m1.1.6" xref="p1.1.m1.1.6.cmml" srcrefBegin="1.1" srcrefEnd="1.6">
      <mi id="p1.1.m1.1.1" xref="p1.1.m1.1.1.cmml" srcrefBegin="1.1" srcrefEnd="1.1">E</mi>
      <mo id="p1.1.m1.1.2" xref="p1.1.m1.1.2.cmml" srcrefBegin="1.2" srcrefEnd="1.2">=</mo>
      <mrow id="p1.1.m1.1.6.1" xref="p1.1.m1.1.6.1.cmml" srcrefBegin="1.3" srcrefEnd="1.6">
        <mi id="p1.1.m1.1.3" xref="p1.1.m1.1.3.cmml" srcrefBegin="1.3" srcrefEnd="1.3">m</mi>
        <mo id="p1.1.m1.1.6.1.1" xref="p1.1.m1.1.6.1.1.cmml">&InvisibleTimes;</mo>
        <msup id="p1.1.m1.1.6.1.2" xref="p1.1.m1.1.6.1.2.cmml" srcrefBegin="1.4" srcrefEnd="1.6">
          <mi id="p1.1.m1.1.4" xref="p1.1.m1.1.4.cmml" srcrefBegin="1.4" srcrefEnd="1.4">c</mi>
          <mn id="p1.1.m1.1.5.1" xref="p1.1.m1.1.5.1.cmml" srcrefBegin="1.6" srcrefEnd="1.6">2</mn>
        </msup>
      </mrow>
    </mrow>
    <annotation-xml encoding="MathML-Content" id="p1.1.m1.1b">
      <apply id="p1.1.m1.1.6.cmml" xref="p1.1.m1.1.6">
        <eq id="p1.1.m1.1.2.cmml" xref="p1.1.m1.1.2"/>
        <ci id="p1.1.m1.1.1.cmml" xref="p1.1.m1.1.1">E</ci>
        <apply id="p1.1.m1.1.6.1.cmml" xref="p1.1.m1.1.6.1">
          <times id="p1.1.m1.1.6.1.1.cmml" xref="p1.1.m1.1.6.1.1"/>
          <ci id="p1.1.m1.1.3.cmml" xref="p1.1.m1.1.3">m</ci>
          <apply id="p1.1.m1.1.6.1.2.cmml" xref="p1.1.m1.1.6.1.2">
            <csymbol cd="ambiguous" id="p1.1.m1.1.6.1.2.1.cmml">superscript</csymbol>
            <ci id="p1.1.m1.1.4.cmml" xref="p1.1.m1.1.4">c</ci>
            <cn type="integer" id="p1.1.m1.1.5.1.cmml" xref="p1.1.m1.1.5.1">2</cn>
          </apply>
        </apply>
      </apply>
    </annotation-xml>
    <annotation encoding="application/x-tex" id="p1.1.m1.1c">E=mc^{2}</annotation>
  </semantics>
</math>

The notation srcrefBegin="1.1" srcrefEnd="1.6" is just an example. srcrefBegin should link to the beginning of the expression as originally typed by a human and srcrefEnd to the end of that expression. The format of the position is lineno.colno in this example.

dginev commented 9 years ago

Due to popular interest, pulling in to an earlier milestone :+1: If only it was that easy to actually solve things... Just FYI, this issue is still high on my personal priority list (but as everyone is aware, my LaTeXML time is quite limited nowadays).

brucemiller commented 9 years ago

You should note, however, that Moritz is asking for something subtly different. You wanted pointers into the original source, for editing round-tripping. Moritz wants pointers into the tex attribute, which isn't necessarily the same, although should have the same effect. However, maybe he'd be content with your original request, which seems doable (someday).

dginev commented 9 years ago

In the context of processing a single formula, which is what he's using LaTeXML for, the two are identical. So cookies for everyone :>

brucemiller commented 9 years ago

no, not necessarily; the tex attribute still could end up slightly different

dginev commented 9 years ago

You are of course correct, but I am daydreaming in the perfect world of #432 having been solved, and having the verbatim copy of the source formula in one of the MathML annotation elements. There's nothing wrong with having a cookie now and then :>

brucemiller commented 9 years ago

Alas, even solving #432 doesn't do it; the content form would lean towards the markup used for the content side of XMDuals, but even those might not be exactly the original input!

dginev commented 9 years ago

I am sure if we really wanted to preserve the original input in the MathML annotations (and I don't, at least not for the moment), and we solved this issue then:

Solving this issue implies LaTeXML has precise source locators of where the formula started and ended
LaTeXML has access to all source files
Combining the first two, the verbatim formula source is obtainable, if not through the TeX digestion flow, but from re-reading the source files at the ranges the locators point to, e.g. during MathML post-processing.
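
A rough Python sketch of the "re-read the source at the range" step, assuming 1-based (line, col) coordinates with an inclusive end. The function name is hypothetical and the source is passed in as a string rather than read from disk.

```python
def extract_range(source, start, end):
    """Return the verbatim source text between start and end (inclusive)."""
    lines = source.split("\n")
    (l1, c1), (l2, c2) = start, end
    if l1 == l2:
        return lines[l1 - 1][c1 - 1 : c2]
    # First line from the start column, whole middle lines, last line up to the end column.
    parts = [lines[l1 - 1][c1 - 1 :]] + lines[l1 : l2 - 1] + [lines[l2 - 1][:c2]]
    return "\n".join(parts)


formula = extract_range("Text $E=mc^2$ more\nnext", (1, 7), (1, 12))
```

This only recovers the characters the author typed; it says nothing about the definitions and registers in effect at that point in the document.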

I should try to make cookie references on occasions where eating a cookie isn't this controversial :>

brucemiller commented 9 years ago

You're quite right that (having fixed #101), we could recover the original TeX string. However, that's not really what is wanted for the tex attribute: that string is used to generate images. In general (albeit maybe only rarely), the extract from the original TeX source can't be re-evaluated outside the context that it appeared (various definitions, registers, etc may affect it).

And the general warning still applies: If you give a mouse a cookie, he's going to want a glass of milk!

clange commented 9 years ago

Replying to @physikerwelt's post on parallel TeX/MathML markup. RFC 5147 defines a URI syntax for referring to ranges of text by character/line number.

brucemiller commented 9 years ago

Interesting; might be a cleaner alternative to XPointer, but it doesn't define ranges in the more "intuitive" from line, column to line, column, but rather you'd give a range as 2 char positions. Since char positions are octets, that gets even more into issues of encoding and linefeed counting. Still...thanks for the pointer! (pun!)
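
For comparison, a line/column range can be mapped onto an RFC 5147 style char= fragment roughly as follows. This Python sketch makes several assumptions that should be checked against the RFC before relying on it: positions are counted in characters of the already-decoded text, each line ending counts as one position, and position 0 lies before the first character. The function name is hypothetical.

```python
def rfc5147_char_range(text, start, end):
    """Map 1-based inclusive (line, col) pairs to a '#char=a,b' fragment."""
    lines = text.split("\n")

    def pos(line, col):
        # Position just before the given character, counting each newline as one.
        return sum(len(l) + 1 for l in lines[: line - 1]) + (col - 1)

    a = pos(*start)
    b = pos(*end) + 1  # position just after the last character in the range
    return f"#char={a},{b}"


fragment = rfc5147_char_range("sdf\nsdf\\foo sdf", (2, 4), (2, 8))
```

The conversion needs the full text of every preceding line, which is the same "we must know the line lengths" requirement that came up earlier in this ticket.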

dginev commented 9 years ago

Returning to my original suggestion comment at: https://github.com/brucemiller/LaTeXML/issues/101#issuecomment-35291567

I am now working on a UI for LaTeXML errors in web editing and having a start-end range would be fantastic for the user experience. Many novice LaTeX users feel confused when deciphering the TeX error messages, so they need all the help they can get with making them intuitive.

kohlhase commented 9 years ago

Yes, I could not agree more. We are using the ranges in the error reporting for sTeX. Any progress on this would be very welcome.

brucemiller commented 9 years ago

I agree that this has the potential for a really useful feature, at least for editing automation, if somewhat less for users --- after all, LaTeXML has the same problem as TeX: by the time it knows something is wrong, it's too late to know what went wrong and where.

Also, there is a lot of confusion and complexity that you're overlooking. Firstly, the current locator is simply where the Gullet/Mouth is, ie where it will start reading the next time it wants to read a token. But this is not even the end of the token that it is currently processing: it often will have read ahead an arbitrary number of tokens! Tweaking the Mouth so that it notes where the last-read token started is a helpful step, but only if we are asking about the right token; presumably the token that invoked the current box/constructor.

The "right" way to do it, the only way I can imagine would guarantee you knew where things came from, would be to create tokens that record their own locator (begin & end, if you wish). Currently, tokens are immutable & reusable, which ought to be a big performance boost, though I can't say that this has been benchmarked. I suspect that enhanced tokens would be a big performance hit, but might be wrong.

I'm not saying this feature won't be implemented (someday :> ), just hemming & hawing about the difficulties.

dginev commented 9 years ago

Sounds tricky but solvable. Thanks a lot for the details, I will take a stab at (efficiently) implementing something this week. It is still much easier than, say, rewriting a core LaTeXML module in C+XS :>

kohlhase commented 9 years ago

I think @m-iancu should know about this as well.

dginev commented 8 years ago

We have now received some interest in enabling this feature on Authorea, so I'm adding our golden label here. Maybe I'll have the luck to get some time to work on this again.

kohlhase commented 8 years ago

I am very happy to hear that there may be more progress on this. I think this is a very important base functionality of any converter for editing, change management, and automation.

brucemiller commented 8 years ago

I know if we make this work in the simple cases, soon folks will be wanting to make it work in the tricky cases, so it's worth thinking ahead.

My first observation is that the XML is created from boxes/whatsits that have been created from tokens taken from the gullet, which has taken them from the mouth. Even at the time of digestion, it's too late to ask the gullet (which just asks the mouth) where it is, since in general, there's a lot of read-ahead and pushback. That's why my suggestion was that the Tokens themselves should record where they came from (making them no longer immutable & reusable, but increasing them from 2 to 7 slots!) BUT, that raises a question for macros, where the tokens get replaced by tokens from (eg) some style file.

Perhaps the lookahead/pushback needs to be handled by Gullet adjusting somehow the current locator? Still not straightforward, since macro expansion just pushes the expansion back to be read next!

Ignoring that, it probably becomes more the responsibility of Stomach to record the locator before reading and processing the next token, then, whenever it has completed making some kind of object, get the locator again, which would be (approximately?) a start and end.

dginev commented 7 years ago

BUT, that raises a question for macros, where the tokens get replaced by tokens from (eg) some style file.

Actually, this in itself should be a sufficient reason to doubt that the tokens are the correct place to record locators. I still don't fully understand how complex a binding expansion may become, so that it becomes inaccurate to track the current insertion point in the source mouth.

Could you imagine and hand-wave an example where that would be a problem? I would love to get a complex (but minimal) TeX snippet that demonstrates the difficulties of locator bookkeeping, it can also become a new test case that ensures we keep anything new we invent working correctly.

kohlhase commented 7 years ago

I certainly do not understand all the timing issues in LaTeXML, but let me see whether we can clarify this. I will give a naive run-down of how I see the situation, and you can tell me where I err.

I would have thought that when we have tokenized a macro in the input, then we know where the source of this starts (a) and ends (e). And then we can "put a box around" the result and give that the locator "from-a-to-e". Here "put a box around" is left deliberately vague, since our discussion should be independent of what we design here. If we have enough XML elements, we can just attach the locator to any of them; if we do not (e.g. if the macro expands to a text node inside another), we can add a span.

Example: if we have the input string

\def\foo{bar}
sdf\foo sdf

then a=2.4 and e=2.8 (line.char) and the LaTeXML output should be something like

<p locator="from-1.1-to-2.11">
sdf<span locator="from-2.4-to-2.8">bar</span>sdf
</p>
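
The values a=2.4 and e=2.8 can be reproduced with a small Python scan, assuming 1-based columns, an inclusive end, and that the spaces TeX skips after a control word count as part of its span. This is a sketch that ignores catcodes and comments; the function name and the regular expression are illustrative, not LaTeXML code.

```python
import re

# A control word (backslash + letters) plus the spaces TeX skips after it,
# or a control symbol (backslash + one other character).
CS = re.compile(r"(\\[A-Za-z]+) *|(\\.)")


def control_sequence_spans(source):
    """Return [(name, 'from-L.C-to-L.C')] for each control sequence in source."""
    spans = []
    for lineno, line in enumerate(source.split("\n"), start=1):
        for m in CS.finditer(line):
            name = m.group(1) or m.group(2)
            start_col = m.start() + 1  # 1-based column of the backslash
            end_col = m.end()          # 1-based column of the last consumed character
            spans.append((name, f"from-{lineno}.{start_col}-to-{lineno}.{end_col}"))
    return spans


spans = control_sequence_spans("\\def\\foo{bar}\nsdf\\foo sdf")
```

The scan also reports the \def and \foo on line 1; deciding which occurrences deserve a locator in the output is a separate question from computing their spans.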

Now, there are two problems here:

this is not going to be surjective, i.e. there will be XML elements (or text passages) that are not located.
there are macros that eat arbitrary stuff, e.g. arguments but more as well.

For the surjectivity problem there should be an easy solution: E.g. if we have \def\foo{bar\baz bar} and the definition \def\baz{BAZ} is on line 99 in a file ../macros/foo.sty, then we would get

<p locator="from-1.1-to-2.11">
sdf<span locator="from-2.4-to-2.8">bar<span locator="../macros/foo.sty#from-99.9-to-99.11">BAZ</span>bar</span>sdf
</p>

The eating disorder problem is probably the real one, since the LaTeX processing model is not recursive with a stack of function calls (which would make things simple). For instance, if we have

\def\foo{\bar}
\def\bar#1!{|#1*#1|}
sdf\foo bar!sdf

this should expand to

<p locator="from-1.1-to-3.12">
sdf<span locator="from-3.4-to-3.12">|bar*bar|</span>sdf
</p>

So I guess that the handling of \foo should delegate the handling of the locator to \bar once that is called. Are these the examples @dginev was looking for?
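To make the delegation idea concrete, here is a toy sketch in Python. It is not how LaTeXML works internally; the two macros are hard-coded, and `tokenize` and `expand` are names invented for this illustration. The assumption is that source tokens carry a position while tokens inserted by an expansion carry none, and that the range opened by `\foo` stays open and is extended by every source token that `\bar` reads afterwards.

```python
import re

TOKEN = re.compile(r"\\[A-Za-z]+ *|.")  # control word plus swallowed spaces, or one character

def tokenize(text, lineno):
    """Split one source line into (text, line, first_col, last_col); columns 1-based, inclusive."""
    toks = []
    for m in TOKEN.finditer(text):
        word = m.group()
        if word.startswith("\\") and len(word) > 1:
            word = word.rstrip(" ")
        toks.append((word, lineno, m.start() + 1, m.end()))
    return toks

def expand(tokens):
    """Expand \\foo -> \\bar and \\bar#1! -> |#1*#1|; return (text, locator or None) pieces."""
    pending = list(tokens)   # input still to be read, front first
    pieces = []
    span = None              # [start, end] of the invocation currently being expanded
    while pending:
        text, line, first, last = pending.pop(0)
        if text in (r"\foo", r"\bar") and line is not None:
            span = [(line, first), (line, last)]            # invocation read from the source
        if text == r"\foo":
            pending.insert(0, (r"\bar", None, None, None))  # synthetic token, no source position
        elif text == r"\bar":
            arg = []
            while pending:
                tok = pending.pop(0)
                if tok[1] is not None:
                    span[1] = (tok[1], tok[3])              # each source token read extends the range
                if tok[0] == "!":
                    break                                   # delimiter ends the argument
                arg.append(tok[0])
            body = "".join(arg)
            (l1, c1), (l2, c2) = span
            pieces.append((f"|{body}*{body}|", f"from-{l1}.{c1}-to-{l2}.{c2}"))
            span = None
        else:
            pieces.append((text, None))
    return pieces

pieces = expand(tokenize(r"sdf\foo bar!sdf", 3))
print(pieces)
```

The piece produced for the invocation carries the range from the backslash of `\foo` up to the `!` delimiter, even though the argument was consumed by `\bar` and not by `\foo` itself.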

dginev commented 7 years ago

I'm actually thinking about a more convoluted example where, for instance, an error is found inside the expansion of tokens that were never in the original input stream.

The primary use case I have in mind here is reporting the error as precisely as possible inside an editor. The minimal requirement for that is an accurate locator range on the error message; whether it is recovered from the XML output or from the log stream is equally fine. It would be good to have both.

So say someone defines some weird macro that uses \color from xcolor, but makes a typo and writes \colour:

\documentclass[]{article}
\usepackage{xcolor}
\begin{document}
\def\example[#1]_#2_{\colour{#1}#2}

\example[red]_{testing here}_

\end{document}

LaTeX's error here is (borrowed from Overleaf's editor):

Undefined control sequence.
\example [#1]_#2_->\colour 
                           {#1}#2
l.6 \example[red]_{testing here}_

The location information is "line 6", and as mentioned I think LaTeXML is already good enough to provide that. Since LaTeX is so programmable, it actually isn't clear to the typesetting engine that there is a macro typo, as it may alternatively be the case that there was a timing issue with a higher-level macro definition, which didn't define \colour in time, or alternatively it was defined in a wrong scope.

So there is nothing we can do for the user in terms of identifying the intention behind the error, but we should still be able to:

1. Point to the "invocation" of \example that latexml used, say line:col 6:1-6:30.
2. Possibly point to the location where the undefined macro is written inside the definition, since an author who needs to debug \colour should ideally be able to "jump" right to it. That's line:col 4:22-4:29.

I think to achieve 1. tracking the position in the Mouth is sufficient. To achieve 2. however, tokens that are used as parts of definitions need to carry location information. Similarly, if tokens are created by latexml bindings, we should probably point to the location information in the Perl binding file. (e.g. LaTeX.pool line:col l1:c1-l2:c2).
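A small Python sketch of what such a two-part report could look like. Everything here is hypothetical: `Loc`, `Token`, `locate` and `report_undefined` are invented for the illustration, `main.tex` is a placeholder file name, and the column ranges use an exclusive end to match the numbers quoted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Loc:
    source: str
    line: int
    col_from: int
    col_to: int  # exclusive end column

    def __str__(self):
        return f"{self.source} {self.line}:{self.col_from}-{self.line}:{self.col_to}"

@dataclass(frozen=True)
class Token:
    text: str
    loc: Loc  # where this token was originally read from

SOURCE = r"""\documentclass[]{article}
\usepackage{xcolor}
\begin{document}
\def\example[#1]_#2_{\colour{#1}#2}

\example[red]_{testing here}_

\end{document}"""
LINES = SOURCE.split("\n")

def locate(lineno, fragment):
    """Loc of `fragment` on the 1-based line `lineno` of the sample source."""
    col = LINES[lineno - 1].index(fragment) + 1
    return Loc("main.tex", lineno, col, col + len(fragment))

# Location 1: the invocation, known from the position in the mouth.
invocation = locate(6, r"\example[red]_{testing here}_")
# Location 2: the offending token remembers where it was written in the definition.
bad = Token(r"\colour", locate(4, r"\colour"))

def report_undefined(token, invocation):
    return (f"Undefined control sequence {token.text}: "
            f"invoked at {invocation}, written at {token.loc}")

print(report_undefined(bad, invocation))
```

The point of the sketch is only that the second location has to travel with the token itself, since by the time the error is raised the mouth is positioned on line 6.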

I can't yet think of an example where an error occurs in the main digestion pass of an article and the resulting location report does not correspond to the location info of the last invocation built from tokens actually present in the source file.

dginev commented 7 years ago

Also, re:

The eating disorder problem is probably the real one, since the LaTeX processing model is not recursive with a stack of function calls (which would make things simple).

I think it is effectively a recursive model of expansion+digestion, but it is completely mutable and allows metaprogramming - one has very late binding for definitions, and it is possible to redefine anything up to the point where it is expanded.

That said, the actual location information of each individual input token is fixed at the start of the program, as it lies in one of the many input resources (web resources, literal, .tex files) or definition files (.cls, .sty, .ltxml). So recording that information and propagating it through the pipeline until it reaches the result DOM does sound like the theoretically most powerful approach. Maybe we should try it and measure the performance impact? Running make test plus a tikz test should already be a strong indicator of whether performance degrades significantly.
