bytesparadise / libasciidoc

A Golang library for processing Asciidoc files.
Apache License 2.0
199 stars 24 forks source link

Constrained quoted text not fully constrained #630

Open gdamore opened 4 years ago

gdamore commented 4 years ago

An undocumented feature of asciidoctor is that constrained text that is followed by a punctuation must not be followed by an alphanumeric.

That is this:

`unbalanced `?stuff

Is not considered to contain <code> elements, but the following is:

`unbalanced `? stuff

The pattern for constrained text must limit what it matches so that it does not match punctuation if that punctuation is immediately followed by an alphanumeric.

Note that following instead with another punctuation is fine. For example:

`unbalanced`?#stuff

renders as a code element followed by literal ?#stuff

gdamore commented 4 years ago

I think I'll probably fix this bug ahead of #622 --- I ran into this while testing edge cases, and I think fixing it first will give more satisfactory results.

gdamore commented 4 years ago

I've started reading the asciidoctor syntax for this -- and it's eye opening. It's a mess of regular expressions indeed, and seems to recursively process them.

It may be difficult to achieve perfect parity with asciidoctor, given this. There are probably some constructs which don't really map well to PEG grammars, and vice versa. I think these are at the very edge of the parsing -- honestly mostly cases where if a user is depending on the interpretation he can make changes to make it unambiguous.

I don't think any of these short-comings will affect typical use cases, and it may actually be misguided to get too wrapped around the axle on perfect compatibility as I believe the syntax will likely evolve a little bit once other folks try to formalize the grammar using EBNF or something like that.

gdamore commented 4 years ago

So I've become almost completely convinced that its going to be near impossible to make PEG grammar that obeys the precise encoding that asciidoctor seems to use -- at least not without going through multiple passes, which is of dubious merit.

The issue remains coping with the conditions of constraint, and in particular the cases before hitting constraint. We can make some fairly complex rules that cover most of the cases, but it gets really hard to cover them all because of some things that asciidoctor does.

For example, consider:

"`something`"`here`

Our parser would prefer to render this (eliding the <div> and <p> elements):

&#8220;something&#8221;<code>here</code>

Asciidoctor instead renders this as:

&#8220;something&#8221;`here`

The reason is astonishing. It doesn't match against the higher level quoted string element. Instead it sees a semicolon at the trail of the second HTML entity, and has a specific rule which causes that to fail. The substitution of the smart quotes was performed in an earlier step, and it is doing pattern matching on the results of that earlier pass!

Conversely, using

*something*`here`

Leaves you with reasonably expected:

<strong>something</strong><code>here</code>

Why? Because </strong> doesn't end in the prohibited semicolon. I consider the results to be border-line insane -- trying to rationalize what asciidoctor does and what it considers a "word" boundary is challenging, to say the least.

I expect that until we get a formal grammar, we might be best off just doing what we do, which isn't entirely unreasonable if not precisely compatible.

Document authors frustrated with this should probably take more effort to avoid the edges of the rules around constraints. Using unconstrained markup is certainly much safer.

I think even the asciidoctor authors recognize that unconstrained markup has been the source of many problems with other formats like markdown, so this might not even be that much of a battle with the time comes to standardize.

At any rate, I'm kind of exhausted trying to solve this problem -- enough so that for me I'd rather just avoid this edge case in my source documents. I'm unassigning myself, and taking out of the 0.5.0 milestone for now.

This ticket should probably remain open as a reminder, and as a guide to anyone else who struggles with the same issue.

xcoulon commented 4 years ago

thanks for all the time you spent on this issue @gdamore! I would suggest closing the issue. Users who deal with this problem could find this issue and see all your investigation.