Support non-breaking-space in subscripts

chtenb commented 1 year ago

Origin: https://github.com/asciidoctor/asciidoctor/issues/3951

Would it be possible to support the non-breaking-space character in a subscript? I make heavy use of subscripts with spaces, and I want my source text to look as close to the intended rendering as possible. Having {sp} littered all over the text conflicts with this goal.

However, I have no problem using Unicode in my text, and since a non-breaking space looks fine in plain text too, I figure supporting it would be a nice feature and would solve my use case.

Asciidoctor seems to already support this, but asciidoctor.js does not.

mojavelinux commented 1 year ago

Please provide an example of what does not work. Asciidoctor and Asciidoctor.js use exactly the same code. Unless there is somehow a problem with the transpiler (which I doubt), I expect it to work the same way in both.

chtenb commented 1 year ago

Here is an example

Test~a1 a2~

Note that the space between a1 and a2 is a non-breaking space.

mojavelinux commented 1 year ago

Much to my surprise, there does seem to be a difference in behavior between Asciidoctor and Asciidoctor.js here. It looks like there's a mismatch between the interpretation of the regular expression character class \S.

Ruby:

"\u00a0".match?(/\S/)
# => true

JavaScript:

/\S/.test('\u00a0')
// => false

Thus, Asciidoctor.js would need to add a check for this character to match the behavior of Asciidoctor.

In AsciiDoc, no-break space should not be considered a space character. (In Ruby, \S is defined as [^ \t\r\n\f\v]. We don't expect to find \f and \v in a document, at least not where markup is interpreted, so it's effectively [^ \t\r\n].)

chtenb commented 1 year ago

According to the JS documentation at

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes

\S is defined as [^\f\n\r\t\v\u0020\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff].
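To see the full extent of the mismatch, the following minimal sketch lists every code point in the Basic Multilingual Plane that JavaScript's \s matches but an ASCII-only class (the Ruby definition quoted earlier in the thread) does not. The names rubyStyleSpace and differing are made up for illustration; no-break space (U+00A0) should appear in the output.

```javascript
// Ruby's \s is ASCII-only: space, tab, CR, LF, form feed, vertical tab.
const rubyStyleSpace = /[ \t\r\n\f\v]/;

// Collect every BMP code point that JavaScript's \s matches
// but the ASCII-only class does not.
const differing = [];
for (let cp = 0; cp <= 0xffff; cp++) {
  const ch = String.fromCharCode(cp);
  if (/\s/.test(ch) && !rubyStyleSpace.test(ch)) {
    differing.push('U+' + cp.toString(16).toUpperCase().padStart(4, '0'));
  }
}

// Print the code points on which the two engines disagree.
console.log(differing.join(' '));
```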

chtenb commented 1 year ago

Perhaps the asciidoc code could refrain from using the \S and \s classes at all, and instead use explicit character classes?

mojavelinux commented 1 year ago

We plan on defining spacing characters more clearly in the specification. To solve the issue at hand, I would focus on just the regexp in question.

chtenb commented 1 year ago

I'm happy to provide a PR, but I'm unfamiliar with this codebase. Do you have any pointers to where this regexp is located?

mojavelinux commented 1 year ago

It's a complex issue because the regexp is defined in Asciidoctor Ruby. Anytime there are differences in the interpretation of regexp, it requires adding a condition in Asciidoctor Ruby that tells the transpiler which alternative to use. It's not a small change and requires coordination between Guillaume and me. And right now, we're both very busy with other things.

You say you have to litter your document with {sp} without this. Can you actually provide a defensible use case for when subscript and superscript require spaces? One of the key reasons the language doesn't allow spaces (or some other way to express it) is because I've never been able to find a common case when it's truly needed.

chtenb commented 1 year ago

> It's a complex issue because the regexp is defined in Asciidoctor Ruby.

Wouldn't it simply be a matter of replacing \S with [^ \t\r\n] in the regexp in question in the Ruby source? That seems like a portable way to do it.
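As a rough illustration of that suggestion, the sketch below compares \S with the explicit class in JavaScript. The two patterns are simplified stand-ins for a subscript rule, not Asciidoctor's actual regexp, and the variable names are hypothetical.

```javascript
// Simplified stand-ins for a subscript pattern (assumption: NOT the real Asciidoctor regexp).
const withShorthand = /~(\S+?)~/;
const withExplicitClass = /~([^ \t\r\n]+?)~/;

// The character between a1 and a2 is a no-break space (U+00A0).
const nbspSource = 'Test~a1\u00a0a2~';
// Here it is an ordinary space (U+0020).
const spaceSource = 'Test~a1 a2~';

// In JavaScript, \S excludes U+00A0, so the shorthand pattern does not match.
console.log(withShorthand.test(nbspSource));

// The explicit class accepts U+00A0, so the subscript content is captured.
const match = withExplicitClass.exec(nbspSource);
console.log(match !== null && match[1] === 'a1\u00a0a2');

// An ordinary space is still rejected by the explicit class.
console.log(withExplicitClass.test(spaceSource));
```

Because the explicit class names its members, it should behave the same way in Ruby and in JavaScript, which is the portability argument made above.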

> Can you actually provide a defensible use case for when subscript and superscript require spaces?

I've started using AsciiDoc for some simple mathematics that I want to write in my text editor. I explicitly do not want to use LaTeX (or the stem block, for that matter), because I want the notation to work well in plain text, so that it is easily readable in a text editor and shareable across text-only channels without looking clunky. Hence I use Unicode in combination with some table blocks and super/subscript.

See this for an example: https://chtenb.dev/?page=cat With the plain text source being: https://raw.githubusercontent.com/chtenb/chtenb.github.io/master/blog/cat.adoc

You could argue that AsciiDoc is not the right tool for the job, but it seems to cover my needs quite well at the moment, except for this little issue. Moreover, there does not seem to be an alternative that suits this use case better.

mojavelinux commented 1 year ago

I will argue that AsciiDoc is not the right tool for this job. This is asking the syntax to do what it was not designed to do.

No change is a simple one. It requires writing tests and making releases, none of which I have time for right now. My focus is on developing the first draft of the specification, and will be for the better part of this year. You are free to patch Asciidoctor for your own personal use if you decide you can't live without this requested change.

chtenb commented 1 year ago

> This is asking the syntax to do what it was not designed to do.

I'm not sure what you mean by this, since you stated that the current behavior of asciidoctor.js regarding non-breaking spaces and subscripts is in fact a bug. It's hard to defend against authors/maintainers saying you're using their tool wrong. But it's also not very useful if there is no obvious alternative :)

I completely understand other people don't have time to solve my issues, which is why I offered to provide a PR. But if the codebase/project infrastructure is too complex for a layman to handle, this is not a viable course of action.

Maintaining a separate fork is too much of a hassle for me. I will implement a workaround that replaces non-breaking spaces with some uncommon Unicode character before passing the AsciiDoc source to Asciidoctor.js, and reverses the replacement in the generated HTML. That seems to work well enough.
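A minimal sketch of such a workaround might look like the following. The function name convertPreservingNbsp, the placeholder U+E000 (a private-use code point assumed not to occur in the document), and the stand-in converter are all hypothetical; the sketch also assumes the converter passes the placeholder through to the HTML unchanged.

```javascript
const NBSP = '\u00a0';
// Hypothetical placeholder: a private-use code point assumed to be absent from the source.
const PLACEHOLDER = '\ue000';

// Swap NBSP for the placeholder, run the given converter, then swap it back.
// `convert` is any function that takes AsciiDoc source and returns HTML.
function convertPreservingNbsp(source, convert) {
  if (source.includes(PLACEHOLDER)) {
    throw new Error('Source already contains the placeholder character');
  }
  const html = convert(source.split(NBSP).join(PLACEHOLDER));
  return html.split(PLACEHOLDER).join(NBSP);
}

// Stand-in converter for demonstration only: it mimics a subscript rule based on \S.
const standInConvert = (text) => text.replace(/~(\S+?)~/g, '<sub>$1</sub>');

const demoHtml = convertPreservingNbsp('Test~a1\u00a0a2~', standInConvert);
console.log(demoHtml === 'Test<sub>a1\u00a0a2</sub>');
```

With Asciidoctor.js, the converter argument could be a small wrapper such as (text) => asciidoctor.convert(text), assuming an Asciidoctor.js instance named asciidoctor has been created.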

Thanks so far for the quick replies and the swift diagnosis!

mojavelinux commented 1 year ago

> since you stated that the current behavior of asciidoctor.js regarding non-breaking spaces and subscripts is in fact a bug.

First of all, I stated that it's a difference from Asciidoctor in regard to whether you can make use of a workaround. Since AsciiDoc is currently ambiguous about what counts as a spacing character, it doesn't warrant calling it a bug. It's an idiosyncrasy at best. (The very type of idiosyncrasy we are working to address in the specification.)

Now that I understand what you're trying to use the workaround for, I don't agree it's the right thing to do. Non-break space has a different meaning than space in the layout and it thus changes how the text is arranged. I had only suggested it as a quick workaround that you could use in Asciidoctor Ruby, but since there's an unspecified difference in Asciidoctor.js, that workaround is not available there. If you need a space in superscript or subscript, it must be written as {sp} to be portable.

In AsciiDoc, superscript and subscript are not designed to be used for elaborate STEM expressions. That's not what the language was designed to do. They are for simple uses of superscript and subscript (such as H₂O) and for notes (such as text^{citation needed}). Anything beyond that warrants the use of the STEM support / macro.

I will keep this discussion in mind when working on the specification, but I'm not going to spend any more time on this issue now because, as it stands, it's unspecified behavior. Asciidoctor Ruby and Asciidoctor.js are the way they are and are passing the tests we wrote for the specified behavior.

mojavelinux commented 1 year ago

> Maintaining a separate fork is too much of a hassle for me. I will implement a workaround that replaces non-breaking-spaces with some uncommon unicode character before passing the asciidoc source to asciidoctor.js, and reverse the replacement in the generated html. That seems to work well enough.

:+1:

asciidoctor / asciidoctor.js

Support non-breaking-space in subscripts #1688