asciidoctor / atom-language-asciidoc

⚛ AsciiDoc language package for the Atom editor.
https://atom.io/packages/language-asciidoc
MIT License
42 stars 20 forks source link

repair regex in #quote-paragraph macro #197

Open andrewcarver opened 2 years ago

andrewcarver commented 2 years ago

Description

The regex for the quote-paragraph rule is not as robust as it could be -- and also, either not as efficient as it could be, or else (if you're REALLY wanting to refer to captured groups) capturing one group incorrectly.

Here's the regex-segment that is in common between the regexes on lines 35 and 37 of the file ("quote-paragraph-grammar.cson"):

^\\[(quote|verse)((?:,|#|\\.|%)([^,\\]]+))*\\]$

The robustness problem is that as it stands, the regex will not match any block-attribute-list that contains, itself, an inline macro -- such as a URL macro (see screen-shot) or a bibliographic-citation macro (see the below syntax example). And no, it doesn't help, if you escape the closing bracket of the inline macro.

This problem is located in the character-group in the regex: In order to get the character-class not to run away with the rest of the quote-block, there was included in it the closing ']' of the block-attribute-list; but one could have achieved the same thing better by making the quantifier "lazy":

instead of ([^,\\]]+) one could just write ([^,]+?)

With that change, the regex matches also these inline-macro-containing quote-block-attribute-lists (see screen shots).

As for the other problem: If you don't really need to capture any groups (i.e., you're not gong to refer to them, as is not done inside the present macro), then the regex will run more efficiently if the capturing groups are made non-capturing groups.

On the other hand, if we are wanting to capture those groups, then we need to fix the second capturing group:

((?:,|#|\\.|%)([^,\\]]+))*

Putting a quantifier on a capturing group, like this, leads to the problem which "Regex Buddy" explains thus:

You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*» Or, if you don’t want to capture anything, replace the capturing group with a non-capturing group to make your regex more efficient.

I took the first, safer route for this problem, in my code-fix, in case we do want to capture:

((?:(?:,|#|\\.|%)([^,]+?))*)

If capturing is superfluous, though, then we can just drop the outermost (the capturing) group -- like so:

(?:(?:,|#|\\.|%)([^,]+?))*

Syntax example

[quote, Louis Berkhof, 'cite:[ Berkhof1975, prefix= "see ", page= 29]']
____
[.nonsuccessor-para]
The study of the History of Dogma as a separate discipline is of a comparatively recent date. ... Since the Church of Rome proceeded on the assumption, and still maintains the position, that dogma is unchangeable, it may be said that the Reformation by breaking with that view opened the way for a critical treatment of the history of dogma. Moreover, [Protestantism] was a movement which, in its very nature, was well calculated to furnish a special incentive for such a study. It raised many questions respecting the nature of the Church and her teachings, and sought to answer these not only in the light of Scripture but also with an appeal to the Fathers of the early Church, thus furnishing a direct and powerful motive for a [critical] historical study of dogma.
____

Screenshots

Example-1_Screen-with-OLD-regex Example-1_Screen-with-NEW-regex

andrewcarver commented 2 years ago

P.S. On those screenshots, my themes-configuration was:

themes: [ "atom-light-ui" "one-dark-syntax" ]

andrewcarver commented 2 years ago

P.P.S. I was relying on my screenshot filenames to identiy what was what -- not realizing the filenames would get changed!

So: The first screenshot is the view with the OLD regex; the next screenshot is the view with the NEW regex.

ldez commented 2 years ago

Hello,

as we have an issue with our CI, can you send the output of the tests launch?

andrewcarver commented 2 years ago

Sure: I ran the tests on both of the versions I suggested above -- the one with the outer-wrapping capturing group, and the one without.

The results were the same for both, except that with the capturing group included, there was an extra "deprecated warning". Here are the full results, including that extra warning:

(node:17976) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\pathwatcher\build\Release\pathwatcher.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:17976) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\oniguruma\build\Release\onig_scanner.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:17976) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\superstring\build\Release\superstring.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:17976) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\tree-sitter\build\Release\tree_sitter_runtime_binding.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:17976) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\@atom\nsfw\build\Release\nsfw.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:17976) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\git-utils\build\Release\git.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:17976) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\scrollbar-style\build\Release\scrollbar-style-observer.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:17976) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
..............................................................................................................................................................................................................................................................................................................................................................................

Finished in 20.741 seconds
366 tests, 2483 assertions, 0 failures, 0 skipped

The warning included here but NOT included when I tested without the extra capturing group, is that last one:

(node:17976) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.

If we're not really wanting to capture the match inside this group, then the above warning seems more than enough reason to omit that extra capturing group. (We might even want to make ALL the remaining groups non-capturing.) Please let me know, if you'd like me to change my PR code to omit that group.

I'd like to add a tangential comment: My earlier screenshots had attribute-list-highlighting that was ... slightly incorrect. This was only because of the styling, not because of the markup: I was running, at the time, a custom grammar package of my own that 'include'd the main grammar files from your package (with just the regex tweaked); but I'd forgotten, as yet, to copy over your 'styles' folder!

For the above tests, though, I edited my installed copy of your original package (doing it the proper way -- by editing the quote-paragraph grammar-file and re-generating the main file). So, here's a better screenshot, which I took at that time, showing the correct attribute-list-coloring:

Example-1_Screen-with-NEW-regex-(and INCLUDING-the-LESS-file)

If you look close, you might notice that, within the block-attribute-list, the inline macro's opening bracket is marked up like the text surrounding it, but its closing bracket is not (separately) marked up at all, but appears as just text, like the opening and closing bracket of the block-attribute-list. I missed this anomaly at first: you have to look pretty close, even to spot it. So, I'm not worried about it -- it's a perfectly acceptable price to pay, IMHO (unless we can think of a neat fix for it).

andrewcarver commented 2 years ago

P.S. One way to fix that little anomaly might be to use a recursive regex -- which Ruby started supporting in v. 2.0, it seems.

ldez commented 2 years ago

coffeescript is not ruby, it's not the same regex support.

andrewcarver commented 2 years ago

Hmm. Well, as far as I'm aware, JavaScript (must less CoffeeScript?) supports neither recursive regex nor regex with subroutine calls, which would ba another way to approach this ...

andrewcarver commented 2 years ago

Silly idea anyway, I guess: The problem is not with the regex we've fixed, but with the action downstream -- and specifically, I think, the "block-attribute-inner" macro, which our macro "include"s.

andrewcarver commented 2 years ago

I've suggested two changes in a regex in the file "repositories/partials/block-attribute-inner-grammar.cson" -- an alternate branch with a look-ahead, and a clear-typo fix. Also, I removed that outer, capturing group that seemed extraneous in the earlier regex-fix.

WIth those regex-fixes in place, I got the following test-run:

(node:18632) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\pathwatcher\build\Release\pathwatcher.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:18632) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\oniguruma\build\Release\onig_scanner.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:18632) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\superstring\build\Release\superstring.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:18632) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\tree-sitter\build\Release\tree_sitter_runtime_binding.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:18632) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\@atom\nsfw\build\Release\nsfw.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:18632) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\git-utils\build\Release\git.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:18632) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\scrollbar-style\build\Release\scrollbar-style-observer.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
..............................................................................................................................................................................................................................................................................................................................................................................

Finished in 17.457 seconds
366 tests, 2483 assertions, 0 failures, 0 skipped

I may be wrong, but it seems to me that these regex-fixes may solve the problem!

Example-1_Screen-with-NEW-regexes-for-#QUOTE-PARAGRAPH-and-#BLOCK-ATTRIBUTE-INNER

andrewcarver commented 2 years ago

Hmm ... on second thought, that look-ahead is not QUITE robust enough: it could get fooled by an apostrophe (Fred\'s car) that's not actually closing the attribute-value!

So maybe something like this would do it:

\\](?=[^',.#%]*'\\s*(,|\\]$))

I gave that a quick trial run, and it seems to work (on my limited sample):

Example-1_Screen-with-NEW-regexes-for-#QUOTE-PARAGRAPH-and-#BLOCK-ATTRIBUTE-INNER_and_latter's_look-ahead_beefed-up

Haven't yet run the test again, though.

Thoughts?

andrewcarver commented 2 years ago

OK, one more suggestion for beefing up the look-ahead, then I shut up:

How about we also check, that any character just before the apostrophe (but, like the apostrophe, subsequent to the \\] !) is NOT a backslash (\\\\)?

\\](?=[^',.#%]*?[^\\\\]?'\\s*(,|\\]$))

On a quick test, this also seems to work...

So with that, the whole regex becomes: (?<=\\{|,|\\.|#|\"|'|%)((?:[^\\],.#%]|\\](?=[^',.#%]*?[^\\\\]?'\\s*(,|\\]$)))+)

Here's how Regex Buddy analyzes this regex (but please note that in Regex Buddy, on a Windows, CR/LF machine, just before any $ I have to put \r*, to make the regex work):

andrewcarver commented 2 years ago

Well, it didn't take me long to come up with an example breaking that.

But WAIT A SECOND -- maybe the reason I'm having to work so hard to "beef up" the lookahead, is that I'm "looking ahead" for a too-limited set of things:

Instead of checking to make sure my \\] is followed by an apostrophe (of a certain class), don't I want to accept and match the \\] if it's followed (on the same line) by ANYTHING that's not a space-character AND not the end of the line -- that is, if the \\] is not the one closing the attribute-list??

If so, then this look-ahead should solve the problem:

(?=[ \\t\\f]*[^\\s])

-- which puts us here, for the complete regex:

(?<=\\{|,|\\.|#|\"|'|%)((?:[^\\],.#%]|\\](?=[ \\t\\f]*[^\\s]))+)

andrewcarver commented 2 years ago

Test results for this last commit:

(node:17956) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\pathwatcher\build\Release\pathwatcher.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:17956) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\oniguruma\build\Release\onig_scanner.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:17956) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\superstring\build\Release\superstring.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:17956) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\tree-sitter\build\Release\tree_sitter_runtime_binding.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:17956) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\@atom\nsfw\build\Release\nsfw.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:17956) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\git-utils\build\Release\git.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
(node:17956) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\scrollbar-style\build\Release\scrollbar-style-observer.node'. This is deprecated, see https://github.com/electron/electron/issues/18397.
..............................................................................................................................................................................................................................................................................................................................................................................

Finished in 17.022 seconds
366 tests, 2483 assertions, 0 failures, 0 skipped

Screen-shot is same as before.

andrewcarver commented 2 years ago

It might be worth mentioning, that the pattern "alternate-branch-with-lookahead" that I applied to the last regex (to allow it to match a contained, inline macro's closing bracket) affords us an alternative fix for the first regex, as well -- one which would avoid lazy quantification (with the back-tracking it introduces).

Here's the current state of the regex-replacement we've substituted into #quote-paragraph -- followed here by the alternate we could use, avoiding lazy quantification:

^\\[(quote|verse)(?:(?:,|#|\\.|%)([^,]+?))*\\]$ ^\\[(quote|verse)(?:(?:,|#|\\.|%)([^,\\]]|\\](?=[ \\t\\f]*\\S))+)*\\]$

In trying both of them, my eyes didn't notice any difference in efficiency. But it MIGHT be that a lLITTLE less risk of runaway backtracking attaches to the latter regex ... ??

andrewcarver commented 2 years ago

Just in case Regex Buddy's verbalization of that last regex would be helpful

Created with RegexBuddy

mojavelinux commented 2 years ago

I'm fine with giving this a try, and I appreciate the detailed analysis.

I have one question, though. What is \f in there for? As far as I know, that's not a character AsciiDoc should ever be considering. The only valid space-like characters in AsciiDoc are space and tab (i.e., [ \\t]).

ggrossetie commented 2 years ago

as we have an issue with our CI, can you send the output of the tests launch?

@ldez we should probably migrate to GitHub Action, any objection?

andrewcarver commented 2 years ago

I have one question, though. What is \f in there for? As far as I know, that's not a character AsciiDoc should ever be considering.

@mojavelinux Lol. Yes, it's a bit silly in this context, isn't it? I'd just noticed that it's included in the whitespace category \s and I threw it in for completeness. But I'm happy to omit it.

andrewcarver commented 2 years ago

Just in case anyone's intrigued, at all, by my last suggestion:

... that the pattern "alternate-branch-with-lookahead" that I applied to the last regex ... affords us an alternative fix for the first regex, as well -- one which would avoid lazy quantification (with the back-tracking it introduces)

I ran the specs on it:

(node:19432) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\pathwatcher\build\Release\pathwatcher.node'. This is deprecated, see https://github.com/electron/electron/issues/18397. (node:19432) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\oniguruma\build\Release\onig_scanner.node'. This is deprecated, see https://github.com/electron/electron/issues/18397. (node:19432) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\superstring\build\Release\superstring.node'. This is deprecated, see https://github.com/electron/electron/issues/18397. (node:19432) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\tree-sitter\build\Release\tree_sitter_runtime_binding.node'. This is deprecated, see https://github.com/electron/electron/issues/18397. (node:19432) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\@atom\nsfw\build\Release\nsfw.node'. This is deprecated, see https://github.com/electron/electron/issues/18397. (node:19432) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\git-utils\build\Release\git.node'. This is deprecated, see https://github.com/electron/electron/issues/18397. (node:19432) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\scrollbar-style\build\Release\scrollbar-style-observer.node'. This is deprecated, see https://github.com/electron/electron/issues/18397. ..............................................................................................................................................................................................................................................................................................................................................................................

Finished in 16.644 seconds 366 tests, 2483 assertions, 0 failures, 0 skipped

The main arguments I can see for it (without doing some kind of efficiency-analysis above my pay-grade) are that it makes the regex-code somewhat easier to understand:

  1. this "alternate-branch-with-lookahead" pattern is a little more self-explanatory, as to what the regex-change is trying to achieve;
  2. RE-using that pattern puts both regex-changes "in parallel", so to speak, which shows, also, that the two regexes are alike in what they're trying to achieve
mojavelinux commented 2 years ago

I'm definitely intrigued. And that sounds like great news. Is there anything else you think needs to be solved before this PR is considered ready?

andrewcarver commented 2 years ago

Nope. I'll go ahead and make that last, "intriguing" change, then -- in a little while ... unless you say "hold up" ...

andrewcarver commented 2 years ago

It all looks ready, now, to me ...

andrewcarver commented 2 years ago

Is there anything else you think needs to be solved before this PR is considered ready?

@mojavelinux I replied in the negative; but now I've got to back-pedal:

Someone please correct me, if I'm wrong: but I don't believe we need the capturing group that I let stand in the regex for block-attribute-inner-grammar.cson (nor is there need to replace it with a non-capturing group). After all, the only match to which the code there pays attention is (I think) the overall match (match 0):

{ comment: "attributes" name: "markup.meta.attribute-list.asciidoc" match: "(?<=\\{|,|\\.|#|\"|'|%)((?:[^\\],.#%]|\\](?=[ \\t]*\\S))+)" captures: "0": patterns: [ { include: "#attribute-reference" } ] }

Any reason we shouldn't delete that capturing group -- for efficiency?

andrewcarver commented 2 years ago

F.y.i., the spec-run on this deletion:

C:\Users\acarver\.atom\packages\language-asciidoc>atom --test spec

(node:252) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\pathwatcher\build\Release\pathwatcher.node'. This is deprecated, see https://github.com/electron/electron/issues/18397. (node:252) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\oniguruma\build\Release\onig_scanner.node'. This is deprecated, see https://github.com/electron/electron/issues/18397. (node:252) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\superstring\build\Release\superstring.node'. This is deprecated, see https://github.com/electron/electron/issues/18397. (node:252) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\tree-sitter\build\Release\tree_sitter_runtime_binding.node'. This is deprecated, see https://github.com/electron/electron/issues/18397. (node:252) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\@atom\nsfw\build\Release\nsfw.node'. This is deprecated, see https://github.com/electron/electron/issues/18397. (node:252) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\git-utils\build\Release\git.node'. This is deprecated, see https://github.com/electron/electron/issues/18397. (node:252) Electron: Loading non-context-aware native module in renderer: '\\?\C:\Users\acarver\AppData\Local\atom\app-1.57.0\resources\app.asar.unpacked\node_modules\scrollbar-style\build\Release\scrollbar-style-observer.node'. This is deprecated, see https://github.com/electron/electron/issues/18397. ..............................................................................................................................................................................................................................................................................................................................................................................

Finished in 21.714 seconds 366 tests, 2483 assertions, 0 failures, 0 skipped

mojavelinux commented 2 years ago

I'm happy to proceed. However, I'm not the one who merges to this repo.