errata-ai / vale

:pencil: A markup-aware linter for prose built with speed and extensibility in mind.
https://vale.sh
MIT License
4.3k stars 144 forks source link

Long line of Chinese characters in frontmatter causes crash #569

Closed iilyak closed 1 year ago

iilyak commented 1 year ago

Context

I wanted to write a script extension which makes sure the order of items in the numbered list (in the markdown document) is in the correct order. I've noticed odd behavior. The line numbers reported where pointing to the next line instead of the line which was added into matches using matches = append(matches, {begin: begin, end: end}). I had a theory that it could be caused by a presence of a front-mater. So I've decided to replicate the script from vale's own test suite https://github.com/errata-ai/vale/blob/5c5c0487bc4160db875f2601b4e133bab936594d/testdata/styles/Scripts/Test.yml and use it to repro the case.

Symptom

When I tried to reproduce I witness a crash. The crash didn't happen when I remove Chinese characters.

❯ cat styles/Test/Test.yml
extends: script
message: "Consider inserting a new section heading at this point."
link: https://tengolang.com/
# The unprocessed file contents.
#
# We need this to access heading markup.
scope: raw
script: |
  text := import("text")
  matches := []
  p_limit := 3 // at most 3 paragraphs per section
  // Remove frontmatter
  documentMinusFrontmatter := text.re_replace("(?s)^-{3}.*?-{3}", scope, "")
  // Remove all instances of code blocks since we don't want to count
  // inter-block newlines as a new paragraph.
  document := text.re_replace("(?s) *(\n```.*?```\n)", documentMinusFrontmatter, "")

  count := 0
  for line in text.split(document, "\n") {
    if text.has_prefix(line, "#") {
      count = 0 // New section; reset count
    } else if count > p_limit {
      start := text.index(scope, line)
      matches = append(matches, {begin: start, end: start + len(line)})
      count = 0
    } else if text.trim_space(line) == "" {
      count += 1
    }
  }

The document

❯ cat test/test.md
---
copyright:
  years: 2015, 2023
lastupdated: "2023-01-11"

keywords: 你好,你好,你好,你好,你好,你好,你好你好你好你好你好你好你好你好你好你好你好你好你好你好你好,你好,你好,你好,你好,你好,你好你好你好你好你好你好你好你好你好你好你好你好你好你好

subcollection: test

content-type: release-note
---

# Intro

The Flesch Reading Ease gives a text a score between 1 and 100, with 100 being
the highest readability score. Scoring between 70 to 80 is equivalent to school
grade level 8. This means text should be fairly easy for the average adult to
read.

The formula was developed in the 1940s by Rudolf Flesch. He was a consultant
with the Associated Press, developing methods for improving the readability of
newspapers.

Now, over 70 years later, the Flesch Reading Ease is used by marketers,
research communicators and policy writers, amongst many others. All use it to
help them assess the ease by which a piece of text will be understood and
engaged with.

foo bar baz

## Section

Here's another.

Yep, here too.

And again.

Yis is bad.

Stacktrace

vale test --minAlertLevel=warning
panic: runtime error: slice bounds out of range [:1054] with capacity 1024

goroutine 7 [running]:
github.com/errata-ai/vale/v2/internal/check.re2Loc(...)
        github.com/errata-ai/vale/v2/internal/check/definition.go:129
github.com/errata-ai/vale/v2/internal/check.makeAlert({{{0x0, 0x0}, {0x0, 0x0, 0x0}}, {0x0, 0x0}, {0x1400249bc00, 0x6}, {0x104d24966, ...}, ...}, ...)
        github.com/errata-ai/vale/v2/internal/check/definition.go:133 +0x19c
github.com/errata-ai/vale/v2/internal/check.Script.Run({{{{0x0, 0x0}, {0x0, 0x0, 0x0}}, {0x0, 0x0}, {0x1400249bc00, 0x6}, {0x104d24966, ...}, ...}, ...}, ...)
        github.com/errata-ai/vale/v2/internal/check/script.go:61 +0x538
github.com/errata-ai/vale/v2/internal/lint.(*Linter).lintBlock(0x14003599768, 0x140026ea160, {{0x14002d44000, 0x464}, 0xffffffffffffffff, {0x14002957640, 0x7}, {0x14002d44000, 0x464}}, 0x3c4?, ...)
        github.com/errata-ai/vale/v2/internal/lint/lint.go:229 +0x174
github.com/errata-ai/vale/v2/internal/lint.Linter.lintSizedScopes({{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, 0x14002c45e40, 0x140024bec48, 0x10554e880, 0x1, ...}, ...)
        github.com/errata-ai/vale/v2/internal/lint/ast.go:193 +0x30c
github.com/errata-ai/vale/v2/internal/lint.Linter.lintHTMLTokens({{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, 0x14002c45e40, 0x140024bec48, 0x10554e880, 0x1, ...}, ...)
        github.com/errata-ai/vale/v2/internal/lint/ast.go:147 +0x7e0
github.com/errata-ai/vale/v2/internal/lint.Linter.lintMarkdown({{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, 0x14002c45e40, 0x140024bec48, 0x10554e880, 0x1, ...}, ...)
        github.com/errata-ai/vale/v2/internal/lint/md.go:60 +0x1c0
github.com/errata-ai/vale/v2/internal/lint.(*Linter).lintFile(0x1400003a190, {0x14002624020?, 0x1400261a2a0?})
        github.com/errata-ai/vale/v2/internal/lint/lint.go:169 +0x26c
github.com/errata-ai/vale/v2/internal/lint.(*Linter).lintFiles.func1.1.1({0x14002624020?, 0x1?})
        github.com/errata-ai/vale/v2/internal/lint/lint.go:114 +0x4c
created by github.com/errata-ai/vale/v2/internal/lint.(*Linter).lintFiles.func1.1
        github.com/errata-ai/vale/v2/internal/lint/lint.go:112 +0x1e8
iilyak commented 1 year ago

The line numbers reported where pointing to the next line instead of the line which was added into matches using matches = append(matches, {begin: begin, end: end}). I had a theory that it could be caused by a presence of a front-mater.

This problem is also caused by the presence of Chinese characters in the frontmatter. Once I removed the Chinese from frontmatter the vale reports correct line numbers.

iilyak commented 1 year ago

By looking at a code (in core/location.go, check/script.go) I don't see a place where loc reported from a script is adjusted to account for modifications to the text done inside the script. Which means if script modifies the text (by removing sections or codeblocks or anything else) the script itself is responsible for maintaining the correlation of line numbers in original text with modified text. Am I correct here?

iilyak commented 1 year ago

Since line numbers are not working and correct matches cause a crash (and I need to validate a big corpus of texts) I was trying to disable the match by providing bogus positions.

matches = append(matches, {begin: 0, end: 0})

I also tried

The result is the same. I observer a crash like the following every time Chinese characters are present.

panic: runtime error: slice bounds out of range [:4709] with capacity 4096

goroutine 7 [running]:
github.com/errata-ai/vale/v2/internal/check.re2Loc(...)
        github.com/errata-ai/vale/v2/internal/check/definition.go:129
github.com/errata-ai/vale/v2/internal/check.makeAlert({{{0x0, 0x0}, {0x0, 0x0, 0x0}}, {0x0, 0x0}, {0x14002f6d7b8, 0x6}, {0x1010e8966, ...}, ...}, ...)
        github.com/errata-ai/vale/v2/internal/check/definition.go:133 +0x19c
github.com/errata-ai/vale/v2/internal/check.Script.Run({{{{0x0, 0x0}, {0x0, 0x0, 0x0}}, {0x0, 0x0}, {0x14002f6d7b8, 0x6}, {0x1010e8966, ...}, ...}, ...}, ...)
        github.com/errata-ai/vale/v2/internal/check/script.go:61 +0x538
github.com/errata-ai/vale/v2/internal/lint.(*Linter).lintBlock(0x14003be5768, 0x14003fd0000, {{0x14005a10000, 0x1bef}, 0xffffffffffffffff, {0x140057ea6e8, 0x7}, {0x14005a10000, 0x1bef}}, 0x100e13bb8?, ...)
        github.com/errata-ai/vale/v2/internal/lint/lint.go:229 +0x174
github.com/errata-ai/vale/v2/internal/lint.Linter.lintSizedScopes({{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, 0x14002da9240, 0x14003016d50, 0x101912880, 0x1, ...}, ...)
        github.com/errata-ai/vale/v2/internal/lint/ast.go:193 +0x30c
github.com/errata-ai/vale/v2/internal/lint.Linter.lintHTMLTokens({{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, 0x14002da9240, 0x14003016d50, 0x101912880, 0x1, ...}, ...)
        github.com/errata-ai/vale/v2/internal/lint/ast.go:147 +0x7e0
github.com/errata-ai/vale/v2/internal/lint.Linter.lintMarkdown({{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, 0x14002da9240, 0x14003016d50, 0x101912880, 0x1, ...}, ...)
        github.com/errata-ai/vale/v2/internal/lint/md.go:60 +0x1c0
github.com/errata-ai/vale/v2/internal/lint.(*Linter).lintFile(0x1400018e140, {0x14002a82360?, 0x0?})
        github.com/errata-ai/vale/v2/internal/lint/lint.go:169 +0x26c
github.com/errata-ai/vale/v2/internal/lint.(*Linter).lintFiles.func1.1.1({0x14002a82360?, 0x0?})
        github.com/errata-ai/vale/v2/internal/lint/lint.go:114 +0x4c
created by github.com/errata-ai/vale/v2/internal/lint.(*Linter).lintFiles.func1.1
        github.com/errata-ai/vale/v2/internal/lint/lint.go:112 +0x1e8
jdkato commented 1 year ago

By looking at a code (in core/location.go, check/script.go) I don't see a place where loc reported from a script is adjusted to account for modifications to the text done inside the script. Which means if script modifies the text (by removing sections or codeblocks or anything else) the script itself is responsible for maintaining the correlation of line numbers in original text with modified text. Am I correct here?

No, the script just has to report locations accurate to itself. In other words, script_text[start:end] needs to match what you intend the alert to be about.

Vale handles the location adjustments internally.