errata-ai / vale

:pencil: A markup-aware linter for prose built with speed and extensibility in mind.
https://vale.sh
MIT License

Spellcheck is testing substrings of URL #185

Closed amyq closed 4 years ago

amyq commented 4 years ago

After upgrading from 1.7.1 to 2.1.0, Vale is splitting the contents of URLs apart on word boundaries and testing the individual words in spell-check. I've been reading over the spelling style description to see if there were any scope parameters we've been missing, and I don't think there are.

The branch is docs-aqualls-vale-spelling, and the Spelling.yml file contains:

extends: spelling
message: 'Spelling check: "%s"?'
level: warning
ignore:
  - gitlab/spelling-exceptions.txt

(Note that in the branch it's warning but I'm modifying it locally to error so the problems are easier to see.)

Here's the command I run, and the results:

$ vale --no-wrap --minAlertLevel error doc/administration/troubleshooting/gitlab_rails_cheat_sheet.md

 doc/administration/troubleshooting/gitlab_rails_cheat_sheet.md
 288:82  error  Spelling check: "dev"?     gitlab.Spelling
 675:29  error  Spelling check: "gitlab"?  gitlab.Spelling
 675:70  error  Spelling check: "ee"?      gitlab.Spelling
 693:38  error  Spelling check: "gitlab"?  gitlab.Spelling
 736:5   error  Spelling check: "ee"?      gitlab.Spelling

When I look at gitlab_rails_cheat_sheet.md:

# Line 675
Features listed in <https://gitlab.com/gitlab-org/gitlab/blob/master/ee/app/models/license.rb>.
# Line 693
From [Zendesk ticket #91083](https://gitlab.zendesk.com/agent/tickets/91083) (internal)

At first I thought it was the <…> format of the links, but I've tried converting them over to the more standard Markdown syntax of [link word](url) and I'm still getting the same issue.

Ideas?

jdkato commented 4 years ago

The <…> links are the culprit here: Vale lints the title of a link (the part enclosed in [...]) but not the link itself ((...)). However, a link like

<https://gitlab.com/gitlab-org/gitlab/blob/master/ee/app/models/license.rb>

is shorthand for

[https://gitlab.com/gitlab-org/gitlab/blob/master/ee/app/models/license.rb](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/app/models/license.rb)
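The equivalence above can be made visible with a small Python sketch. This is only an illustration of the Markdown shorthand, not Vale's implementation; the `AUTOLINK` pattern and the `expand_autolinks` helper are hypothetical names introduced here.

```python
import re

# Hypothetical pattern: matches a Markdown autolink such as <https://example.com/path>.
AUTOLINK = re.compile(r"<(https?://[^\s<>]+)>")


def expand_autolinks(text: str) -> str:
    """Rewrite <url> as [url](url) so the implied link text becomes visible."""
    return AUTOLINK.sub(lambda m: f"[{m.group(1)}]({m.group(1)})", text)


line = "Features listed in <https://gitlab.com/gitlab-org/gitlab/blob/master/ee/app/models/license.rb>."
print(expand_autolinks(line))
```

In the expanded form, the URL sits inside the [...] part, which is the part that gets linted.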

amyq commented 4 years ago

I totally believe you. I just discovered something odd that's related: if I fix only one of the links in gitlab_rails_cheat_sheet.md and re-save the file, all the errors remain. I have to fix every occurrence of this style of link on the page, and then all of the errors disappear at once. It's definitely odd behavior.

jdkato commented 4 years ago

I can't reproduce the behavior you're describing. And since the linting results for <...> are expected, I'm going to consider this resolved.

If you feel I've missed something, though, feel free to comment further.

janisz commented 4 years ago

I have the same error. If a Markdown file contains HTML tags (not only links), Vale parses the links inside them and produces errors.

e.g.:

<p>The quick brown [fox](https://fox.com) jumps over the lazy dog</p>

results in

 1:26  error  Did you really mean 'https'?  Vale.Spelling 

janisz commented 4 years ago

https://github.com/errata-ai/vale/issues/231

NicolasMassart commented 4 years ago

I also have the issue when writing global link labels like:

[id]: https://example.com/

Vale returns the error Did you really mean 'https'? Vale.Spelling

However, as far as I know, there's no alternative syntax here. So this looks like a Vale bug to me. Bugs happen, but I don't think this is the expected result in this case.

jdkato commented 4 years ago

@NicolasMassart: Can you share a file that exhibits this behavior?

Something like

# Link Definitions

In this paragraph, we use our [link definition][id].

[id]: https://example.com/

doesn't reproduce what you're seeing for me (v2.4.0).

NicolasMassart commented 4 years ago

@jdkato thanks for your help on this one. Here is the exact full content of the file I have issues with (no secrets, it's a public repo). I also tried running Vale on this file alone to make sure.

# EthSigner documentation [![Documentation Status](https://readthedocs.org/projects/ethsigner/badge/?version=stable)](https://docs.ethsigner.pegasys.tech/en/stable/?badge=stable)

[EthSigner] is a transaction signing application to be used with a web3 provider.

The software sources are hosted in https://github.com/PegaSysEng/ethsigner

This repository only contains the sources for [EthSigner documentation website hosted by ReadTheDocs].

This repository uses a Git submodule. Please refer to the [common tools wiki] for explanation about
how to build and contribute to this documentation.

[EthSigner]: https://github.com/PegaSysEng/ethsigner
[common tools wiki]: https://github.com/PegaSysEng/doc.common/wiki
[EthSigner documentation website hosted by ReadTheDocs]: https://docs.ethsigner.pegasys.tech/

The full error output is:

 README.md
 3:55   warning  'be used' may be passive        write-good.Passive
                 voice. Use active voice if you
                 can.
 5:22   warning  'are hosted' may be passive     write-good.Passive
                 voice. Use active voice if you
                 can.
 13:22  error    Did you really mean 'https'?    Vale.Spelling
 14:71  error    Use 'EthSigner' instead of      Vale.Terms
                 'ethsigner'.

Note that for the Vale.Terms error, I have the word EthSigner in the vocab accept.txt file, but Vale should not check for the word inside a link anyway. I also tested different ways of writing the link references in the text: [EthSigner] and [EthSigner][] both give the same error. Let me know if you need more detail. Thanks.

jdkato commented 4 years ago

The problem is line 5:

The software sources are hosted in https://github.com/PegaSysEng/ethsigner

The "link" here is syntactically just text, which is going to be linted. You could write it as something along the lines of

The software sources are hosted in [`https://github.com/PegaSysEng/ethsigner`][EthSigner]

which won't be linted since the link text is a code span.
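For a file with many bare URLs, the rewrite could be scripted. The following Python sketch is one possible approach under stated assumptions, not something Vale provides: `BARE_URL`, `REF_DEF`, and `wrap_bare_urls` are hypothetical names, the regex is a rough heuristic rather than a full Markdown parser, and it emits the inline form [`url`](url) rather than the reference form shown above.

```python
import re

# Heuristic: an http(s) URL that is not already preceded by "(", "<", "`" or "[",
# and that does not end in sentence punctuation.
BARE_URL = re.compile(r"(?<![(<`\[])\bhttps?://[^\s<>()`\]]+[^\s<>()`\].,;:!?]")

# Link reference definitions such as "[id]: https://example.com/" are left alone.
REF_DEF = re.compile(r"^\s*\[[^\]]+\]:\s")


def wrap_bare_urls(line: str) -> str:
    """Turn a bare URL into a link whose text is a code span."""
    if REF_DEF.match(line):
        return line
    return BARE_URL.sub(lambda m: f"[`{m.group(0)}`]({m.group(0)})", line)


print(wrap_bare_urls("The software sources are hosted in https://github.com/PegaSysEng/ethsigner"))
```

Because the heuristic works line by line, it would also rewrite URLs inside fenced code blocks, so the output is worth reviewing before committing.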

NicolasMassart commented 4 years ago

Interesting. The issues were reported on lines 13 and 14, so I thought the problem was the links at the bottom. We currently use a Markdown plugin that allows bare URLs like that, and we configured the linter to ignore bare URLs. Is there a way for Vale to detect that some strings inside the text are actually URLs? Otherwise, it seems that using Vale prevents this kind of syntax even though all our other tools provide a way to allow it. Could we use only <http://link> instead of a bare link?

NicolasMassart commented 4 years ago

OK, with this it works now. I removed the ability to create "bare URLs" on this project. Just one last question for you, though: why were the reported line numbers wrong? Thanks for your help.

prcr commented 3 years ago

I'm also having some trouble with Vale reporting issues in URLs for Markdown links. On the README.md of my project:

[...]

## What is Codacy

[Codacy](https://www.codacy.com/) is an automated code review tool that monitors your technical debt, helps you improve your code quality, teaches best practices to your developers, and helps you save time in code reviews.

[...]

When I run Vale:

$ vale README.md 

 README.md
[...]
 19:4   error    Use ''what's'' instead of       Microsoft.Contractions 
                 'What is'.  
 21:10  error    Did you really mean 'https'?    Vale.Spelling          
 21:22  error    Use 'Codacy' instead of         Vale.Terms             
                 'codacy'.                                              
[...]

✖ 4 errors, 1 warning and 0 suggestions in 1 file.

All other issues are reported correctly.

jdkato commented 3 years ago

@pauloribeiro-codacy: see https://github.com/errata-ai/vale/issues/185#issuecomment-610000763.

prcr commented 3 years ago

Thanks @jdkato, as soon as I removed the "bare link" at the top of the file, Vale stopped reporting the error on line 21. :raised_hands:

But that's what confused me. It seems that when a file contains bare links, Vale does not report the correct line numbers for the issues it finds:

image

I had seen this behavior before on other files and I got quite confused because it seemed "flaky", but now I understand and can do a workaround to avoid the spurious errors.