facelessuser / pyspelling

Spell checker automation tool
https://facelessuser.github.io/pyspelling/
MIT License

Ignoring words containing numbers, hyphens, etc. #130

Closed miff2000 closed 3 years ago

miff2000 commented 3 years ago

I have markdown files with all sorts of acronyms and technical words in them. I've added those special words to the dictionary, and that's worked for most, but with terms like L2TP, IPv6, etc. they seem to have the numbers removed from them before they are spell checked.

Is there any way I can control this? Ideally, I would like to ignore any words in their entirety if they are in the custom dictionary.

My current config is shown below. You'll see that I've added pattern exclusions for 1st, 2nd, etc. and pre-, post-, etc. to tackle those.

matrix:
- name: Markdown
  default_encoding: utf-8
  aspell:
    lang: en
    d: en_GB
  dictionary:
    encoding: utf-8
    wordlists:
    - .wordlist.txt
  pipeline:
  - pyspelling.filters.url:
  - pyspelling.filters.context:
      context_visible_first: true
      escapes: '\\[\\`~]'
      delimiters:
      # Ignore multiline content between fences (fences can have 3 or more back ticks)
      - open: '(?s)^(?P<open> *`{3,})\S+$'
        close: '^(?P=open)$'
      # Ignore text between inline back ticks
      - open: '(?P<open>(`|``))+'
        close: '(?P=open)'
      # Ignore everything after :keywords:
      - open: '(:keywords:)+'
        close: '$'
      # Ignore pre-, post- prefixes
      - open: '\W(Pre|pre|Post|post|Un|un)'
        close: '-'
      # Ignore 1st, 2nd, 3rd, etc.
      - open: '\W[0-9]+(?:st|[nr]d|th)'
        close: '\W'
      # Ignore bare URLs, surrounded in <>
      - open: '\W<\w+:\/\/'
        close: '>\W'
  - pyspelling.filters.markdown:
  - pyspelling.filters.html:
      comments: false
      ignores:
      - code
      - pre
      - a href
  sources:
  # - 'source/**/*.md'

facelessuser commented 3 years ago

I'm assuming you are using Aspell?

miff2000 commented 3 years ago

Yes, Aspell in my case

facelessuser commented 3 years ago

Let me look into it. If there is a way, it is via Aspell options. I need to double check if I've whitelisted relevant options or not (assuming there is one).

miff2000 commented 3 years ago

Amazing! Thank you! I did try taking a look myself, but I just kept finding articles from 2013 on Aspell forums, which didn't help much.

facelessuser commented 3 years ago

So, here is the issue. Currently, it is outlined here: http://aspell.net/0.60.7/man-html/Words-With-Symbols-in-Them.html.

I'll post the relevant info though:

Numbers in words present a different challenge to Aspell. If Aspell treats numbers as letters than every possible number a user might write in a document must be specified in the dictionary. This could be easily be solved by having special code to assume all numbers are correctly spelled. But what about something like "4th". Since the "th" suffix can appear after any number we are left with the same problem. The solution would be to have a special symbol for "any number".

So, this just illustrates that they don't really have a solution. Aspell breaks words on numbers, so something like IPv6 is split at the number, and you get a misspelling of IPv.

I didn't find any options in Aspell (already allowed in pyspelling or currently not allowed) that really work around the number issue.

I would consider doing something like this and using the HTML filter to filter out the specific tags:

<span class='nospell'>IPv6</span> other **Markdown**

Then ignore span.nospell.

or maybe

<nospell>IPv6</nospell>

Then ignore nospell.
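
A minimal sketch of how that could look in the pyspelling config, assuming the HTML filter's `ignores` option takes CSS selectors (as the `code` and `pre` entries in the config above suggest). The `nospell` class and tag names are arbitrary choices for illustration, and the `sources` glob is a placeholder:

```yaml
matrix:
- name: Markdown
  aspell:
    lang: en
    d: en_GB
  dictionary:
    wordlists:
    - .wordlist.txt
  pipeline:
  # Convert Markdown to HTML first so the HTML filter can see the tags.
  - pyspelling.filters.markdown:
  - pyspelling.filters.html:
      comments: false
      ignores:
      - code
      - pre
      # Assumed selector: skips <span class='nospell'>IPv6</span>
      - span.nospell
      # Assumed selector: skips the custom <nospell>IPv6</nospell> tag
      - nospell
  sources:
  # Placeholder glob; point this at your own Markdown files.
  - 'source/**/*.md'
```

Only one of the two selectors is needed, depending on which markup you wrap the terms in. Text inside the matched elements should never reach Aspell, so it is not split on the digits.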

The only other alternative is to maybe just add IPv to your personal dictionary. These are currently the main choices available with Aspell.

There may be alternatives in Hunspell, but I usually use Aspell.

facelessuser commented 3 years ago

I'm going to move this to discussions, as it isn't really a bug in pyspelling. There isn't really anything we can do, since this is specific to the spell checker being used, but I'm certain I'll be asked this again 🙂.

facelessuser commented 3 years ago

There may be a way to handle hyphens though...let me check.

(Seems it won't let me transfer the issue to "Discussions" yet).

miff2000 commented 3 years ago

That's great. They sound like perfectly good workarounds to me 😊

Thanks for looking into this, and thanks for the great software!

facelessuser commented 3 years ago

As for hyphens, this looks to be under Aspell's TODO (compound words):

Things that need to be done

These items need to be done before I consider Aspell finished. If you are interested in helping me with one of these tasks please email me. Good C++ skills are needed for most of these tasks involving coding.

- Support Hunspell features that Aspell doesn't have which prove to be usefull. Most likely:
  - Twofold suffix stripping
  - Better support for compound words. The support for conditional compound words found in Aspell versions 0.50 and earlier is no longer available since no one seams to be using it. Support for unconditional compound words will still be available. I have some ideas on the topic available here, but perhapes something compatable with how Hunspell does it will be better.
  - Maybe others.

facelessuser commented 3 years ago

Ugh, GitHub allowed converting issues to discussions, but for some reason, they removed it... 😑

Well, I'll just close this as answered.