crate-ci / typos

Source code spell checker
Apache License 2.0

Support plain text .dic dictionary files #931

Open nyurik opened 7 months ago

nyurik commented 7 months ago

Many projects like Chromium use standard .dic files to list all "known" words, i.e. those words that should NOT be corrected. Is it possible to add support for this? Or is this something already supported (I couldn't find it in the readme or code search)

A .dic file is a simple text file with one word per line. I don't recall how capitalization is specified (i.e. must be exact, or it allows a lower-cased word in the .dic file to be in upper-case to be ignored, but not the other way around).

epage commented 7 months ago

A file of valid words is insufficient for typos because typos doesn't coerce code to a set of blessed words; instead, it works from a list of cursed words, each with blessed candidates.

nyurik commented 7 months ago

I'm not sure what that means, please elaborate

epage commented 7 months ago

See https://github.com/crate-ci/typos/blob/master/crates/typos-dict/assets/words.csv for our dictionary format we use at compile time.

nyurik commented 7 months ago

@epage thx, I understand about the conversion from "bad" to "good" words. What I don't understand is the workflow for the most typical use-case:

As such, the .dic files seem to be a perfect fit.

epage commented 7 months ago

Ok, I misunderstood. You aren't asking for us to treat this as a collection of words to correct to but as a list of words we shouldn't attempt to correct. Is that right?

nyurik commented 7 months ago

Ok, I misunderstood. You aren't asking for us to treat this as a collection of words to correct to but as a list of words we shouldn't attempt to correct. Is that right?

Exactly! Thanks :)

nyurik commented 7 months ago

P.S. And of course you may consider using these words to auto-correct INTO (e.g. if I have a custom foobar, and in my code I mistype it as fobar, you MAY want to autocorrect / suggest foobar as the "right" spelling)
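A minimal sketch of that idea, assuming a plain list of known words and using Python's standard difflib for fuzzy matching. This is not how typos works (it uses a fixed list of misspellings with corrections); the suggest helper and the 0.8 cutoff are hypothetical choices for illustration only.

```python
import difflib


def suggest(word, known_words, cutoff=0.8):
    """Return the closest known word, or None if nothing is similar enough.

    `cutoff` is an arbitrary similarity threshold chosen for this sketch.
    """
    if word in known_words:
        return None  # already a known word, nothing to suggest
    matches = difflib.get_close_matches(word, known_words, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical project dictionary.
known = ["foobar", "tokio", "hunspell"]
print(suggest("fobar", known))   # a near miss of a known word
print(suggest("foobar", known))  # an exact match needs no suggestion
```

Fuzzy matching of this kind can produce false positives on short words, which is one reason a tool might prefer an explicit list of corrections.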

epage commented 7 months ago

Is there a spec for this format?

Can you link to examples of where open source projects use these files with descriptions of how they are used?

nyurik commented 7 months ago

I am not certain there is an official "spec"; it is similar to .csv (some variants, not perfectly standardized) -- e.g. it seems UTF-8 is a relatively "recent" change to it, while many programs still treat those files as being in their language's own encoding (i.e. whatever encoding was commonly used for the language of the dictionary). A quick search showed these:

nyurik commented 7 months ago

P.S. I think this is the best documentation page I found: https://proofingtoolgui.org/proofingtoolgui_files/ProofingToolGUI_manual_V30.html

epage commented 7 months ago

Looks like .dic files are not standalone but require a .aff file to interpret them to get derived forms of words (different suffixes, prefixes).

At this point, I'm going to step back and restart the conversation. Can you describe the problem being addressed (.dic files are a solution), what your proposed solution is, and ideally prior art for that solution?

nyurik commented 7 months ago

My understanding was that .aff is "optional" - i.e. initially (from the old Lotus Notes days(?)), a .dic was a simple list of words, one word per line. Later, LibreOffice/hunspell expanded that to support optional <word>/<flag> notation. Those flags are for advanced usage, and may require additional .aff files. TBH, I never even heard of the .aff files until today - but I did see some .dic files stored in various projects a while back - as simple lists of words.

Now, to the main question of what I would like solved:

I would like to have a very easy, minimal, no-frills way to store a custom list of words per project. I have done many PRs for big FOSS projects doing spell checking - e.g. using IntelliJ's spellchecking tool to go through the code. As part of that process, I often have thousands (!!!) of words that are custom to each project, and I have to go through them one by one, "accepting" them into the dictionary. This is an extremely tedious and boring task, and I would much rather have a tool that lists all suspicious words in a plain text file, so I can sort it and quickly read through it to delete any words that are likely spelling mistakes. Whatever is left is my new "project dictionary" - a file I can check into the project. The dictionary file should not have any structure, because unstructured files are much easier to work with when they get fairly large -- no spaces or commas or quotes or escapes, no mandatory wrapping braces, easy to edit, easy to sort the whole file if needed, easy to diff between multiple files, easy to load with LibreOffice to do some multi-file meshes or lookups, etc.

P.S. A few times I had to even manually create this file out of the code by concatenating needed code files, replace all \s+ with \n, remove all [^a-zA-Z], and later converting this simple .dic-like file into a massively painful XML file that IntelliJ was using internally for its dictionary.
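The manual extraction described above could be sketched roughly as follows. It follows the stated steps literally (split on \s+, strip [^a-zA-Z]), so an identifier like foo_bar collapses into foobar; the function name candidate_words and the command-line usage are assumptions made for this sketch.

```python
import re
import sys
from pathlib import Path


def candidate_words(text):
    """Split on whitespace, drop every non-letter, return sorted unique words."""
    words = set()
    for token in re.split(r"\s+", text):
        word = re.sub(r"[^a-zA-Z]", "", token)
        if word:  # tokens made only of punctuation or digits become empty
            words.add(word)
    return sorted(words)


if __name__ == "__main__":
    # Concatenate all files given on the command line, then print one word per line.
    text = "\n".join(
        Path(p).read_text(encoding="utf-8", errors="replace") for p in sys.argv[1:]
    )
    print("\n".join(candidate_words(text)))
```

The output is a sorted, one-word-per-line list that can be reviewed by hand and trimmed down to a project dictionary.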

epage commented 7 months ago

Those flags are for advanced usage, and may require additional .aff files.

Looks like those are used by both your wooorm and LibreOffice links. This is an example of why I wanted to step back: I want to understand your request and how people are using these files today, so I can tell whether you are asking us to support LibreOffice dic files or whether there is a common subset in use. It also didn't help that when I searched on my own for the referenced Chromium dic file, I accidentally ended up in a dict file which had a different format.

but I did see some .dic files stored in various projects a while back - as simple lists of words.

Would you be able to find those and link to them? I'd like to see how projects are using them in practice.

A part of all of this is that we have a way to define blessed words, so an important part of this is "why do we need something different". Prior art / meeting existing projects where they are at is important. This also helps guide discussions on auto-discovery vs specified paths in config, single or multiple files, etc.

P.S. A few times I had to even manually create this file out of the code by concatenating needed code files, replace all \s+ with \n, remove all [^a-zA-Z], and later converting this simple .dic-like file into a massively painful XML file that IntelliJ was using internally for its dictionary.

I wonder if typos --words would help :)

Speaking of, I assume we would want to support specifying these for both words and identifiers.

nyurik commented 7 months ago

Tokio project :) https://github.com/tokio-rs/tokio/blob/0fbde0e94b06536917b6686e996856a33aeb29ee/spellcheck.dic

nyurik commented 7 months ago

(I found it with a simple github search https://github.com/search?q=path%3A*.dic&type=code )

epage commented 7 months ago

Looks like tokio is using cargo spellcheck which seems aimed to support some of the more advanced features of .dic files, see https://github.com/drahnr/cargo-spellcheck/blob/master/docs/remedy.md#missing-word-variants

nyurik commented 7 months ago

Sure - advanced usages are always possible -- once the simple cases are solved. They mention /S to keep the dictionary small - a nice to have but not a big deal to add both cases - singular and plural - if needed.
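For the simple case, reading a .dic file as a plain word list might look like the sketch below. It assumes the Hunspell-style layout discussed in this thread: an optional leading line holding an entry count, and optional /FLAGS suffixes, which are discarded here rather than expanded through an .aff file. The name read_plain_words is hypothetical, and escaped slashes or morphological fields are not handled.

```python
def read_plain_words(lines):
    """Return the bare words from .dic-style lines, ignoring any /FLAGS suffix."""
    words = []
    for index, line in enumerate(lines):
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if index == 0 and entry.isdigit():
            continue  # a leading number is treated as the entry count
        word = entry.split("/", 1)[0]  # drop affix flags such as /S
        if word:
            words.append(word)
    return words


print(read_plain_words(["3", "tokio", "runtime/S", "", "mutex"]))
```

Because the flags are dropped, derived forms such as plurals would have to be listed explicitly, as suggested above.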

ostr00000 commented 7 months ago

I can confirm that the good enough solution is to provide a file with known words.

My use case: the code contains non-English "business" words. I already maintain a file with these valid words (it is in fact a .dic file). The singular and plural forms are not a problem (actually there are also dozens of grammar cases), because I can include these words several times if needed (in various grammar cases). Note that I do not use an .aff file at all.

Lack of this feature prevents me from using this tool in pre-commit checks in some of our projects. Generating the extend-words config field from a .dic file would probably also solve my problem, but this would require writing a custom script. Instead, the ability to include a simple "known words" file would be a much cleaner and more convenient solution.
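Such a custom script could be fairly small. The sketch below turns a list of known words into a default.extend-words table in which each word maps to itself, which the typos docs describe as marking the word as always valid. The function name to_extend_words and the table parameter are assumptions; a pyproject.toml setup would presumably need a different table name.

```python
import json


def to_extend_words(words, table="default.extend-words"):
    """Render known words as a TOML table mapping each word to itself."""
    lines = [f"[{table}]"]
    for word in sorted(set(words)):
        # JSON string quoting doubles as a TOML basic string for ordinary words;
        # ensure_ascii=False keeps non-English words readable as UTF-8.
        quoted = json.dumps(word, ensure_ascii=False)
        lines.append(f"{quoted} = {quoted}")
    return "\n".join(lines) + "\n"


print(to_extend_words(["tokio", "foobar", "tokio"]), end="")
```

The generated table still has to live in the config file, so this only works around the missing feature rather than replacing a standalone word-list file.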

epage commented 6 months ago

For us to say we are supporting a format and then only supporting a fraction of it feels like it would be setting invalid expectations for users.

I looked around and am not seeing other tools implement this. cspell only discusses it in passing in streetsidesoftware/cspell#4942

codespell makes no reference to a specific format but does have an "ignore file" with a line per word and a custom dictionary format

scspell uses a modified format with headers that say what the valid words apply to, e.g. their own dict

epage commented 6 months ago

With all of that said, the fact that we have native support for words makes this a lower priority for me resolving.

nyurik commented 6 months ago

@epage I understand your desire to have an "ideal" solution (nothing wrong with that :) ) - my point in this ticket is that, in my experience, the most common need is plain-text .dic files containing word lists, not the fancier functionality with a significantly higher barrier to entry. Please make it simple for the common use case, and then eventually other use cases might also be implemented.

epage commented 6 months ago

I'm not shooting for an ideal; I just don't want a lie.

ostr00000 commented 6 months ago

With all of that said, the fact that we have native support for words makes this a lower priority for me resolving.

So the current workaround is to place the .dic content in the default.extend-words configuration (from docs: When the correction is the key, the word is always valid) - am I correct?

"ignore file" with a line per word

Would it be possible to extend the configuration to accept a path to such a file? (I would rather not pollute my pyproject.toml with generated content)

I think the format itself is not so important, and the solution in codespell is what I am looking for. If it were possible to use any file, that would be even better.
For example, I found that Firefox uses a .dat file for excluding custom valid words (persdict.dat):

nyurik commented 6 months ago

I agree, if you think .dic is too much of a promise, let's pick a different extension. Do note that I suspect most people are not even aware of the extra functionality beyond the simple word list -- I certainly was not before this discussion -- so I feel it would be more confusing to pick a new extension than to simply implement a subset of functionality, but whatever gets us going :)

ccoVeille commented 2 months ago

I'm also interested in the feature to be able to provide a list of words to ignore via a simple file (no matter the extension)

I would expect to be able to provide something like this via the .toml file

[files]
extend-ignore = ["ignore1.txt",".github/ignored.bar"]