Open nyurik opened 9 months ago
A file of valid words is insufficient for typos because it doesn't coerce code to blessed words; instead it uses a list of cursed words with blessed candidates.
I'm not sure what that means, please elaborate
See https://github.com/crate-ci/typos/blob/master/crates/typos-dict/assets/words.csv for our dictionary format we use at compile time.
@epage thx, I understand about the conversion from "bad" to "good" words. What I don't understand is the workflow for the most typical use-case:
As such, the .dic files seem to be a perfect fit.
Ok, I misunderstood. You aren't asking for us to treat this as a collection of words to correct to but as a list of words we shouldn't attempt to correct. Is that right?
Exactly! Thanks :)
P.S. And of course you may consider using these words to auto-correct INTO (e.g. if I have a custom foobar, and in my code I mistype it as fobar, you MAY want to autocorrect / suggest foobar as the "right" spelling)
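To make the foobar / fobar idea concrete, here is a minimal sketch using Python's standard difflib. It is only an illustration of the suggestion, not how typos works; the function name `suggest` and the 0.8 cutoff are assumptions made for this example.

```python
import difflib


def suggest(word, project_words, cutoff=0.8):
    """Return the closest project word, or None if the word is already
    known or nothing is similar enough (hypothetical helper)."""
    if word in project_words:
        return None  # known words are never "corrected"
    matches = difflib.get_close_matches(
        word, sorted(project_words), n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


project_words = {"foobar", "typos"}
print(suggest("fobar", project_words))   # foobar
print(suggest("foobar", project_words))  # None
```

A fuzzy match like this can produce false positives on short words, which is one reason to treat it as an optional suggestion rather than an automatic fix.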
Is there a spec for this format?
Can you link to examples of where open source projects use these files with descriptions of how they are used?
I am not certain there is an official "spec"; like .csv, it has some variants and is not perfectly standardized. For example, UTF-8 seems to be a relatively "recent" change, while many programs still treat those files as being in their language's own encoding (i.e. whatever encoding was common for the language of the dictionary). A quick search showed these:
P.S. I think this is the best documentation page I found: https://proofingtoolgui.org/proofingtoolgui_files/ProofingToolGUI_manual_V30.html
Looks like .dic files are not standalone but require a .aff file to interpret them to get derived forms of words (different suffixes, prefixes).
At this point, I'm going to step back and restart the conversation. Can you describe the problem being addressed (.dic files are a solution), what your proposed solution is, and ideally prior art for that solution?
My understanding was that .aff is "optional" - i.e. initially (from the old Lotus Notes days(?)), a .dic was a simple list of words, one word per line. Later, LibreOffice/hunspell expanded that to support optional <word>/<flag> notation. Those flags are for advanced usage, and may require additional .aff files. TBH, I never even heard of the .aff files until today - but I did see some .dic files stored in various projects a while back - as simple lists of words.
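As a rough sketch of the "simple list" reading described above, the following Python treats a .dic-style file as one word per line and drops anything after a `/`. The function name `parse_word_list` is hypothetical, and skipping a leading numeric line is an assumption based on Hunspell dictionaries commonly starting with an entry count.

```python
def parse_word_list(text):
    """Parse .dic-style text into a set of words, ignoring any /flags.

    Assumption: a first line made only of digits is a Hunspell-style
    entry count rather than a word.
    """
    words = set()
    for index, line in enumerate(text.splitlines()):
        entry = line.strip()
        if not entry:
            continue
        if index == 0 and entry.isdigit():
            continue  # skip the optional entry count
        word = entry.split("/", 1)[0]  # drop optional <flag> part
        if word:
            words.add(word)
    return words


sample = "3\nfoobar\nwidget/S\ntokio\n"
print(sorted(parse_word_list(sample)))  # ['foobar', 'tokio', 'widget']
```

Note that dropping the flags means derived forms (e.g. a plural implied by a flag) are not recognized unless they are listed explicitly.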
Now, to the main question of what I would like solved:
I would like to have a very easy, minimal, no-frills way to store a custom list of words per project. I have done many PRs for big FOSS projects doing spell checking - e.g. using IntelliJ's spellchecking tool to go through the code. As part of that process, I often have thousands (!!!) of words that are custom to each project, and I have to go through them one by one, "accepting" them into the dictionary. This is an extremely tedious and boring task, and I would much rather have a tool that lists all suspicious words into a plain text file, so that I can sort it and quickly read through it to delete any words that are likely spelling mistakes. Whatever is left is my new "project dictionary" - a file I can check into the project. The dictionary file should not have any structure, because unstructured files are much easier to work with when they get fairly large -- no spaces or commas or quotes or escapes, no mandatory wrapping braces, easy to edit, easy to sort the whole file if needed, easy to diff between multiple files, easy to load with libreoffice to do some multi-file meshes or lookups, etc.
P.S. A few times I had to even manually create this file out of the code by concatenating needed code files, replace all \s+ with \n, remove all [^a-zA-Z], and later converting this simple .dic-like file into a massively painful XML file that IntelliJ was using internally for its dictionary.
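The manual steps in that P.S. (split on \s+, strip everything matching [^a-zA-Z], collect what remains) can be sketched in a few lines of Python. The helper names `words_from_text` and `candidate_words` are made up for this illustration, and sorting and de-duplicating the result is an added convenience rather than part of the described steps.

```python
import re
from pathlib import Path


def words_from_text(text):
    """Split on whitespace, strip non-letters, return sorted unique words."""
    words = set()
    for token in re.split(r"\s+", text):
        word = re.sub(r"[^a-zA-Z]", "", token)
        if word:
            words.add(word)
    return sorted(words)


def candidate_words(paths):
    """Concatenate the given source files and extract candidate words."""
    text = "\n".join(
        Path(p).read_text(encoding="utf-8", errors="ignore") for p in paths
    )
    return words_from_text(text)


print(words_from_text("let foo_bar = baz2;\n\tqux"))
# ['baz', 'foobar', 'let', 'qux']
```

As the example output shows, stripping non-letters glues identifier parts together (foo_bar becomes foobar), so the resulting list still needs the manual review described above.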
Those flags are for advanced usage, and may require additional .aff files.
Looks like those are used by both your wooorm and LibreOffice links. This is an example of why I wanted to step back: I want to understand your request and how people today are using these files to fulfill it, so I can tell whether you are asking for us to support LibreOffice dic files or whether there are uses that rely on a common subset. It also didn't help that when I searched on my own for the referenced Chromium dic file, I accidentally ended up in a dict file which had a different format.
- but I did see some .dic files stored in various projects a while back - as simple lists of words.
Would you be able to find those and link to them? I'd like to see how projects are using them in practice.
A part of all of this is that we have a way to define blessed words, so an important part of this is "why do we need something different". Prior art / meeting existing projects where they are at is important. This also helps guide discussions on auto-discovery vs specified paths in config, single or multiple files, etc.
P.S. A few times I had to even manually create this file out of the code by concatenating needed code files, replace all \s+ with \n, remove all [^a-zA-Z], and later converting this simple .dic-like file into a massively painful XML file that IntelliJ was using internally for its dictionary.
I wonder if typos --words would help :)
Speaking of, I assume we would want to support specifying these for both words and identifiers.
(I found it with a simple github search https://github.com/search?q=path%3A*.dic&type=code )
Looks like tokio is using cargo spellcheck, which seems aimed to support some of the more advanced features of .dic files, see https://github.com/drahnr/cargo-spellcheck/blob/master/docs/remedy.md#missing-word-variants
Sure - advanced usages are always possible -- once the simple cases are solved. They mention /S to keep the dictionary small - a nice to have but not a big deal to add both cases - singular and plural - if needed.
I can confirm that the good enough solution is to provide a file with known words.
My use case: the code uses non-English "business" words. I already maintain a file with these valid words (it is in fact a .dic file). The singular and plural forms are not a problem (actually there are also dozens of grammar cases), because I can include these words several times if needed (in various grammar cases). Note that I do not use an .aff file at all.
The lack of this feature prevents me from using this tool in pre-commit checks in some of our projects. Generating the extend-words config field from the .dic file would probably also solve my problem, but this would require writing a custom script. The ability to include a simple "known words" file would be a much cleaner and more convenient solution.
For us to say we are supporting a format and then only supporting a fraction of it feels like it would be setting invalid expectations for users.
I looked around and am not seeing other tools implement this.
cspell only discusses it in passing in streetsidesoftware/cspell#4942
codespell makes no reference to a specific format but does have an "ignore file" with a line per word and a custom dictionary format
scspell uses a modified format with headers for saying what the valid words apply to, e.g. their own dict
With all of that said, the fact that we have native support for words makes this a lower priority for me resolving.
@epage I understand your desire to have an "ideal" solution (nothing wrong with that :) ) - my point with this ticket is that in my experience, the most common need is plain-text .dic files of word lists, not the fancier functionality with a significantly higher barrier to entry. Please make it simple for the common use case, and then eventually other use cases might also be implemented.
I'm not shooting for an ideal; I just don't want a lie.
With all of that said, the fact that we have native support for words makes this a lower priority for me resolving.
So the current workaround is to place the .dic content in the default.extend-words configuration (from the docs: "When the correction is the key, the word is always valid") - am I correct?
Would it be possible to extend the configuration to accept a path to such a file? (I would like to not pollute my pyproject.toml with generated content)
I think the format itself is not so important, and the solution in codespell is what I am looking for. If it were possible to use any file, that would be even better.
For example, I found that firefox uses a .dat file for excluding custom valid words (persdict.dat):
I agree, if you think .dic is too much of a promise, let's pick a different extension. Do note that I suspect most people are not even aware of the extra functionality beyond the simple word list -- I certainly was not before this discussion -- so I feel it would be more confusing to pick a new extension than to simply implement a subset of functionality, but whatever gets us going :)
I'm also interested in the feature to be able to provide a list of words to ignore via a simple file (no matter the extension)
I would expect to be able to provide something like this via the .toml file
[files]
extend-ignore = ["ignore1.txt",".github/ignored.bar"]
Many projects like Chromium use standard .dic files to list all "known" words, i.e. those words that should NOT be corrected. Is it possible to add support for this? Or is this something already supported (I couldn't find it in the readme or code search)?
A .dic file is a simple text file with one word per line. I don't recall how capitalization is specified (i.e. must be exact, or it allows a lower-cased word in the .dic file to be in upper-case to be ignored, but not the other way around).