codespell-project / codespell

check code for common misspellings
GNU General Public License v2.0

Can we skip files which end in ".pem/.crt" #2135

Open clickthisnick opened 2 years ago

clickthisnick commented 2 years ago

What are people's thoughts on skipping files that end with ".pem" and ".crt", so that certificates and things like that don't get falsely flagged by accident?

vikivivi commented 2 years ago

You might want to see "skip" in https://github.com/codespell-project/codespell/blob/master/README.rst
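For anyone landing here, a rough sketch of how a comma-separated list of skip globs could be matched against file names. This is only an illustration using Python's `fnmatch`; the `SKIP` value and the `is_skipped` helper are made up for this example, and codespell's own matching may differ in detail.

```python
from fnmatch import fnmatch

# Hypothetical skip list, written as comma-separated globs.
SKIP = "*.pem,*.crt"


def is_skipped(filename: str, skip: str = SKIP) -> bool:
    """Return True if filename matches any glob in the comma-separated list."""
    return any(fnmatch(filename, pattern) for pattern in skip.split(","))


print(is_skipped("server.pem"))  # a certificate file, matched by *.pem
print(is_skipped("README.rst"))  # an ordinary file, not matched
```

On the command line the equivalent would be something like `codespell --skip="*.pem,*.crt"`; check the README for the exact form your version supports.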

clickthisnick commented 2 years ago

ya that's what we are doing - didn't know if the community thought it would be okay to skip these by default, without that being explicitly set, tho

peternewman commented 2 years ago

If I look at some random .pem and .crt files, some do have some plain English in them too, although mostly just the example ones. Is there some reason they shouldn't be scanned automatically?

Also, what's it tripping up on in them: two-letter character combinations? Can we resolve it by just moving them to the code dictionary?

clickthisnick commented 2 years ago

ya it's a bunch of 2/3-letter things like FLE -> FILE. We started enabling codespell automatically on a bunch of repos, and people have been fixing "typos" in their testing/dummy certs and then wondering why the certs are broken/invalid

I don't think moving it to the code dictionary would work, as fle is likely a typo.

looking at my specific example the cert has a line FLE+blah and FLE is being flagged. It seems like + is a delimiter like space so FLE is considered a word, but I wonder if it should be?

peternewman commented 2 years ago

> ya it's a bunch of 2/3-letter things like FLE -> FILE. We started enabling codespell automatically on a bunch of repos, and people have been fixing "typos" in their testing/dummy certs and then wondering why the certs are broken/invalid

Oh dear. I was going to suggest something clever for hex, then realised it's base 64 so that's a non-starter.

> I don't think moving it to the code dictionary would work, as fle is likely a typo.

Yeah agreed, again if it was just hex we could do clever stuff, but it's every typo.

> looking at my specific example the cert has a line FLE+blah and FLE is being flagged. It seems like + is a delimiter like space so FLE is considered a word, but I wonder if it should be?

I think you want it to be, so you catch typos in your variables when you're doing foo+bar=baz.
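To illustrate the splitting, here is a minimal sketch. The word pattern below is an assumption made for this example (letters, digits, underscore, hyphen and apostrophe count as word characters), not necessarily the exact regex codespell uses.

```python
import re

# Assumed word pattern: anything outside this set, such as "+" or "=",
# acts as a delimiter between words.
WORD_RE = re.compile(r"[\w\-']+")


def split_words(line: str) -> list[str]:
    """Split a line into the words a spell checker would look up."""
    return WORD_RE.findall(line)


print(split_words("FLE+blah"))     # ['FLE', 'blah']
print(split_words("foo+bar=baz"))  # ['foo', 'bar', 'baz']
```

With that kind of pattern, `FLE` in a base64 line and `foo` in an expression are treated the same way, which is why the cert line gets flagged.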

I'm sort of ambivalent either way on this personally. Perhaps we should have a straight vote: :+1: or :-1: on @clickthisnick's first post in this topic as to whether we should change the default skip (when nothing is set) to include these types of files.

If we do so, we should probably make sure it logs the files it's skipping by default, so we're not silently hiding some typos.

matkoniecz commented 2 years ago

If skipping would be automatically done: would there be any way to actually scan .pem/.crt files?

I see no way to override skip in the parameters (which could be useful BTW, though the workaround of multiple codespell runs is also viable)

And codespell */**/*.crt would not scan crt file two folders deep.
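How `**` expands depends on the shell and its settings, so one possible workaround is to collect the certificate files yourself and hand the explicit list to codespell. A small sketch, where `find_cert_files` is a hypothetical helper and not part of codespell:

```python
from pathlib import Path

CERT_SUFFIXES = {".crt", ".pem"}


def find_cert_files(root: str) -> list[str]:
    """Return .crt and .pem files at any depth below root, sorted."""
    base = Path(root)
    found = [
        p for p in base.rglob("*")
        if p.is_file() and p.suffix in CERT_SUFFIXES
    ]
    # Use forward slashes so the output looks the same on every platform.
    return sorted(p.relative_to(base).as_posix() for p in found)


# Example usage: list every certificate below the current directory.
# for name in find_cert_files("."):
#     print(name)
```

The resulting paths could then be passed as file arguments to a separate codespell run.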

clickthisnick commented 2 years ago

After reading the Jupyter notebook filter issue, having to maintain and include a bunch of custom file extensions in the core product would be annoying and time consuming.

For my use case, we had a script add the codespell config to repos (via pre-commit). We can def just ignore the specific extensions we have found to be problematic in our specific environment, rather than make this tool much more complicated

clickthisnick commented 2 years ago

I'm okay with closing this issue, and saying it's up to the user to use the tool in the way they see fit, rather than editing the tool to take a non-intuitive action for each specific case

peternewman commented 2 years ago

> If skipping would be automatically done: would there be any way to actually scan .pem/.crt files?

Possibly not with how it's written currently, but we could set things up so the default skip argument was to skip those two extensions (and maybe .git)? If you then supplied any skip argument, it would be cancelled, but you could skip them manually there, as well as what you wanted to skip.
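Roughly what I have in mind, as a sketch only. `DEFAULT_SKIP` and `effective_skip` are hypothetical names for this illustration and nothing like this exists in codespell today:

```python
from typing import Optional

# Hypothetical defaults, used only when the user gives no skip argument.
DEFAULT_SKIP = ["*.pem", "*.crt", ".git"]


def effective_skip(user_skip: Optional[str]) -> list[str]:
    """Return the globs to skip for a given (possibly absent) skip argument."""
    if user_skip is None:
        # Nothing supplied: fall back to the built-in defaults.
        return list(DEFAULT_SKIP)
    # Any explicit value, even an empty one, replaces the defaults entirely.
    return [pattern for pattern in user_skip.split(",") if pattern]


print(effective_skip(None))           # the defaults
print(effective_skip("*.svg"))        # only what the user asked for
print(effective_skip("*.svg,*.pem"))  # user re-adds .pem manually
```

Under this scheme an explicit empty skip value would scan everything, certificates included, which covers the case of wanting to check them on purpose.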

> After reading the Jupyter notebook filter issue, having to maintain and include a bunch of custom file extensions in the core product would be annoying and time consuming.

Personally I wouldn't be so against it for something like this, which has a far broader usage, at least in the sense nearly everyone uses certs, but perhaps not many people scan them with Codespell. I guess we need to work out if they are extensions to codespell (i.e. special processing via a module/function when it matches a particular type of file), or using codespell in external tools.

> For my use case, we had a script add the codespell config to repos (via pre-commit). We can def just ignore the specific extensions we have found to be problematic in our specific environment, rather than make this tool much more complicated

That's great. You could also possibly look at an ignore regex to match the header, base64, footer pattern, which would still find typos elsewhere in those files.
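As a sketch of the idea in plain Python: the pattern below matches a whole header, body and footer block and removes it, so only the surrounding text is left to check. The sample text is made up, and this is a standalone illustration; if codespell applies its ignore regex line by line, a multi-line pattern like this would need adapting before it could be used there.

```python
import re

# Matches "-----BEGIN X-----" through the matching "-----END X-----",
# including everything in between (DOTALL lets "." span newlines).
PEM_BLOCK = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----.*?-----END \1-----",
    re.DOTALL,
)


def strip_pem_blocks(text: str) -> str:
    """Remove PEM blocks so only the surrounding text is spell checked."""
    return PEM_BLOCK.sub("", text)


# Dummy input: the comment has a deliberate typo, the body is not a real cert.
sample = (
    "# exmaple test certificate\n"
    "-----BEGIN CERTIFICATE-----\n"
    "FLE+blah\n"
    "-----END CERTIFICATE-----\n"
)
print(strip_pem_blocks(sample))
```

The typo in the comment line survives for the checker to find, while the `FLE+blah` body is gone.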