PHPCSStandards / PHP_CodeSniffer

PHP_CodeSniffer tokenizes PHP files and detects violations of a defined set of coding standards.
BSD 3-Clause "New" or "Revised" License

Generic/LowercasedFilename: sniff doesn't handle non-ASCII characters properly #682

Open rodrigoprimo opened 1 week ago

rodrigoprimo commented 1 week ago

Describe the bug

While working on improving code coverage for the Generic.Files.LowercasedFilename sniff (#681), I noticed that it fails to properly handle file names that contain uppercase non-ASCII characters. The sniff uses strtolower() to check whether the file name is all lowercase, and strtolower() leaves non-ASCII characters unchanged.

https://github.com/PHPCSStandards/PHP_CodeSniffer/blob/26ddb35f4684760b27ad48d3c420afb2c636cc1b/src/Standards/Generic/Sniffs/Files/LowercasedFilenameSniff.php#L51
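To illustrate the difference, here is a minimal sketch comparing strtolower() with mb_strtolower(). It assumes PHP 8.2 or later (where strtolower() is locale-insensitive), a UTF-8 encoded file name, and that the mbstring extension is loaded. It is not the sniff's code, and the variable names are made up for the example.

```php
<?php
// Illustration only, not taken from the sniff.
// Assumption: the file name is UTF-8 encoded and mbstring is available.
$fileName = "t\u{00C9}st.php"; // "tÉst.php"

// Byte-wise lowercasing: only the ASCII letters A-Z are converted.
$asciiLower = strtolower($fileName);

// Multibyte-aware lowercasing: "É" (U+00C9) becomes "é" (U+00E9).
$mbLower = mb_strtolower($fileName, 'UTF-8');

var_dump($asciiLower === $fileName); // bool(true): the sniff sees no difference
var_dump($mbLower === $fileName);    // bool(false): the uppercase "É" is detected
```

Because the strtolower() result is identical to the original name, a comparison based on it cannot flag tÉst.php.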

Code sample

<?php

To reproduce

Steps to reproduce the behavior:

  1. Create a file called tÉst.php with the code sample above.
  2. Run phpcs tÉst.php --standard=Generic --sniffs=Generic.Files.LowercasedFilename
  3. No error message is displayed.

Expected behavior

PHPCS should display the following error message:

----------------------------------------------------------------------------------
FOUND 1 ERROR AFFECTING 1 LINE
----------------------------------------------------------------------------------
 1 | ERROR | Filename "tÉst.php" doesn't match the expected filename "tést.php"
----------------------------------------------------------------------------------

Versions (please complete the following information)

Operating System: Ubuntu 24.04
PHP version: 8.3
PHP_CodeSniffer version: master
Standard: Generic
Install type: git clone


jrfnl commented 2 days ago

@rodrigoprimo Thanks for finding and reporting this issue.

While this is an interesting issue from a technical perspective, I consider it low priority unless and until end-users of PHPCS report that they are running into it.

I wonder how common it is to have non-ASCII characters in file names? I also have a gut feeling that files like that may not always be portable across operating systems, but this would need to be researched and confirmed or debunked first. If my gut feeling turns out to be correct, I can imagine non-ASCII characters in file names might deserve their own sniff (to forbid them).
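For reference, detecting non-ASCII characters in a file name does not require knowing its encoding, since any byte above 0x7F is outside ASCII. A rough sketch of such a check follows; hasNonAsciiBytes() is a hypothetical helper, not an existing PHPCS function or sniff.

```php
<?php
// Hypothetical helper for illustration; not part of PHP_CodeSniffer.
function hasNonAsciiBytes(string $fileName): bool
{
    // No "u" modifier, so the pattern is applied to raw bytes:
    // it matches any byte outside the 7-bit ASCII range (0x00-0x7F).
    return preg_match('/[^\x00-\x7F]/', $fileName) === 1;
}

var_dump(hasNonAsciiBytes('test.php'));          // bool(false)
var_dump(hasNonAsciiBytes("t\u{00C9}st.php"));   // bool(true)
```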

I also wonder how we could detect this reliably: while the file contents have an encoding, I don't know how we could figure out the encoding of the file name. I imagine the encoding might depend on the OS? File name encoding is a curiosity which I've never dug into, so I'd be very interested to hear from someone who has and who can shed more light on this.
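One possible heuristic, offered only as a sketch: treat the file name as UTF-8 when its bytes form valid UTF-8, and fall back to ASCII-only lowercasing otherwise. This does not prove what the real encoding is; it only avoids applying multibyte case folding to bytes that cannot be UTF-8. lowercaseFileName() is a hypothetical helper, and the fallback behaviour shown assumes PHP 8.2 or later.

```php
<?php
// Hypothetical helper for illustration; not part of PHP_CodeSniffer.
// Assumption: a name whose bytes are valid UTF-8 is treated as UTF-8.
// Anything else has an unknown encoding, so only ASCII letters are lowercased.
function lowercaseFileName(string $fileName): string
{
    if (function_exists('mb_check_encoding')
        && mb_check_encoding($fileName, 'UTF-8')
    ) {
        return mb_strtolower($fileName, 'UTF-8');
    }

    // Fallback: byte-wise, ASCII-only lowercasing.
    return strtolower($fileName);
}

// Valid UTF-8: "É" is lowercased to "é".
var_dump(lowercaseFileName("t\u{00C9}st.php") === "t\u{00E9}st.php"); // bool(true)

// A lone 0xC9 byte (e.g. "É" in ISO-8859-1) is not valid UTF-8 and is left as-is.
var_dump(lowercaseFileName("T\xC9st.php") === "t\xC9st.php"); // bool(true)
```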