DotNetAnalyzers / StyleCopAnalyzers

An implementation of StyleCop rules using the .NET Compiler Platform
MIT License

SA1412x rule for UTF-8 without BOM #2100

Open codybaxter opened 8 years ago

codybaxter commented 8 years ago

It would be nice to have a rule similar to SA1412 that instead checks that a file is encoded as UTF-8 without a BOM.

sharwell commented 8 years ago

This rule is problematic in today's world of programming, where different developers are increasingly likely to have different default system encodings. Often, a document starts out containing only characters which are encoded the same way in multiple encodings (e.g. much English-language text is the same in Windows-1252 and UTF-8). While the document contains valid UTF-8 content, nothing stops Visual Studio (or some other editor) from opening the file as Windows-1252.

The suggested rule would catch cases where the document content contains byte sequences that are not valid UTF-8 sequences, but it would generally fail to instruct editors to open most documents as UTF-8. It would also fail to recognize cases where an editor saved a file in another encoding which is technically a valid UTF-8 byte sequence but did not preserve the original meaning of the characters.
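To illustrate the two failure modes described above, here is a minimal Python sketch. It is not taken from StyleCopAnalyzers; the sample strings are made up for illustration. It shows that ASCII-only text is byte-identical in both encodings, that UTF-8 bytes opened as Windows-1252 decode without error but with a different meaning, and that re-saving the misread text yields a perfectly valid UTF-8 file.

```python
# ASCII-only source text has identical bytes in UTF-8 and Windows-1252,
# so nothing in the file tells an editor which encoding was intended.
ascii_text = "public class Foo { }"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("cp1252")

# Non-ASCII text saved as UTF-8 without a BOM...
original = "café"
utf8_bytes = original.encode("utf-8")      # b'caf\xc3\xa9'

# ...opened as Windows-1252 decodes without error, but the meaning changes.
misread = utf8_bytes.decode("cp1252")      # 'cafÃ©'

# If the editor then saves the misread text as UTF-8, the file is still
# valid UTF-8, so a pure validity check cannot detect the damage.
resaved = misread.encode("utf-8")
roundtrip = resaved.decode("utf-8")        # decodes cleanly, still 'cafÃ©'

# The case a validity check does catch: text saved as Windows-1252.
cp1252_bytes = original.encode("cp1252")   # b'caf\xe9'
try:
    cp1252_bytes.decode("utf-8")
    cp1252_is_valid_utf8 = True
except UnicodeDecodeError:
    cp1252_is_valid_utf8 = False
```

Only the last case is something a "valid UTF-8, no BOM" check could flag; the mojibake round trip passes it unnoticed.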

The primary reason why it would make sense to leave out a BOM would be cases where editors, compilers, and/or libraries fail to correctly handle a BOM (notably the IO framework for Java passes the BOM through as a character). This is not the case in the world of C# code.
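The "passes the BOM through as a character" behaviour is easy to reproduce outside Java. As a rough analogy (a Python sketch, not the Java IO code mentioned above), Python's plain `utf-8` codec also keeps the BOM as a leading U+FEFF character, while the BOM-aware `utf-8-sig` codec strips it:

```python
import codecs

# A UTF-8 file that starts with a BOM (bytes EF BB BF).
data = codecs.BOM_UTF8 + "class Foo { }".encode("utf-8")

# The plain UTF-8 decoder keeps the BOM as the character U+FEFF.
plain = data.decode("utf-8")

# The BOM-aware decoder strips it before returning the text.
aware = data.decode("utf-8-sig")

bom_passed_through = plain.startswith("\ufeff")
```

A consumer that does not expect the extra leading character sees it as part of the content, which is the kind of tooling problem that motivates leaving the BOM out.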

I would vote :-1: on the ability to enforce policies that remove a BOM, as it directly works against the more important goal of supporting developers working in a variety of local cultures.

henrygab commented 5 years ago

Should this be reconsidered? Examples of changes since last comments:

  1. Windows 10 now defaults to excluding the byte order mark for UTF-8 text files.
  2. Few (if any?) Linux variants default to including the UTF-8 BOM for text files.
  3. Visual Studio 2017 (and later?) defaults to excluding the UTF-8 BOM for new class files.

Therefore, it seems a rule that ensures UTF-8 files exclude the BOM would be beneficial, especially where a project uses or creates files that are also used by (for example) Java projects.
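For reference, the check being requested could look roughly like the following. This is a standalone Python sketch, not an analyzer implementation; the function names and the `*.cs` default pattern are hypothetical. A file passes only if it has no leading UTF-8 BOM and its bytes decode as UTF-8.

```python
import codecs
from pathlib import Path


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def is_utf8_without_bom(data: bytes) -> bool:
    """Return True if the bytes are valid UTF-8 and carry no leading BOM."""
    if has_utf8_bom(data):
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def files_violating(root: str, pattern: str = "*.cs") -> list[Path]:
    """List files under root (hypothetical layout) that fail the check."""
    return [
        path
        for path in sorted(Path(root).rglob(pattern))
        if not is_utf8_without_bom(path.read_bytes())
    ]
```

Note that this only validates the bytes; as pointed out earlier in the thread, it cannot tell whether an editor will actually open the file as UTF-8.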