Reconsider including BOM in templates

dotnet / sdk

Core functionality needed to create .NET Core projects, that is shared between Visual Studio and CLI

https://dot.net/core

MIT License


Reconsider including BOM in templates #39187

Open richlander opened 6 months ago

richlander commented 6 months ago

It is unclear to me that there is any value in including these 3 bytes.

I wrote a quick program to demonstrate this:

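// Print the first 10 bytes of the file named by the first argument.
using FileStream file = File.Open(args[0], FileMode.Open, FileAccess.Read);

for (int i = 0; i < 10; i++)
{
    int b = file.ReadByte();
    if (b == -1) break; // end of file reached
    Console.WriteLine($"{b}; {(char)b}");
}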

What it produces:

rich@mazama:~/testbom$ dotnet run testbom.csproj 
239; ï
187; »
191; ¿
60; <
80; P
114; r
111; o
106; j
101; e
99; c
rich@mazama:~/testbom$ dotnet run Program.cs 
239; ï
187; »
191; ¿
70; F
105; i
108; l
101; e
83; S
116; t
114; r
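The three leading bytes in both runs (239, 187, 191, i.e. EF BB BF) are the UTF-8 BOM. A minimal sketch of checking for them directly, assuming a top-level-statements program; `StartsWithUtf8Bom` is a hypothetical helper name, not something from the SDK:

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: returns true when the stream begins with the UTF-8 BOM (EF BB BF).
static bool StartsWithUtf8Bom(Stream stream)
{
    // Encoding.UTF8.GetPreamble() returns the three BOM bytes.
    foreach (byte expected in Encoding.UTF8.GetPreamble())
    {
        // ReadByte returns -1 at end of stream, which never equals a BOM byte.
        if (stream.ReadByte() != expected) return false;
    }
    return true;
}

// Usage: pass a file path as the first argument.
if (args.Length > 0)
{
    using FileStream input = File.OpenRead(args[0]);
    Console.WriteLine(StartsWithUtf8Bom(input) ? "UTF-8 BOM present" : "no UTF-8 BOM");
}
```

Run against the two files above, this should report a BOM for both, since each starts with 239, 187, 191.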

What I see with cat:

See the leading space?

C# files have the same problem.

I also see the following in vim, which I use frequently for small edits.

It would be great to define guidance on whether we should include BOMs in any UTF-8 files (C#, csproj, ...) by default. Hopefully not.

Ghostbird commented 6 months ago

I think it's a bit up to the end-user. In our company, we use the standard that all text files in our repositories are UTF-8, no-BOM, LF, with a final newline at the end. I personally think that's a good standard.
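For reference, a rough sketch of producing a file that follows that convention from .NET. `WriteNormalized` and the temp file name are made up for illustration. The one detail that matters is that `Encoding.UTF8` emits a BOM, while `new UTF8Encoding(false)` does not:

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: writes lines as UTF-8 without a BOM, with LF terminators and a final newline.
static void WriteNormalized(string path, string[] lines)
{
    // Passing false disables the UTF-8 identifier (BOM); Encoding.UTF8 would emit it.
    var utf8NoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

    // Join with "\n" and append one more "\n" so the last line is terminated too.
    string text = string.Join("\n", lines) + "\n";
    File.WriteAllText(path, text, utf8NoBom);
}

// Usage with a hypothetical temp file; prints the raw bytes as hex for inspection.
string path = Path.Combine(Path.GetTempPath(), "normalized-example.txt");
WriteNormalized(path, new[] { "first line", "second line" });
Console.WriteLine(BitConverter.ToString(File.ReadAllBytes(path)));
```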

richlander commented 6 months ago

"UTF-8, no-BOM, LF, with a final newline at the end"

Are you saying that your files start with the linefeed character? Can you elaborate on that?

Ghostbird commented 6 months ago

Apologies for the confusion. I meant that our files use a linefeed character as line terminator.

bjornen77 commented 3 months ago

I think that it is good to use utf-8-bom as the default in template code files for C#, VB and F#. The reasoning behind this is that Visual Studio (17.10.1) might otherwise use the "wrong" encoding (Windows-1252, for example). I think that the default behaviour in VS should be changed to use utf-8 if the BOM is missing. But as long as this is not the case, having the BOM is good for the following reasons:

1/ When a template code file that does not have a BOM is opened in Visual Studio, it does not default to utf8. This will cause Visual Studio to raise the following error if characters that cannot be saved using the current code page are added: https://github.com/dotnet/test-templates/issues/358

2/ More importantly, there is a possibility that your program behaves differently on different systems if the file is not saved using utf8 or utf8-bom. https://github.com/dotnet/test-templates/issues/358

Also see this comment: https://github.com/dotnet/format/issues/1893#issuecomment-1946428275

In general, I think that using utf-8-bom for template code files is the best option considering Visual Studio's current encoding behaviour.
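To illustrate the general failure mode (not the exact scenario in the linked issues), here is a small sketch of what happens when UTF-8 bytes are decoded with a single-byte code page. Assumption: `Encoding.Latin1` (.NET 5+) stands in for Windows-1252 here, since the two agree for the bytes used in this example and Latin-1 needs no extra encoding provider:

```csharp
using System;
using System.Text;

// "\u00E9" is 'é'; written as an escape so the example does not depend on this file's own encoding.
// In UTF-8 it is two bytes: 0xC3 0xA9.
byte[] utf8Bytes = new UTF8Encoding(false).GetBytes("\u00E9");

// Assumption: Latin-1 is used as a stand-in for Windows-1252 (they match for 0xC3 and 0xA9).
string misread = Encoding.Latin1.GetString(utf8Bytes);
string correct = Encoding.UTF8.GetString(utf8Bytes);

Console.WriteLine($"bytes:              {BitConverter.ToString(utf8Bytes)}");
Console.WriteLine($"decoded as UTF-8:   {correct} ({correct.Length} char)");
Console.WriteLine($"decoded as Latin-1: {misread} ({misread.Length} chars)");
```

The same two bytes come back as one character under UTF-8 and as two different characters under the single-byte code page, which is why a tool that guesses the wrong encoding can silently change string contents.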

richlander commented 2 months ago

From the Unicode spec.

bjornen77 commented 2 months ago

Visual Studio by default chooses the "wrong" encoding when opening template files stored without a BOM. This will lead to several problems (https://github.com/dotnet/sdk/issues/39187#issuecomment-2147146329).

I think that the BOM helps Visual Studio to "guess" the correct encoding. Using the BOM as a signature seems ok according to the specification:

"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"

If Visual Studio changes to always default to UTF-8, omitting the BOM would be fine. But until then, keeping the BOM would be best.
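As an analogy for the "BOM as signature" idea, .NET's own `StreamReader` can be told to fall back to a given encoding unless a BOM says otherwise. This is only a sketch of that behaviour and says nothing about how Visual Studio's detection is actually implemented; `ReadWithBomDetection` is a hypothetical helper, and Latin-1 is again an assumed stand-in for a legacy code page:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper: decodes bytes with a Latin-1 fallback, letting a BOM (if present) override it.
static string ReadWithBomDetection(byte[] content)
{
    using var reader = new StreamReader(
        new MemoryStream(content),
        Encoding.Latin1,                         // fallback used when no BOM is found
        detectEncodingFromByteOrderMarks: true); // a BOM at the start switches the decoder
    return reader.ReadToEnd();
}

// "caf\u00E9" is "café", escaped so the example does not depend on this file's own encoding.
byte[] body = new UTF8Encoding(false).GetBytes("caf\u00E9");
byte[] withBom = Encoding.UTF8.GetPreamble().Concat(body).ToArray();

Console.WriteLine($"with BOM:    {ReadWithBomDetection(withBom)}");
Console.WriteLine($"without BOM: {ReadWithBomDetection(body)}");
```

With the BOM, the reader switches to UTF-8 and strips the signature; without it, the same bytes are decoded with the fallback and the accented character is garbled.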

Ghostbird commented 2 months ago

"From the Unicode spec."

A bit off-topic, but keep in mind that an image is not strictly readable. I've spent a few minutes baffled why you only commented: "From the Unicode spec." and nothing else. I only later realised that you'd attached an image containing the text. I'm not (substantially) vision impaired, but my default e-mail set-up is plain-text and doesn't render images. Some people will not have the option to read images.

@bjornen77 Yeah, I think that this is the way to go. The templates should probably be most accessible to newcomers who expect a tutorial written for Visual Studio to work. For me, fixing a repo that was generated with the wrong BOM usage is just a single command anyway.
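For anyone who would rather do that fix from .NET than from a shell, a minimal sketch of stripping a leading UTF-8 BOM from one file. This is not the command referred to above; `StripUtf8Bom` and the temp file name are hypothetical:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper: removes a leading UTF-8 BOM if present.
// Returns true when the file was rewritten, false when it had no BOM.
static bool StripUtf8Bom(string path)
{
    byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF
    byte[] bytes = File.ReadAllBytes(path);

    bool hasBom = bytes.Length >= bom.Length && bytes.Take(bom.Length).SequenceEqual(bom);
    if (!hasBom) return false;

    // Rewrite the file with everything after the BOM, byte for byte.
    File.WriteAllBytes(path, bytes.Skip(bom.Length).ToArray());
    return true;
}

// Usage with a hypothetical temp file; Encoding.UTF8 makes WriteAllText emit a BOM.
string bomFile = Path.Combine(Path.GetTempPath(), "bom-example.cs");
File.WriteAllText(bomFile, "class C { }\n", Encoding.UTF8);

bool firstPass = StripUtf8Bom(bomFile);  // file had a BOM
bool secondPass = StripUtf8Bom(bomFile); // BOM already removed
Console.WriteLine($"first pass rewrote file: {firstPass}, second pass rewrote file: {secondPass}");
```

Applying it across a repository would just be a loop over `Directory.EnumerateFiles` with whatever file patterns the project uses.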