dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.42k stars 4.75k forks source link

Add UTF8NoBom property to System.Text.Encoding class #2136

Open oliverjanik opened 4 years ago

oliverjanik commented 4 years ago

The System.Text.Encoding.UT8 encodes using BOM which is generally not recommended.

Can we have a UTF8NoBom property for convenience ?

MeikTranel commented 4 years ago

There's a constructor that lets you switch off the bom. See https://docs.microsoft.com/en-us/dotnet/api/system.text.utf8encoding.-ctor?view=netcore-3.1#System_Text_UTF8Encoding__ctor_System_Boolean_

GrabYourPitchforks commented 4 years ago

True, it's easy for somebody to construct this themselves. But a devil's advocate argument: we define an internal property UTF8NoBom in around a half-dozen places throughout our codebase. If we're doing this frequently then it stands to reason that our customers are doing it frequently as well.

MeikTranel commented 4 years ago

True. I'm in favor of this as well. It was more me making sure people immediately see that there is an easy workaround when they land on this page via google.

GrabYourPitchforks commented 4 years ago

@tarekgh Do you have any thoughts on this?

tarekgh commented 4 years ago

No objection here. I am just wondering, do we need to add properties like UnicodeNoBom and UTF32NoBom?

Minor issue, regarding naming. I think 'Bom' is ok but, is it better to use UTF8NoBOM? we did a similar thing when we used ISO and UTF. Or use the full name UTF8NoByteOrderMark.

GrabYourPitchforks commented 4 years ago

I don't know if there's much evidence that people use UnicodeEncoding or UTF32Encoding anywhere near as much as UTF8Encoding. But if we had such evidence then an accelerator on those might also make sense.

tarekgh commented 4 years ago

Unicode encoding usually used with System.IO. UTF32 is rarely used. We can wait not exposing any properties for these till we get a demand.

jkotas commented 4 years ago

we define an internal property UTF8NoBom in around a half-dozen

Most of these internal places define this property as NoBom + throwing. Do we want the encoding returned by this property to be non-throwing or throwing? Do we want to have both variants (throwing and non-throwing)?

tarekgh commented 4 years ago

I think providing both options (throwing and non-throwing) would be the best. but the priority would be for non-throwing as users will have more control.

augustoproiete commented 4 years ago

A simple UTF8NoBOM property should be non-throwing for consistency with the other properties in the Encoding class (UTF8, Unicode, BigEndianUnicode, etc...) which are all non-throwing.

My understanding is that a non-throwing instance is always used for reading (relaxed), whilst a throwing instance is always used for writing (strict), so it would be great if both options were available.

jkotas commented 4 years ago

a non-throwing instance is always used for reading

That is not correct. For example, the new JsonReader uses throwing to reject invalid UTF8 data: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Text.Json/src/System/Text/Json/Reader/JsonReaderHelper.Unescaping.cs#L39

From security point of view, it is always better to use throwing variant because of it prevents invalid data from getting through the system.

I won't disagree if the simple property is non-throwing for consistency with the other properties.

augustoproiete commented 4 years ago

the new JsonReader uses throwing to reject invalid UTF8 data

Interesting, thanks. I read the comment "The high level goal is to be tolerant of encoding errors when we read and very strict when we write" in different files in the codebase (StreamReader/StreamWriter) and assumed it was a general principle.

GrabYourPitchforks commented 4 years ago

Repathing to future because we're still not sure if including this is the right thing for the product. The workaround is really simple: instantiate the class yourself with whatever arguments you require. The only real place we'd be able to add significant value is if we were to take advantage of JIT devirtualization or some other perf optimization that folks outside the runtime couldn't gain.

oliverjanik commented 2 months ago

Keep open please, I keep adding this property to every lib/project.

Seems like a common practice https://github.com/search?q=Utf8NoBom+language%3AC%23&type=code&l=C%23